Deep diving into a major mode - Part 1

Tue, Oct 3, 2017 emacs

I've taken up maintaining hy-mode - a major mode for Hy, a lispy Python.

I narrate working through specific problems in auto-completion, indentation, shell integration, and so on.

This post touches on: syntax, indentation, font-locking, and context-sensitive syntax.

All code snippets require the Emacs packages dash and s.

Syntax Tables

The first step in a major mode is the syntax table.

In any major mode, run describe-syntax to see its syntax table. Since Hy is a lisp, we start from a copy of lisp-mode's syntax table.

(defconst hy-mode-syntax-table
  (-let [table
         (copy-syntax-table lisp-mode-syntax-table)]
    ;; syntax modifications...
    table)
  "Hy mode's syntax table.")

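The syntax table isn't set explicitly. Its name, hy-mode-syntax-table, follows the <mode>-syntax-table convention that define-derived-mode looks for, so it is installed for hy-mode automatically.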
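
As a minimal sketch of that naming convention, assuming the mode is defined with define-derived-mode (demo-lispy-mode and its table are hypothetical names, not part of hy-mode):

```elisp
;; Hypothetical mode, used only to illustrate the naming convention.
(defconst demo-lispy-mode-syntax-table
  (let ((table (copy-syntax-table lisp-mode-syntax-table)))
    (modify-syntax-entry ?\[ "(]" table)
    (modify-syntax-entry ?\] ")[" table)
    table)
  "Syntax table picked up by `demo-lispy-mode' through its name.")

;; No :syntax-table argument is given: define-derived-mode looks for a
;; variable named <mode>-syntax-table and installs it when the mode runs.
(define-derived-mode demo-lispy-mode prog-mode "Demo"
  "Hypothetical major mode for illustration.")

(defun demo-lispy-mode-uses-own-table-p ()
  "Return non-nil if enabling the mode installs the table defined above."
  (with-temp-buffer
    (demo-lispy-mode)
    (eq (syntax-table) demo-lispy-mode-syntax-table)))
```

Calling (demo-lispy-mode-uses-own-table-p) is a quick way to confirm the table was found by name rather than silently replaced by a default one.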

Configuration is performed with modify-syntax-entry; its docstring lists all the syntax classes we can pick from.

A subset to be familiar with:

( ) : open/close parenthesis. These are for all bracket-like constructs such as [ ] or { }. The first character of the descriptor is the syntax class, namely "(" or ")", and the second character is the matching delimiter.

(modify-syntax-entry ?\{ "(}" table)
(modify-syntax-entry ?\} "){" table)
(modify-syntax-entry ?\[ "(]" table)
(modify-syntax-entry ?\] ")[" table)

' : prefix character. Prefixes a symbol/word.

;; Quote characters are prefixes
(modify-syntax-entry ?\~ "'" table)
(modify-syntax-entry ?\@ "'" table)

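_ and w : symbol and word constituent respectively. The trailing p in the descriptors is a flag that additionally marks the character as a prefix character.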

;; "," is a symbol in Hy, namely the tuple constructor
(modify-syntax-entry ?\, "_ p" table)

;; "|" is a symbol in hy, naming the or operator
(modify-syntax-entry ?\| "_ p" table)

;; "#" is a tag macro, we include # in the symbol
(modify-syntax-entry ?\# "_ p" table)

| : generic string fence. A more general string quote syntax class.

Used for delimiting multi-line strings, like triple-quoted strings in Python. I go into depth on this construct in the "context-sensitive syntax" section.
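
One way to check that entries like these took effect is to query the table directly. The sketch below repeats a few of the entries in a hypothetical demo-syntax-table (the demo- names are mine, and plain let is used so that dash is not needed):

```elisp
(defconst demo-syntax-table
  (let ((table (copy-syntax-table lisp-mode-syntax-table)))
    (modify-syntax-entry ?\{ "(}" table)
    (modify-syntax-entry ?\} "){" table)
    (modify-syntax-entry ?\[ "(]" table)
    (modify-syntax-entry ?\] ")[" table)
    (modify-syntax-entry ?\, "_ p" table)
    table)
  "Hypothetical table repeating a few of the entries above.")

(defun demo-syntax-class (char)
  "Return the syntax class designator of CHAR under `demo-syntax-table'."
  (with-syntax-table demo-syntax-table
    (char-syntax char)))

(defun demo-end-of-first-sexp (text)
  "Insert TEXT in a temporary buffer and return point after its first sexp."
  (with-temp-buffer
    (set-syntax-table demo-syntax-table)
    (insert text)
    (goto-char (point-min))
    (forward-sexp)
    (point)))

(demo-syntax-class ?\{)                    ; ?\( i.e. open parenthesis
(demo-syntax-class ?\,)                    ; ?_ i.e. symbol constituent
(demo-end-of-first-sexp "{a [b c] d} e")   ; 12, just past the closing brace
```

Because braces and brackets are now paired delimiters, forward-sexp treats the whole {...} group as a single expression.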

Indentation

Look through calculate-lisp-indent, the indentation workhorse of lisp-mode derivatives, and it quickly becomes clear that indentation is hard.

Indentation is set with indent-line-function.

In the case of a lisp, we actually do:

(setq-local indent-line-function 'lisp-indent-line)
(setq-local lisp-indent-function 'hy-indent-function)

The real work is performed by calculate-lisp-indent, which calls lisp-indent-function with an indent-point and a parse state.
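
To make the calling convention concrete, here is a deliberately naive sketch. The rule (always two columns past the innermost open paren) and the demo- names are assumptions for illustration only, not what hy-mode does:

```elisp
(defun demo-indent-function (_indent-point state)
  "Indent two columns past the innermost open paren recorded in STATE.
A deliberately naive rule, used only to show the calling convention."
  (save-excursion
    (goto-char (nth 1 state))   ; start of the innermost containing sexp
    (+ 2 (current-column))))

(defun demo-indent-string (text)
  "Return TEXT reindented by `lisp-indent-line' using the rule above."
  (with-temp-buffer
    (lisp-mode)
    (setq-local indent-line-function #'lisp-indent-line)
    (setq-local lisp-indent-function #'demo-indent-function)
    (insert text)
    (indent-region (point-min) (point-max))
    (buffer-string)))

(demo-indent-string "(foo bar\nbaz)")   ; "(foo bar\n  baz)"
```

The function only has to return a column; calculate-lisp-indent and lisp-indent-line take care of finding the line and moving the text.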

At the heart of it is parse-partial-sexp, which takes limiting points and returns a 10-element list describing the syntactic state at the end point.

As this is a (necessarily) large amount of information, I recommend doing what many other modes have done: define some aliases. I have:

(defun hy--sexp-inermost-char (state) (nth 1 state))
(defun hy--start-of-last-sexp (state) (nth 2 state))
(defun hy--in-string? (state) (nth 3 state))
(defun hy--start-of-string (state) (nth 8 state))

Note that you can also omit state and call syntax-ppss instead, which returns the state of running parse-partial-sexp from point-min to the current point. The caveat is that elements 2 and 6 of its result (the start of the last complete sexp and the minimum paren depth) aren't reliable. I prefer to pass the state manually.
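
To see what these elements hold, here is a small sketch that parses a string of unfinished Lisp and reads the state at its end (demo-parse-state is a hypothetical helper, not part of hy-mode):

```elisp
(defun demo-parse-state (text)
  "Parse TEXT as Lisp and return the `parse-partial-sexp' state at its end."
  (with-temp-buffer
    (set-syntax-table lisp-mode-syntax-table)
    (insert text)
    (parse-partial-sexp (point-min) (point-max))))

(let ((state (demo-parse-state "(foo (bar \"baz")))
  (list (nth 0 state)     ; paren depth: 2
        (nth 1 state)     ; start of innermost containing sexp: 6
        (nth 3 state)     ; non-nil, we are inside a string
        (nth 8 state)))   ; start of that string: 11
```

Elements 1, 3 and 8 are exactly what the aliases above wrap.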

These are the building blocks for indentation - we can then write utilities to better get our head around indentation like:

(defun hy--prior-sexp? (state)
  (number-or-marker-p (hy--start-of-last-sexp state)))

The indent function

The three cases:

;; Normal Indent
(normal b
        c)
(normal
  b c)

;; Special Forms
(special b
  c)

;; List-likes
[a b
 c]

Hy's current indent function:

(defun hy-indent-function (indent-point state)
  "Indent at INDENT-POINT where STATE is `parse-partial-sexp' for INDENT-POINT."
  (goto-char (hy--sexp-inermost-char state))

  (if (hy--not-function-form-p)
      (1+ (current-column))  ; Indent after [, {, ... is always 1
    (forward-char 1)  ; Move to start of sexp

    (cond ((hy--check-non-symbol-sexp (point))  ; Comma tuple constructor
           (+ 2 (current-column)))

          ((hy--find-indent-spec state)  ; Special form uses fixed indentation
           (1+ (current-column)))

          (t
           (hy--normal-indent calculate-lisp-indent-last-sexp)))))

When we indent, we jump to the innermost opening character of the containing sexp, i.e. "(", "[", "{", etc.

If that character opens a list-like, the indent is its column plus one and we are done.

Otherwise we move to the first element of the sexp and check whether it is a symbol with (thing-at-point 'symbol). If it is, we check it against a list of special forms like when, do, defn. If we find a (possibly fuzzy) match, we indent the same way regardless of whether the first line contains args or not.
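
The special-form lookup itself is not shown here, so the following is only a sketch of what an exact-plus-fuzzy match could look like. The two lists, their contents, and demo-special-form-p are hypothetical and are not hy--find-indent-spec:

```elisp
(require 'cl-lib)

(defconst demo-indent-exact '("when" "unless" "do" "fn")
  "Hypothetical special forms that must match by exact name.")

(defconst demo-indent-fuzzy '("def" "with")
  "Hypothetical prefixes; any symbol starting with one counts as special.")

(defun demo-special-form-p (name)
  "Return t if NAME, a symbol name or nil, should use fixed indentation."
  (and (stringp name)
       (or (member name demo-indent-exact)
           (cl-some (lambda (prefix) (string-prefix-p prefix name))
                    demo-indent-fuzzy))
       t))

(demo-special-form-p "when")       ; t, exact match
(demo-special-form-p "defclass")   ; t, fuzzy match on the "def" prefix
(demo-special-form-p "print")      ; nil, falls through to normal indent
(demo-special-form-p nil)          ; nil, e.g. the form starts with a list
```

The nil case matters because thing-at-point returns nil when the first element of the sexp is not a symbol.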

(defun hy--normal-indent (last-sexp)
  "Determine normal indentation column of LAST-SEXP.

Example:
 (a (b c d
       e
       f))

1. Indent e => start at d -> c -> b.
Then backward-sexp will throw an error trying to jump to a.
Observe 'a' need not be on the same line as the ( will cause a match.
Then we determine indentation based on whether there is an arg or not.

2. Indenting f will go to e.
Now since there is a prior sexp d but we have no sexps-before on same line,
the loop will terminate without error and the prior lines indentation is it."
  (goto-char last-sexp)
  (-let [last-sexp-start nil]
    (if (ignore-errors
          (while (hy--anything-before? (point))
            (setq last-sexp-start (prog1
                                      ;; Indentation should ignore quote chars
                                      (if (-contains? '(?\' ?\` ?\~)
                                                      (char-before))
                                          (1- (point))
                                        (point))
                                    (backward-sexp))))
          t)
        (current-column)
      (if (not (hy--anything-after? last-sexp-start))
          (1+ (current-column))
        (goto-char last-sexp-start)  ; Align with function argument
        (current-column)))))

Normal indent does the most work. Note that if we are on the next line without a function arg above, then last-sexp-start stays nil: backward-sexp throws an error before the setq completes.

If there is a function call above, then the current-column of the innermost non-opening sexp ends up as the indent.

If we indent the line holding the function call itself, it jumps to the containing sexp and calculates its indent.

Other indentation functions are a bit more advanced in that they track the number of prior sexps in the indent-function to distinguish between e.g. the then and else clauses of an if form. Those cases use the same fundamentals that are seen here.

Developing indentation from scratch can be challenging. The approach I took was to look at Clojure's indentation and trim it down until it fit this language. I've removed most of the extraneous details that it adds to handle special rules for e.g. clojure.spec, but it is still possible that I could trim this further.

Font Locks and Highlighting

Font locking has two entry points to be aware of, which in hy-mode are hy-font-lock-kwds and hy-font-lock-syntactic-face-function.

(setq font-lock-defaults
        '(hy-font-lock-kwds
          nil nil
          (("+-*/.<>=!?$%_&~^:@" . "w"))  ; syntax alist
          nil
          (font-lock-mark-block-function . mark-defun)
          (font-lock-syntactic-face-function  ; Differentiates (doc)strings
           . hy-font-lock-syntactic-face-function)))

Font lock keywords

There exist many posts on modifying the variable font-lock-keywords.

The approach taken in hy-mode is to separate out the language by category:

(defconst hy--kwds-constants
  '("True" "False" "None" "Ellipsis" "NotImplemented")
  "Hy constant keywords.")

(defconst hy--kwds-defs
  '("defn" "defun"
    "defmacro" "defmacro/g!" "defmacro!"
    "defreader" "defsharp" "deftag")
  "Hy definition keywords.")

(defconst hy--kwds-operators
  '("!=" "%" "%=" "&" "&=" "*" "**" "**=" "*=" "+" "+=" "," "-"
    "-=" "/" "//" "//=" "/=" "<" "<<" "<<=" "<=" "=" ">" ">=" ">>" ">>="
    "^" "^=" "|" "|=" "~")
  "Hy operator keywords.")

;; and so on

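We then use the amazing rx macro for constructing the regexes.

Since rx is a macro that treats its arguments as regex forms rather than evaluating them, a variable holding a list of keywords cannot be used inside it directly. To use variable definitions in the regex construction we call rx-to-string with a backquoted form instead.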
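
A small sketch of the difference, using a hypothetical keyword list. The symbol-start and symbol-end anchors are my addition so that only whole symbols match; they are not part of the hy-mode definitions:

```elisp
(defconst demo-constants '("True" "False" "None")
  "Hypothetical keyword list held in a variable.")

;; rx-to-string is an ordinary function, so the backquote can splice the
;; value of the variable into the form before the regex is built.
(defconst demo-constants-regex
  (rx-to-string `(: symbol-start (or ,@demo-constants) symbol-end))
  "Regex built at load time from the value of `demo-constants'.")

(string-match-p demo-constants-regex "(print None)")       ; 7
(string-match-p demo-constants-regex "(print Nonesuch)")   ; nil
```

Adding a new constant then only means extending the list; the regex follows when the file is reloaded.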

The simplest definition:

(defconst hy--font-lock-kwds-constants
  (list
   (rx-to-string
    `(: (or ,@hy--kwds-constants)))

   '(0 font-lock-constant-face))

  "Hy constant keywords.")

A more complex example with multiple groups taking different faces:

(defconst hy--font-lock-kwds-defs
  (list
   (rx-to-string
    `(: (group-n 1 (or ,@hy--kwds-defs))
        (1+ space)
        (group-n 2 (1+ word))))

   '(1 font-lock-keyword-face)
   '(2 font-lock-function-name-face nil t))

  "Hy definition keywords.")

Of course not all highlighting constructs are determined by symbol name. We can highlight the shebang line for instance as:

(defconst hy--font-lock-kwds-shebang
  (list
   (rx buffer-start "#!" (0+ not-newline) eol)

   '(0 font-lock-comment-face))

  "Hy shebang line.")

We then collect all our nice and modular font locks as hy-font-lock-kwds that we set earlier:

(defconst hy-font-lock-kwds
  (list hy--font-lock-kwds-constants
        hy--font-lock-kwds-defs
        ;; lots more ...
        hy--font-lock-kwds-shebang)

  "All Hy font lock keywords.")

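To check that a rule highlights what you expect, it can help to fontify a string in a temporary buffer and read the face back. This is a sketch under the assumption that font-lock-ensure is available (Emacs 25 or later); the keyword list and the demo- names are hypothetical:

```elisp
(defconst demo-font-lock-kwds
  (list
   (list (rx symbol-start
             (group-n 1 (or "defn" "defmacro"))
             (1+ space)
             (group-n 2 (1+ word)))
         '(1 font-lock-keyword-face)
         '(2 font-lock-function-name-face nil t)))
  "Hypothetical keyword list with a single two-group rule.")

(defun demo-face-at (text pos)
  "Fontify TEXT with `demo-font-lock-kwds' and return the face at POS."
  (with-temp-buffer
    (insert text)
    (setq-local font-lock-defaults '(demo-font-lock-kwds))
    (font-lock-ensure)
    (get-text-property pos 'face)))

(demo-face-at "(defn foo [x] x)" 2)    ; font-lock-keyword-face
(demo-face-at "(defn foo [x] x)" 7)    ; font-lock-function-name-face
(demo-face-at "(defn foo [x] x)" 12)   ; nil, no rule matches here
```

The same helper works for any of the modular keyword lists, which makes it a cheap basis for regression tests.
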
Syntactic face function

This function is typically used for distinguishing between strings, docstrings, and comments. It does not need to be set unless you want to distinguish docstrings.

(defun hy--string-in-doc-position? (state)
  "Is STATE within a docstring?"
  (if (= 1 (hy--start-of-string state))  ; Identify module docstring
      t
    (-when-let* ((first-sexp (hy--sexp-inermost-char state))
                 (function (save-excursion
                             (goto-char (1+ first-sexp))
                             (thing-at-point 'symbol))))
      (s-matches? (rx "def" (not blank)) function))))  ; "def"=="setv"

(defun hy-font-lock-syntactic-face-function (state)
  "Return syntactic face function for the position represented by STATE.
STATE is a `parse-partial-sexp' state, and the returned function is the
Lisp font lock syntactic face function. String is shorthand for either
a string or comment."
  (if (hy--in-string? state)
      (if (hy--string-in-doc-position? state)
          font-lock-doc-face
        font-lock-string-face)
    font-lock-comment-face))

It is rather straightforward - we start out within either a string or a comment. If needed, we jump to the first sexp and see if it is a "def-like" symbol, in which case we know it's a docstring.

This implementation isn't perfect, as any string with a parent def-sexp will use the doc-face. If your function returns a raw string, it will be highlighted as if it's a docstring.

Context sensitive syntax

An advanced feature Emacs enables is context-sensitive syntax. Some examples are Python's multi-line strings, which are delimited by three quote characters in a row, or Haskell's multi-line comments.

Hy implements multiline string literals for automatically escaping quote characters. The syntax is #[optional-delim[the-string]optional-delim] where the string can span lines.

In order to identify and treat the bracket as a string, we look to setting the syntax-propertize-function.

It takes two arguments, the start and end points of the region to search through. syntax.el handles the internals: it limits the region, passes start and end, and applies or removes the text properties as the construct changes.

(defun hy--match-bracket-string (limit)
  "Search forward for a bracket string literal."
  (re-search-forward
   (rx "#["
       (0+ not-newline)
       "["
       (group (1+ (not (any "]"))))
       "]"
       (0+ not-newline)
       "]")
   limit
   t))

(defun hy-syntax-propertize-function (start end)
  "Implements context sensitive syntax."
  (save-excursion
    (goto-char start)

    ;; Start goes to current line, need to go to char-before the #[ block
    (when (nth 1 (syntax-ppss))
      (goto-char (- (hy--sexp-inermost-char (syntax-ppss)) 2)))

    (while (hy--match-bracket-string end)
      (put-text-property (1- (match-beginning 1)) (match-beginning 1)
                         'syntax-table (string-to-syntax "|"))

      (put-text-property (match-end 1) (1+ (match-end 1))
                         'syntax-table (string-to-syntax "|")))))

We go to start and, if it lies inside a sexp, jump to just before the innermost containing sexp begins, minus two for the hash sign and bracket characters.

While the regex matches a bracket string, we set the innermost brackets on both sides to have the string-fence syntax.

Once the syntax is set, parse-partial-sexp, and in particular font-lock-mode and indent-line, recognize that block as a string, so proper indentation and highlighting follow immediately. And when we modify the brackets, the string-fence syntax is removed and the text behaves as expected.

This function can handle any kind of difficult syntactic construct. For instance, I could modify it to only work if the delimiters on both sides of the bracket string are the same. I could also associate some arbitrary, custom text property that other parts of hy-mode interact with.
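
As a sketch of the same-delimiter idea, a regex with a back-reference can require the closing delimiter to repeat the opening one. This is an assumption about how one might do it, not what hy-mode currently does, and the demo- names are hypothetical:

```elisp
(defconst demo-bracket-string-regex
  (rx "#["
      (group-n 1 (0+ (not (any "[" "]" "\n"))))   ; optional delimiter
      "["
      (group-n 2 (*? anything))                   ; string body, non-greedy
      "]"
      (backref 1)                                 ; same delimiter again
      "]")
  "Hypothetical regex requiring matching bracket-string delimiters.")

(defun demo-bracket-string-body (text)
  "Return the body of the first bracket string in TEXT, or nil."
  (when (string-match demo-bracket-string-regex text)
    (match-string 2 text)))

(demo-bracket-string-body "#[FOO[a ] b]FOO]")   ; "a ] b"
(demo-bracket-string-body "#[[plain]]")         ; "plain"
(demo-bracket-string-body "#[FOO[abc]BAR]")     ; nil
```

A side effect of anchoring on the delimiter is that the body may contain a lone closing bracket, as in the first call.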

Note that there is the macro syntax-propertize-rules for automating the searching and put-text-property portions. I prefer to do the searching and application manually because it gives me more flexibility and makes it easier to step through the trace.
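
For comparison, a rule-based counterpart might look roughly like this. It is a sketch rather than a drop-in replacement: it skips the step of backing up to the start of an enclosing bracket string, and the demo- names are hypothetical:

```elisp
(defconst demo-syntax-propertize-function
  (syntax-propertize-rules
   ((rx "#["
        (0+ not-newline)
        (group "[")                  ; group 1: opening fence
        (1+ (not (any "]")))
        (group "]")                  ; group 2: closing fence
        (0+ not-newline)
        "]")
    (1 "|")
    (2 "|")))
  "Hypothetical rule-based counterpart to the manual function above.")

(defun demo-in-string-p (text pos)
  "Return non-nil if POS in TEXT is inside a string after propertizing."
  (with-temp-buffer
    (insert text)
    (setq-local syntax-propertize-function demo-syntax-propertize-function)
    (syntax-propertize (point-max))
    (nth 3 (syntax-ppss pos))))

(demo-in-string-p "#[[hello \"world\"]]" 10)     ; t, inside the bracket string
(demo-in-string-p "#[[hello \"world\"]] x" 20)   ; nil, past the closing fence
```

Element 3 of the parse state is t rather than a quote character here because the string was opened by a generic string fence.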

Closing

Building a major mode teaches a lot about how Emacs works. I'm sure I've made errors, but so far this has been enough to get hy-mode up and running. The difference in productivity in Hy I've enjoyed since taking maintainership has made the exercise more than worth it.

I also have auto-completion and shell/process integration working which I'll touch on in future posts.