Advanced Syntax Highlighting - Variable Assignments

Fri, Apr 13, 2018 lisp, emacs

Variable assignments in Emacs lisp are not highlighted.

That is, (setq foo bar) will not apply font-lock-variable-name-face to foo. This case is easy enough to implement via regex.

But in Emacs lisp, assignment is variadic, accepting alternating name-value pairs.

(setq foo bar
      foo (more stuff) foo bar)

How can we hope to highlight all the foo?

In this post I demonstrate syntax-traversing highlighting, as applied to variable assignments.

/img/setv-highlighting-example.png

Plus and minus denote the name/value pairs. Names that are forms are not highlighted, as desired.

In this post I reference a setv instead of setq. Snippets depend on dash and smartparens.

Syntax Highlighting

Intro

Font locks are widely used in Emacs, responsible for most syntax highlighting.

Typically regexes and faces are paired together. The single pair case can be handled by adding the following list to font-lock-keywords.

(setq setv-rgx
      (rx symbol-start "setv" symbol-end (1+ space) (group (1+ word))))
(setq setv-font-lock-kwd
      (list setv-rgx '(1 font-lock-variable-name-face)))

See <a href='/post/major-mode-part-1/'>my post on writing a major mode</a> for more examples of font-locking.

For most all cases, regexes are sufficient.

Recursive Highlighting

Lets look at the smaller case of highlighting every other word occuring after setv in:

(setv foo bar foo bar foo bar)

We have an indeterminate number of alternating pairs.

For this usecase, font-lock-keywords exposes something called "match anchors". That is, we anchor on some initial match, the setv, and then repeatedly try another match.

Lets update our font lock keyword to match every other word:

(setq every-other-word-rgx
      (rx (1+ word) (1+ not-wordchar) (group (1+ word))))

(add-to-list 'setv-font-lock-kwd
             (list every-other-word-rgx
                   nil nil
                   '(1 font-lock-variable-name-face)))

Now we have every other word highlighting.

Multiline Highlighting

Lets extend our example to span multiple lines:

(setv foo bar foo bar foo bar
      foo bar)

Font lock mode doesn't run over multiple lines by default. We can tell font lock mode to allow line-spanning highlights by setting font-lock-multiline to true.

But this isn't enough. How would it know when to stop finding every other word?

The two nil values in our keyword were for the PRE-MATCH-FORM and POST-MATCH-FORM. These allow extra flexibility over regexes for moving the point around during the traversal. Additionally, the PRE-MATCH-FORM can return a point, which is used as the limit to the anchor.

So lets define a trivial pre-match function that tells font lock mode to check the next line when the anchor setv is encountered.

(defun setv-pre-match-form ()
  (forward-line))

(add-to-list 'setv-font-lock-kwd
             (list every-other-word-rgx
                   '(setv-pre-match-form) nil
                   '(1 font-lock-variable-name-face)))

Closer now, the example works and can be adjusted quite easily to determine the right number of lines to move forward.

But, performing edits on one line can cause inconsistent changes or even lose highlighting entirely on other lines.

What is going wrong?

Font Lock Regions

Editing within an assignment can cause the search for the anchored setv to occur from any point within the form. So finding the anchor will be unreliable.

Naturally the thought is: what if we traverse to the beginning of the form in the setv-pre-match-form so we always catch the match?

This turns out to fail as we might encounter multiple start/end combinations each within the same setv form, whom will buggily interact, overwrite, and possibly miss names entirely.

The arcane font-lock-extend-region-functions is responsible for setting the begin and end search regions of multiline fontifications.

Its documentation puts it well:

Its most common use is to solve the problem of identification of multiline elements by providing a function that tries to find such elements and move the boundaries such that they do not fall in the middle of one.

Promising!

Before we dive into it, lets understand the other remaining highlighting methods.

Font Locking with Functions

The MATCHER is the first form in a font lock keyword. The previous examples have it taking the value of a regex.

It can also be a function of one argument, a limiting point, that sets the match-data just as a regexp would, returning true if a match occurred.

The following would be equivalent to having setv-rgx as the MATCHER.

(defun match-setv (limit)
  (re-search-forward setv-rgx limit t))

But now we can do a lot more.

Lets restrict to matching setv that are only one parenthesis deep.

(defun match-setv (limit)
  (and (re-search-forward setv-rgx limit t)
       (= 1 (nth 0 (syntax-ppss)))))

This matcher performs highlighting conditional on the syntax!

We now have the building blocks of syntax-traversing highlighting.

Solution

A fully self-contained setv-mode to try out:

(setq setv-rgx (rx symbol-start "setv" symbol-end (1+ space) (group (1+ word))))
(setq setv-current-depth nil)

(defun setv-font-lock-extend-region ()
  "Extend assignment forms' regions, see `font-lock-extend-region-functions'."
  (save-excursion
    (let ((start-beg font-lock-beg)
          (start-end font-lock-end)
          (depth (nth 0 (syntax-ppss))))
      (when (and (< 0 depth)
                 (sp-beginning-of-sexp)
                 (string= "setv" (thing-at-point 'symbol)))

        (setq setv-current-depth depth)

        (setq font-lock-beg (1- (point)))
        (sp-end-of-sexp)
        (setq font-lock-end (1+ (point)))

        (or (/= start-beg font-lock-beg)  ; Signal possible changes to font-lock
            (/= start-end font-lock-end))))))

(defun setv-match-assignments (limit)
  "Recursively set `match-data' assignment names containing point until LIMIT.

`setv-font-lock-extend-region' prepares this function to:
1. Not traverse the same assignment form twice.
2. Have the initial call at form's start and passed limit at form's end.

The first name in each assignment is highlighted via a standard regex, so as to
keep the initial condition simple."
  (-when-let* ((start (point))
               (_ (sp-beginning-of-sexp))
               (_ (re-search-forward setv-rgx limit t)))
    (when (> start (point))  ; Resume traversal at last symbol
      (goto-char start))

    (sp-forward-sexp)

    (when (< (point) limit)
      (setq matched-word? (re-search-forward (rx (group (1+ word))) limit t))
      (setq descended? (and setv-current-depth
                            (> (nth 0 (syntax-ppss))
                               setv-current-depth)))

      (or (and matched-word? descended?
               (sp-up-sexp)
               (setv-match-assignments limit))
          matched-word?
          (setv-match-assignments limit)))))

(define-derived-mode setv-mode lisp-mode "Setv"
  (setq font-lock-multiline t)
  (add-to-list 'font-lock-extend-region-functions
               'setv-font-lock-extend-region)

  (setq setv-font-lock-kwds
        `((setv-match-assignments 1 font-lock-variable-name-face)
          (,setv-rgx 1 font-lock-variable-name-face)))

  (setq font-lock-defaults
        '(setv-font-lock-kwds
          nil nil
          (("+-*/.<>=!?$%_&~^:@" . "w"))
          nil nil
          (font-lock-mark-block-function . mark-defun))))

I collapsed it into a major mode to allow for M-x setv-mode to try out the highlighting yourself.

Lets break down what is occurring in each step:

Extending the region

We check if the form-opener containing point is an assignment.

If it is we must conform to font-lock-mode's bookkeeping by:

Setting the dynamically bound font-lock-beg and font-lock-end to the desired start/end of the form, for only assignment forms.
Tracking the depth of the assignment. The region expansion occurs once per assignment while the searching is recursive, so we set the depth at expansion-time.
Return whether the start or end changed during the region expansion.

Searching for assignments

Extending the region leaves us with the current point at the assignment form's opening and the limit at its close, and we will not restart the search from somewhere else within the form.

But we don't know whether the form is an assignment, we only know that the bounds are correct in the case that it is.

So first we check that the region we are considering is an assignment. We jump past one sexp, namely the value, and set match-data to the following with a regex search, as required by font-lock internals.

Now this match doesn't consider syntax, unlike the first jump. We check that we didn't just move forward into an embedded form. If we did, we need to skip this pair as we both do not want to highlight the form, and it would interfere with the sp-beginning-of-sexp on future calls. So we jump out and recurse.

Conclusions

The example mode demonstrates a particularly difficult form of syntax-highlighting and pulls together many more advanced features of Emac's font-lock-mode.

However there are still issues:

There is a performance cost to multiline highlighting, as noted in its documentation. How significant the impact is something I do not understand well yet.
While the names that are highlighted appear to be correct, application of highlighting to every name at all times is still inconsistent and might require edits on nearby parts of the buffer to take effect. My hunch is to investigate the other two region extension functions.

Altogether I'm once again impressed at the flexibility Emacs offers to tailor the display of text to your liking.