Reactive templating: Lexing

This is the first part of a three-part series.

In this series I want to explore a solution for reactively binding a model to a view. This series will be light on code and heavy on concept. I’d like to cover the how and why instead of dumping hundreds of lines of code into the post.

The goal

I have an element in the DOM with this expression in it:

<div>{{ greeting + ", " + person.name + ("!" * excitement) }}</div>

A template expression, if you will.

The goal is to end up with a set of tokens like this:

[
  { type: "variable", value: "greeting" },
  { type: "operator", value: "+" },
  { type: "string", value: ", " },
  { type: "variable", value: "person.name" },
  { type: "operator", value: "+" },
  { type: "parenOpen", value: "(" },
  { type: "string", value: "!" },
  { type: "operator", value: "*" },
  { type: "variable", value: "excitement" },
  { type: "parenClose", value: ")" },
]

How do we accomplish this?

There are a few common approaches that I’ve seen to this problem. Probably the most obvious involves pattern matching with regular expressions. On my first attempt at this, that’s exactly what I did. It started out fine, but a number of issues with this quickly became evident.

The RegEx approach

I wrote a regular expression to split the template expression into tokens.

Easy enough. Sure. But what if I want the ability to support object property references with bracket notation?

<div>{{ greeting + ", " + person["first-name"] }}</div>

My regular expression in its current form was pretty useless at this point. The output was something like:

[ 'greeting', '+', '", "', '+', 'person[', '"first', '-', 'name"]' ]

Not great.

I could have done a second pass over the result array, but it’s not easy to extrapolate what each token’s type should be from output like that. Assumptions would need to be made, and that seems prone to failure. Probably not a good idea.

A more obvious choice is to modify the existing regular expression to add some new cases:

  1. Don’t match operators if they’re encased in quotes, because they’re just part of a string, e.g. classes["bg-gray-400"] <– don’t split on those hyphens.
  2. Don’t match something that begins with quotes if it immediately follows brackets, because that’s a property in bracket notation, e.g. person["name"].

That calls for some negative lookaheads and probably negative lookbehinds, which makes the regular expression much longer, even more indecipherable, and mostly just a maintenance burden.

What if I decide later on that I want to support template literals inside of the template expressions?

{{ `The coin landed on: ${Math.random() > 0.5 ? "heads" : "tails"} !` }}

The regular expression will likely need to be reworked. Not fun.

There has to be a better way.

Lexing

A standard lexer operates on its input character by character, generating tokens along the way. It doesn’t handle the tokens. It just creates them. This is also sometimes called tokenization.

To reiterate my goals, here’s what I want to support:

  1. Basic interpolation.
<div>{{name}}'s not here, man.</div>
  2. Chained filters that allow arguments.
<div>{{ name | uppercase(true) | randomGreet }}</div>
  3. Typical JavaScript expressions. Filters still supported.
<div>{{ Math.random() > 0.5 ? "Heads" : "Tails" | decideWinner }}</div>
  4. Template literals.
<div>{{ `The coin landed on: ${Math.random() > 0.5 ? "heads" : "tails"} !` }}</div>

Base lexer

Without any of our use-case-specific logic in place, the lexer is actually pretty simple. Its constructor accepts a string, it has some methods to move our cursor position around that string, and a method to emit tokens as we create them. For our purpose, that’s all we need.

export class Lexer {
  // This is the string we're lexing.
  str: string;

  // We'll be emitting tokens as we discover them.
  emitter: EventTarget;

  // This is our current position in the string.
  cursor: number = 0;

  // Becomes true when the current char is undefined.
  done: boolean = false;

  /**
   * Lexer accepts a string and emits tokens while lex() is running.
   * Interpolation delimiters should be omitted, e.g.: "name", not "{{name}}"
   *
   * @example
   * const lexer = new Lexer("Math.random() > 0.5 ? 'Heads' : 'Tails'");
   * lexer.emitter.addEventListener("token", (e) => {
   *   tokens.push(e.detail);
   * });
   **/
  constructor(str: string) {
    this.str = str;
    this.emitter = new EventTarget();
  }

  // Our parser will listen to the "token" event, storing tokens as they are emitted.
  emit(type: TokenType, value: unknown) {
    this.emitter.dispatchEvent(new CustomEvent("token", { detail: { type, value }}));
  }

  // It's sometimes useful to know the character in the previous position.
  lastChar() {
    return this.str[this.cursor - 1];
  }

  // It can be equally as useful to know what the next character is.
  nextChar() {
    return this.str[this.cursor + 1];
  }

  // The character at the current cursor position.
  char() {
    return this.str[this.cursor];
  }

  // Advance the cursor and return the character at the new position.
  next() {
    return this.str[++this.cursor];
  }

  lex() {
    // ...
  }
}

A token is a sequence or pattern of characters that inherits some classification. TokenType is an enum of the different classifications I’ll be lexing for.

Token is the type that Lexer will emit.

// These are the types of tokens we need to know about in order
// to support the features we want.
export enum TokenType {
  string = "string",
  number = "number",
  operator = "operator",
  separator = "separator",
  parenOpen = "paren-open",
  parenClose = "paren-close",
  pipe = "pipe",
  variable = "variable",
  boolean = "boolean",
  "filter-fn" = "filter-fn",
}

export type Token = {
  type: TokenType;
  value: string;
};

Now when I encounter a token, I can emit it. If I were to lex the expression:

5 + "!"

I want to end up with a set of tokens like this:

[
  { type: "number", value: 5 },
  { type: "operator", value: "+" },
  { type: "string", value: "!" }
]
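Wiring that up follows the pattern from the constructor’s docstring; a minimal sketch, assuming the tokens are simply collected into an array:

const tokens: Token[] = [];
const lexer = new Lexer('5 + "!"');

// Collect each token as the lexer emits it.
lexer.emitter.addEventListener("token", (e) => {
  tokens.push((e as CustomEvent<Token>).detail);
});

lexer.lex();
// `tokens` now holds the number, operator, and string tokens shown above.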

The main lex function is responsible for delegating work to sub-lexers when it sees a character that should be handled as that type of token.

  lex() {
    while (!this.done) {
      if (this.char() === undefined) {
        this.done = true;
        continue;
      }

      if (this.char().match(/^['"]/)) {
        return this.lexString();
      }

      // ... similar checks hand off to the other sub-lexers.
    }
  }

Above, lex returns lexString when it encounters a single or double quote. Similarly, lexString returns to lex when it finds a matching quote. Before it returns, it emits a TokenType.string.

  lexString() {
    let startingChar = this.char();
    let str = this.char();
    let done = false;

    while (!this.done && !done) {
      this.next();

      if (!this.char()) {
        done = true;
        break;
      }

      // If we found a quote that matches the one we started with,
      // and it's not prefixed by an escape character, this is the end of the string.
      if (this.char() === startingChar && this.lastChar() !== "\\") {
        str = str + this.char();
        done = true;
        this.next();
      } else {
        str = str + this.char();
      }
    }

    this.emit(TokenType.string, str);
    return this.lex();
  }

This is a very common pattern that basically all sub-lexers follow.
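For example, a number sub-lexer could follow the same shape. This is my own sketch; the post doesn’t show lexNumber itself:

lexNumber() {
  let num = this.char();

  // Keep consuming digits (and a decimal point) while they keep coming.
  while (this.nextChar() && this.nextChar().match(/[\d.]/)) {
    num = num + this.next();
  }

  // Step past the number, emit it, and hand control back to lex().
  this.next();
  this.emit(TokenType.number, Number(num));
  return this.lex();
}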

Contextual lexing

The tokens that get generated will in some cases depend on context. For example, when a pipe character is reached, an identifier that would normally be lexed as type variable is instead classified as filter-fn, while identifiers inside the parentheses that follow (the filter’s arguments) are still classified as variable (or keywords, strings, etc.).

In our data model, a filter function is technically just another variable scoped as any other variable would be, e.g.:

Model:

const model = {
  person: {
    name: "Buzz",
  },
  greetPerson: (name: string, excitement: number = 0) => {
    // Repeat the "!" once per level of excitement.
    return `Hi ${name}` + "!".repeat(excitement);
  },
};

Expression:
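An expression that applies greetPerson as a filter to person.name might look something like this:

<div>{{ person.name | greetPerson(2) }}</div>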

Variable vs. filter function

Even though a filter function is technically just another variable, since both live within the same scope of the same model, it will still prove useful for code generation purposes to classify them differently.
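One way to make the distinction (a sketch; the sawPipe flag and helper names are my assumptions, not code from this lexer) is to remember that a pipe was just seen and classify the next identifier accordingly:

// Assumes a `sawPipe: boolean = false` field on the Lexer.
lexPipe() {
  this.emit(TokenType.pipe, "|");
  this.sawPipe = true;
  this.next();
  return this.lex();
}

// Called from lexIdentifier once the identifier `str` has been read.
emitIdentifier(str: string) {
  if (this.sawPipe) {
    // The first identifier after a pipe is the filter function itself.
    this.emit(TokenType["filter-fn"], str);
    this.sawPipe = false;
  } else {
    this.emit(TokenType.variable, str);
  }
}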

Template literals

String interpolation inside of a template literal allows for embedded expressions.

[Figure: template literal]

At first this seemed challenging, but there’s actually a nice solution to this.

When the expression portion of a template literal is entered, lexing logic falls back to the base lexer because it knows how to read expressions. The main difference is that because inTemplateLiteral is now true, when a } character is seen, logic once again returns to the template literal lexer to continue until its matching backtick is seen.

Something like this:

[Figure: template literal]
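In code, the hand-off might look roughly like this. The inTemplateLiteral flag and the lexTemplateLiteral name are my guesses at the structure; the post doesn’t show the full implementation:

// Assumes an `inTemplateLiteral: boolean = false` field on the Lexer.
lexTemplateLiteral() {
  // Starts on the opening backtick, or on the "}" that resumed us.
  let str = this.char();

  while (this.next()) {
    str = str + this.char();

    // "${" starts an embedded expression: emit what we have so far and let
    // the base lexer take over. inTemplateLiteral tells lex() to return here
    // when it reaches the matching "}".
    if (this.char() === "{" && this.lastChar() === "$") {
      this.inTemplateLiteral = true;
      this.emit(TokenType.string, str);
      this.next();
      return this.lex();
    }

    // The closing backtick ends the literal.
    if (this.char() === "`") {
      this.inTemplateLiteral = false;
      this.emit(TokenType.string, str);
      this.next();
      return this.lex();
    }
  }
}

// And in lex():
// if (this.char() === "}" && this.inTemplateLiteral) {
//   return this.lexTemplateLiteral();
// }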

Here’s the output:

[
  { "type": "string", "value": "`The coin landed on ${" },
  { "type": "variable", "value": "[\"Math\"][\"random\"]" },
  { "type": "paren-open", "value": "(" },
  { "type": "paren-close", "value": ")" },
  { "type": "operator", "value": ">" },
  { "type": "number", "value": 0.5 },
  { "type": "operator", "value": "?" },
  { "type": "string", "value": "'heads'" },
  { "type": "operator", "value": ":" },
  { "type": "string", "value": "'tails'" },
  { "type": "string", "value": "}! Great!`" }
]

Nice.

You’ll notice that I’m capturing `The coin landed on ${ instead of just The coin landed on. Likewise, the last token’s string value is }! Great!` instead of just Great!. This is intentional, and it makes reconstructing the template literal easier later on when generating code.

Also for code generation purposes, the token value for Math.random has been modified to the bracket notation representation of the same thing. Later, this will expand to either model["Math"]["random"] or window["Math"]["random"] if the former is undefined. More on this later.
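The conversion itself is simple enough; a rough sketch of the idea (my own, not necessarily how the lexer implements it):

// Turn a dotted variable path into its bracket-notation equivalent.
const toBracketNotation = (path: string) =>
  path
    .split(".")
    .map((segment) => `["${segment}"]`)
    .join("");

toBracketNotation("Math.random"); // '["Math"]["random"]'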

Other considerations

Reserved words

Reserved words should not be classified as variables; identifiers with these names will never exist in the model. Instead, once an identifier has been found, it’s compared against a list of all reserved words. If there’s a match, it’s classified as TokenType.keyword.

Reserved words in most cases expect a space character to follow, e.g.:

else return true

Not:

elsereturntrue

The lexer currently ignores spaces altogether. So if the token is a keyword, I’ll explicitly wrap it in spaces to be safe.

lexIdentifier() {
  // ...
  this.emit(TokenType.keyword, ` ${str} `);
}
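Fleshing that out a little, the check might look like this. The RESERVED_WORDS list and the exact shape of the branch are assumptions on my part:

// Illustrative only: the real list would cover every JavaScript reserved word.
const RESERVED_WORDS = ["if", "else", "return", "typeof", "instanceof", "in", "new", "void"];

// Inside lexIdentifier(), once the identifier has been read into `str`:
if (RESERVED_WORDS.includes(str)) {
  // Pad keywords with spaces, since whitespace is otherwise dropped.
  this.emit(TokenType.keyword, ` ${str} `);
} else {
  // (The real lexer also rewrites variable paths into bracket notation here.)
  this.emit(TokenType.variable, str);
}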

Taking it further: nested contexts

In its current state, the lexer wouldn’t handle the following expression very well:

{ hello: "world" }

The generated tokens would look like this:

[
  { "type": "variable", "value": "[\"{\"]" },
  { "type": "variable", "value": "[\"hello\"]" },
  { "type": "operator", "value": ":" },
  { "type": "string", "value": "\"world\"" },
  { "type": "variable", "value": "[\"}\"]" },
]

Not great, but it makes sense: the lexer doesn’t have any logic for handling objects or arrays. Beyond that, it also needs to understand how to correctly classify identifiers when they appear inside the context of an object or an array.

I’ve opted to track the current lexing context by storing an array of contexts, so that no matter how deeply nested I am, I can always know what the current context is.

export class Lexer {
  // ...
  typeContexts: TypeContext[] = [];
}

enum TypeContext {
  object = "object",
  array = "array",
}

I add some new rules to the lexer so that it can recognize the beginnings and endings of objects and arrays. Something like this:

 1lex() {
 2  // ... 
 3  if (this.char() === "{") {
 4    this.typeContexts.push(TypeContext.object);
 5    this.emit(TokenType.operator, "{");
 6    this.next();
 7    return this.lex();
 8  }
 9
10  if (this.char() === "}") {
11    if (this.typeContexts[this.typeContexts.length - 1] === TypeContext.object) {
12      this.typeContexts.pop();
13      this.emit(TokenType.operator, "}");
14      this.next();
15      return this.lex();
16    }
17  }
18}

On line 4, I push a new TypeContext.object onto the typeContexts stack when I see a { character. Now lexIdentifier can check whether it’s currently within the context of an object. If it is, tokens that would otherwise be classified as variable can instead be classified more correctly as object-property when they precede a : operator, or as variable (or whatever else fits) when they follow one, i.e. when they’re a property value.

On line 10, I check if the current character is }. If it is, I check if the current context is TypeContext.object. If it is, I pop the current context off the stack and emit a } token.

I’ll repeat this logic for arrays as well. You can visualize it like this:

[Figure: tracking contexts]

Now that I know the most recent context, I can emit the correct token type. Like this:

[Figure: tracking contexts, correct token type]
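In code, that decision might look roughly like this. peekPastSpaces and the object-property token type are my own names for pieces the post doesn’t spell out:

// Sketch: inside lexIdentifier, once `str` has been read.
// peekPastSpaces() is a hypothetical helper that looks ahead to the next
// non-space character without moving the cursor.
const context = this.typeContexts[this.typeContexts.length - 1];

if (context === TypeContext.object && this.peekPastSpaces() === ":") {
  // The identifier precedes a ":", so it's a property key.
  // (This assumes an "object-property" member added to TokenType.)
  this.emit(TokenType["object-property"], str);
} else {
  // Otherwise it's a property value: a variable, or whatever else it lexes as.
  this.emit(TokenType.variable, str);
}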

Now that I’m tracking context in this way, the following expression generates usable tokens:

{ hello: "world", arr: [1, name, { abc: 123 }, "a string"] } | stringify

Output tokens:

[
  { type: "operator", value: "{" },
  { type: "object-property", value: "hello" },
  { type: "operator", value: ":" },
  { type: "string", value: '"world"' },
  { type: "separator", value: "," },
  { type: "object-property", value: "arr" },
  { type: "operator", value: ":" },
  { type: "operator", value: "[" },
  { type: "number", value: 1 },
  { type: "separator", value: "," },
  { type: "variable", value: '["name"]' },
  { type: "separator", value: "," },
  { type: "operator", value: "{" },
  { type: "object-property", value: "abc" },
  { type: "operator", value: ":" },
  { type: "number", value: 123 },
  { type: "operator", value: "}" },
  { type: "separator", value: "," },
  { type: "string", value: '"a string"' },
  { type: "operator", value: "]" },
  { type: "operator", value: "}" },
  { type: "pipe", value: "|" },
  { type: "filter-fn", value: '["stringify"]' },
]

Great! No matter how deeply nested the expression goes, we can count on the lexer to generate the right tokens.

It even handles this annoying and pointless nested template literal:

`Nonsense ${Math.random() > 0.5 ? `G: ${Math.random()}` : `L: ${Math.random()}` }`

Which gives me these very usable tokens:

[
  { type: "string", value: "`Nonsense ${" },
  { type: "variable", value: '["Math"]["random"]' },
  { type: "paren-open", value: "(" },
  { type: "paren-close", value: ")" },
  { type: "operator", value: ">" },
  { type: "number", value: 0.5 },
  { type: "operator", value: "?" },
  { type: "string", value: "`G: ${" },
  { type: "variable", value: '["Math"]["random"]' },
  { type: "paren-open", value: "(" },
  { type: "paren-close", value: ")" },
  { type: "string", value: "}`" },
  { type: "operator", value: ":" },
  { type: "string", value: "`L: ${" },
  { type: "variable", value: '["Math"]["random"]' },
  { type: "paren-open", value: "(" },
  { type: "paren-close", value: ")" },
  { type: "string", value: "}`" },
  { type: "string", value: "}`" },
]