Tree-sitter Explained

Mar 26, 2024

For a long time now, I’ve wanted my terminal highlighter (highlight) and my text editor (neovim) to agree on syntax highlighting colors. That search led me to tree-sitter, an external project that neovim uses for highlighting. I read a lot and found that many resources treat tree-sitter only as a neovim dependency and talk about neovim exclusively. I want this post to serve as a resource for non-neovim tree-sitter knowledge.

The Basics: Tree-sitter and Parsers

So what exactly is tree-sitter? Tree-sitter is a parser generator, which means given a programming language grammar, tree-sitter creates a parser.

Grammar ----> [ Tree-sitter ] ----> Parser

A parser is a program that turns human-readable code into an abstract syntax tree (AST), a structure that programs can work with. This is useful for anything that needs to understand code, like syntax highlighters and compilers.

Code -----> [ Parser ] ----> AST
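To make the diagram concrete, here is a minimal hand-written parser sketch in Python. It is not how tree-sitter works internally and the node names ("binary", "number") are made up for illustration; it only shows text going in and a tree coming out.

```python
import re

def tokenize(src):
    # Split the source into numbers and single-character operators.
    return re.findall(r"\d+|[-+*/()]", src)

def parse(tokens):
    """Parse tokens into a nested-tuple syntax tree (hypothetical node names)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():  # expr := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = ("binary", op, node, term())
        return node

    def term():  # term := atom (("*" | "/") atom)*
        node = atom()
        while peek() in ("*", "/"):
            op = advance()
            node = ("binary", op, node, atom())
        return node

    def atom():  # atom := NUMBER | "(" expr ")"
        tok = advance()
        if tok == "(":
            node = expr()
            advance()  # consume the closing ")"
            return node
        return ("number", int(tok))

    return expr()

tree = parse(tokenize("1 + 2 * 3"))
print(tree)
```

The multiplication ends up nested inside the addition, so the tree carries the precedence that the flat text only implies.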

Note that when compiling code to a binary, parsing is one of the first steps.

Language Binding

As useful as a parser sounds, you can’t really use it alone. You need to be able to call the parser from the rest of your project. This is done with a language binding, which is essentially a library for your project’s language that binds to the parser.

For example, to write a Python compiler in Rust, you need a Python language parser and a Rust language binding.

1.
+-----------------------------+
| Compiler source code (Rust) |
| For Target Language: Python | Compile
|              +              | -------> [Python Compiler]
|    Rust Language Binding    |
+-----------------------------+

2.
Code -> [Python Compiler] -> Results
          | Calls   ^
          V         | AST
         [Python Parser]
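As an analogy only (this is not tree-sitter), Python’s standard ast module plays a similar role to a binding: it lets Python code call CPython’s built-in parser, which is implemented in C, and get the tree back as ordinary Python objects. A rough sketch:

```python
import ast

source = "print('Hello World')"

# ast.parse hands the source text to CPython's built-in parser and
# returns the resulting tree as ordinary Python objects.
tree = ast.parse(source)

call = tree.body[0].value           # the call expression node
print(type(call).__name__)          # node type of the call
print(call.func.id)                 # name of the function being called
print(call.args[0].value)           # value of the first argument
```

A tree-sitter binding does the same kind of job: your program never touches the parser’s internals, it just calls into it and walks the nodes it gets back.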

Parsing In Practice

In tree-sitter, a language’s grammar is specified as a grammar.js file. Tree-sitter needs some other information as well, so most grammars are whole npm projects, like the tree-sitter-c grammar for the C language. Grammars and language bindings are already available for most common languages.

The output of the parser, the AST, is represented as an S-expression. For example, this source code:

#include <stdio.h>
int main(){
  printf("Hello World");
}

The parser outputs this expression (comments are row and column numbers):

(preproc_include) ; [1:1 - 3:0]
 path: (system_lib_string) ; [1:10 - 18]
(function_definition) ; [3:1 - 5:1]
 type: (primitive_type) ; [3:1 - 3]
 declarator: (function_declarator) ; [3:5 - 10]
  declarator: (identifier) ; [3:5 - 8]
  parameters: (parameter_list) ; [3:9 - 10]
 body: (compound_statement) ; [3:11 - 5:1]
  (expression_statement) ; [4:3 - 24]
   (call_expression) ; [4:3 - 23]
    function: (identifier) ; [4:3 - 8]
    arguments: (argument_list) ; [4:9 - 23]
     (string_literal) ; [4:10 - 22]
;    and so on...
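If the notation looks odd, this small Python sketch shows how a tree turns into that kind of text. The Node class is hypothetical and the tree is built by hand from the first few nodes of the output above; a real parser would produce it for you.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    type: str                              # node type, e.g. "identifier"
    field_name: Optional[str] = None       # field label in the parent, if any
    children: List["Node"] = field(default_factory=list)

def to_sexp(node: Node) -> str:
    """Render a node and its children as a one-line S-expression."""
    parts = [node.type] + [to_sexp(child) for child in node.children]
    text = "(" + " ".join(parts) + ")"
    return f"{node.field_name}: {text}" if node.field_name else text

# Hand-built fragment of the function_definition node shown above.
tree = Node("function_definition", children=[
    Node("primitive_type", "type"),
    Node("function_declarator", "declarator", [
        Node("identifier", "declarator"),
        Node("parameter_list", "parameters"),
    ]),
])
print(to_sexp(tree))
```

Each pair of parentheses is one node, and a prefix like "type:" is the field name that node has inside its parent.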

Syntax Highlighting

A syntax highlighter has 3 parts:

1. Something that parses the code into an AST
2. Something that reads the AST and categorizes each part into a syntax category
3. Something that reads a color theme file and colors the text based on the syntax category

In a basic syntax highlighter, parts 1 and 2 are merged. You use regular expressions to find syntax categories by reading the code directly, no AST needed. A more complex highlighter has all 3 parts. The difference is that an AST-based highlighter can categorize code more reliably, since it “understands” the grammar of the code instead of just searching for strings.
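Here is a deliberately simplified Python sketch of the regex approach going wrong. The keyword list is made up, and real regex highlighters add extra rules for strings and comments, but the underlying weakness is the same.

```python
import re

# Hypothetical keyword rule of a very basic regex highlighter.
KEYWORD = re.compile(r"\b(?:int|return)\b")

def regex_keywords(code):
    """Return (start, end, text) for every keyword-looking match."""
    return [(m.start(), m.end(), m.group()) for m in KEYWORD.finditer(code)]

code = 'int main(){ printf("return"); }'
matches = regex_keywords(code)
print(matches)
```

The second match is the word return inside the string literal, which is not a keyword at all. A parser would report that whole span as a string_literal node, so the mistake cannot happen.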

Using tree-sitter, we have part 1 done with the parser.

For categorizing syntax, tree-sitter uses tree queries, which are patterns matched against the AST. In practice, the queries live in a highlights.scm file. You write a query for each kind of node in the tree and assign it a highlight name (syntax category).

(type_identifier) @type ; assign @type to AST's type_identifier node
(int_literal) @number ; assign @number to int_literal node
"func" @keyword ; assign @keyword to "func" string
"return" @keyword ; ...

For part 3, you can use the highlight names to color your file in any way you want. It’s not tied to tree-sitter.
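To show parts 2 and 3 working together, here is a toy Python sketch. Everything in it is an assumption for illustration: RULES is a much simpler stand-in for highlights.scm (real queries are patterns, not a lookup table), THEME is a made-up color theme, and the tree is written by hand instead of coming from a parser.

```python
# Hypothetical stand-in for highlights.scm: node type -> highlight name.
RULES = {"primitive_type": "type", "string_literal": "string"}

# Hypothetical theme: highlight name -> ANSI color code.
THEME = {"type": "32", "string": "33"}

def find_captures(node, rules=RULES):
    """Part 2: walk the tree, yielding (start, end, highlight_name)."""
    node_type, start, end, children = node
    if node_type in rules:
        yield (start, end, rules[node_type])
    for child in children:
        yield from find_captures(child, rules)

def colorize(code, captures, theme=THEME):
    """Part 3: wrap each captured span in ANSI escapes (spans must not overlap)."""
    out, pos = [], 0
    for start, end, name in sorted(captures):
        out.append(code[pos:start])
        color = theme.get(name)
        span = code[start:end]
        out.append(f"\x1b[{color}m{span}\x1b[0m" if color else span)
        pos = end
    out.append(code[pos:])
    return "".join(out)

code = 'int x = "hi";'
# Hand-built tree for the line above: (type, start, end, children).
tree = ("declaration", 0, 13, [
    ("primitive_type", 0, 3, []),
    ("identifier", 4, 5, []),
    ("string_literal", 8, 12, []),
])
captures = list(find_captures(tree))
print(colorize(code, captures))
```

The coloring step only sees highlight names and positions, never the tree, which is why the theme can be swapped without touching the parser or the queries.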

Hands-on: Let’s Try Syntax Highlighting

Tree-sitter is just a parser generator, but its CLI can also be used as a tool to test parsing and highlighting. So let’s try it out!

In this example, we’ll try to highlight some C code.

First you need the C parser. We will get it from the tree-sitter-c repository. Note that for the tree-sitter CLI to be able to test our language, the project directory needs to be named tree-sitter-something.

# This will be cloned to /path/to/my-parent-dir/tree-sitter-c
cd my-parent-dir
git clone https://github.com/tree-sitter/tree-sitter-c.git

Then, the parent directory needs to be declared in your tree-sitter config file. On Linux this would be ~/.config/tree-sitter/config.json.

Add the following line:

{
  "parser-directories": [
    "/path/to/my-parent-dir"
  ]
}

The color theme is also defined there.

Then create a C file to test the syntax highlighting:

// test.c
#include <stdio.h>
int main(){
  printf("Hello World");
}

Finally, run the tree-sitter CLI:

tree-sitter highlight test.c

Neovim

So how does neovim fit into the picture? Neovim acts as the syntax highlighter on top of a tree-sitter parser. It calls the parser through its own binding, reads the AST, and does the syntax highlighting by itself.

Code ----> [ Neovim ] ----> Colored Code
            | Code ^
            v      | AST
           [ Parser ]

The nvim-treesitter plugin relates to neovim like this: neovim reads the AST and does the coloring on its own, and nvim-treesitter handles the surrounding work of loading parsers and queries for neovim. You can see this in ~/.local/share/nvim/nvim-treesitter/lua/nvim-treesitter/parsers.lua, where nvim-treesitter just registers parsers for neovim.

When debugging tree-sitter in neovim, you can use the :InspectTree command to show the AST.

Neovim’s Compiled Parsers

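One important aspect of neovim’s parsers is that they are compiled binaries: shared libraries (.so files on Linux) that neovim loads at runtime. Because a parser is just a shared library with a C interface, other programs can load it too, which makes parsers an attractive target for a multitude of applications. This works because tree-sitter generates each parser as C source code, which a C compiler can turn into a .so file.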

You can find some compiled parsers in ~/.local/share/nvim/nvim-treesitter/parser/langname.so

Compiling one yourself is just as easy:

cd tree-sitter-langname
# Leave out src/scanner.c if the grammar has no external scanner
gcc -o langname.so -shared -fPIC src/parser.c src/scanner.c -I./src -lstdc++

So these are just the basics of tree-sitter parsers. Read more at https://tree-sitter.github.io

I hope this post serves to help people understand tree-sitter better, not just for neovim applications.

Toftpokk’s Substack