In this stage, the parser takes the list of tokens produced in the previous phase and generate a tree representation called an abstract syntax tree(AST).
But why AST? why not anything else? After tokenisation, the compiler needs to understand how those tokens fit together. Most programming languages are structured in layers, where small elements combine to form bigger ones. Since this structure is hierarchical, tree-like representations are a natural choice.
The parser reads tokens in order, applies grammar rules, and builds tree-shaped data structures called AST(Abstract Syntax Tree) that represent expressions, statements, and control flow.
<aside> <img src="/icons/info-alternate_blue.svg" alt="/icons/info-alternate_blue.svg" width="40px" />
The final AST doesn’t include every token from the original input. That’s why its called Abstract Syntax Tree. In above diagram, the output AST doesn’t include tokens like EQUAL or SEMICOLON.
</aside>
To build an AST, a parser needs a set of rules. This ruleset is called a formal grammar. It’s usually written in a BNF(Backus-Naur form)- or EBNF(extended Backus-Naur form)- like notation.
Below is the example grammar written in EBNF-ish notation.
<!--
'→' denotes a production rule (left-hand side expands to right-hand side)
'|' denotes alternatives
'*' denotes zero or more repetitions
Parentheses '()' are used for grouping
'EOF' represents the end of the input file
Each line in this grammar acts as a production rule, showing how a language construct is built from tokens and other constructs.
-->
program → statement* EOF
statement → var_decl
var_decl → "var" IDENTIFIER "=" expression ";"
expression → assignment
assignment → IDENTIFIER "=" assignment
| term
term → factor ( ("+" | "-") factor )*
factor → unary ( ("*" | "/") unary )*
unary → ("!" | "-") unary
| primary
primary → NUMBER | STRING | IDENTIFIER
<aside> <img src="/icons/info-alternate_blue.svg" alt="/icons/info-alternate_blue.svg" width="40px" />
A formal grammar is intended to use as a reference while writing the parser, so I won’t explicitly define these rules inside the compiler.
</aside>