The code we write in an editor is not something a computer can execute directly. It is just human-friendly text. Keywords like if and for, variable names, and function calls are conventions for people, not for the CPU. At the hardware level, a processor understands only machine instructions.
A compiler is the bridge that closes that gap. It translates high-level source code into executable machine-level output through a sequence of structured stages.
In this post, we will walk through that journey in five steps, using language-agnostic examples.
Step 1: Lexical Analysis - Splitting Text into Tokens
The first stage breaks raw source code into meaningful units called tokens. Similar to how humans parse words and punctuation while reading, the compiler scans characters and groups them into categories for later stages.
For example:
result = a << 2
The important symbol here is <<. The lexer should treat it as one operator token, not two separate < characters.
In practice, the lexer will:
- group << into one token
- classify it as an operator token such as SHIFT_LEFT
- pass it to the parser for grammar-based interpretation
If tokenization is unstable, every stage after it inherits that instability. Many compiler bugs begin here.
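The << example can be sketched as a tiny lexer. This is a minimal illustration built on Python's re module, with made-up token names; real scanners have many more rules. The key detail is ordering: the two-character << pattern is tried before the single <, so the lexer emits one SHIFT_LEFT token instead of two LESS_THAN tokens.

```python
import re

# Ordered token rules: longer operators must come before their prefixes,
# otherwise "<<" would be scanned as two "<" tokens.
TOKEN_SPEC = [
    ("SHIFT_LEFT", r"<<"),
    ("LESS_THAN",  r"<"),
    ("ASSIGN",     r"="),
    ("NUMBER",     r"\d+"),
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("SKIP",       r"\s+"),          # whitespace is dropped, not emitted
]

def tokenize(source):
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    tokens = []
    for match in re.finditer(pattern, source):
        kind = match.lastgroup
        if kind != "SKIP":
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("result = a << 2"))
# [('IDENT', 'result'), ('ASSIGN', '='), ('IDENT', 'a'),
#  ('SHIFT_LEFT', '<<'), ('NUMBER', '2')]
```

Because regex alternation tries branches left to right, the rule order in TOKEN_SPEC is doing the real work here, which mirrors the "maximal munch" behavior of production lexers.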
Step 2: Syntax Analysis - Building the Grammar Tree
Once tokens are ready, the parser checks whether they match the language grammar and builds an AST (Abstract Syntax Tree).
Consider an array declaration:
int[] nums = {1, 2, 3};
A simplified AST might look like this:
VarDecl
├── name: nums
├── type: ArrayType
│   └── elementType: int
└── init: ArrayLiteral
    ├── 1
    ├── 2
    └── 3
The parser confirms the structure is grammatically valid and converts linear tokens into a tree that captures intent and hierarchy.
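The tree above can be represented with simple node objects. Here is a sketch using Python dataclasses; the node names mirror the diagram but are illustrative, not taken from any real compiler.

```python
from dataclasses import dataclass

@dataclass
class ArrayType:
    element_type: str

@dataclass
class ArrayLiteral:
    elements: list

@dataclass
class VarDecl:
    name: str
    type: ArrayType
    init: ArrayLiteral

# AST for: int[] nums = {1, 2, 3};
decl = VarDecl(
    name="nums",
    type=ArrayType(element_type="int"),
    init=ArrayLiteral(elements=[1, 2, 3]),
)

print(decl.name, decl.type.element_type, decl.init.elements)
# nums int [1, 2, 3]
```

Once tokens are in this shape, later stages can walk the tree recursively instead of re-scanning flat text.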
Step 3: Semantic Analysis - Verifying Meaning
Syntactically valid code is not always semantically valid. A statement like 1 + "hello" can be parsed but may still violate the language's type rules.
Semantic analysis checks things such as:
- type consistency
- scope validity for variables and symbols
- function call arguments against signatures
Example:
try {
    risky()
} catch (err) {
    print(err)
}
At this stage, the compiler verifies:
- err is registered in the catch block scope
- references to err inside the block are valid
- using err outside that scope correctly raises an error
This is the gate where code moves from “looks correct” to “is logically valid.”
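The scope rules above can be modeled with a stack of scopes, where lookups walk from the innermost scope outward. This is a toy sketch, with hypothetical names, of what a semantic analyzer does for the err example.

```python
# A toy scope checker: one set of names per scope, kept on a stack.
class ScopeError(Exception):
    pass

class Scopes:
    def __init__(self):
        self.stack = [set()]              # global scope

    def enter(self):
        self.stack.append(set())          # open a new inner scope

    def leave(self):
        self.stack.pop()                  # names declared here vanish

    def declare(self, name):
        self.stack[-1].add(name)

    def check(self, name):
        if not any(name in scope for scope in reversed(self.stack)):
            raise ScopeError(f"'{name}' is not in scope")

scopes = Scopes()
scopes.enter()                            # entering the catch block
scopes.declare("err")                     # catch (err) registers the name
scopes.check("err")                       # valid: reference inside the block
scopes.leave()                            # leaving the catch block
try:
    scopes.check("err")                   # invalid: reference outside the block
except ScopeError as e:
    print(e)                              # 'err' is not in scope
```

The stack discipline is the whole trick: declaring in the top scope and popping on exit is exactly why err stops existing outside the catch block.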
Step 4: Intermediate Representation - Converting to a Common Form
After semantic checks, the AST is lowered into an intermediate form such as LLVM IR. This representation is still not tied to one CPU architecture.
Why not generate machine code immediately?
- one frontend can support multiple target architectures
- optimization passes can run once at the IR level
- backend complexity is reduced significantly
IR is a strategic layer for portability and optimization efficiency.
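To make "lowering" concrete, here is a toy pass that flattens a nested expression AST into three-address instructions with fresh temporaries. The node shapes and opcode names are illustrative stand-ins, not real LLVM IR, but the shape of the output is similar in spirit.

```python
# A toy lowering pass: each operator node becomes one instruction that
# writes a fresh temporary, so nested expressions flatten into a sequence.
counter = 0

def new_temp():
    global counter
    counter += 1
    return f"%t{counter}"

def lower(node, out):
    if isinstance(node, (int, str)):      # literal or variable: use as-is
        return str(node)
    op, lhs, rhs = node                   # e.g. ("shl", "a", 2)
    l = lower(lhs, out)
    r = lower(rhs, out)
    t = new_temp()
    out.append(f"{t} = {op} {l}, {r}")
    return t

# result = (a << 2) + b
out = []
t = lower(("add", ("shl", "a", 2), "b"), out)
out.append(f"result = {t}")
print("\n".join(out))
# %t1 = shl a, 2
# %t2 = add %t1, b
# result = %t2
```

Nothing here mentions a CPU: the same instruction list could later be handed to an x86 backend or an ARM backend, which is the portability argument from the bullet list above.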
Step 5: Code Generation - Producing Machine Code
In the final stage, the backend turns IR into target-specific machine instructions (for x86, ARM, and others). The exact binary differs by architecture, while preserving the same language-level semantics.
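A sliver of that target-specific step can be sketched as instruction selection: mapping one abstract operation to different mnemonics per architecture. Real backends also handle register allocation, addressing modes, and scheduling; this table-driven toy only shows why the binaries differ while the semantics stay the same. (x86 uses the shl mnemonic for a left shift; ARM uses lsl.)

```python
# Toy instruction selection: one abstract op, different target mnemonics.
MNEMONICS = {
    "x86": {"shl": "shl", "add": "add"},
    "arm": {"shl": "lsl", "add": "add"},
}

def select(target, op, dst, src):
    return f"{MNEMONICS[target][op]} {dst}, {src}"

print(select("x86", "shl", "eax", "2"))   # shl eax, 2
print(select("arm", "shl", "r0", "#2"))   # lsl r0, #2
```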
After linking, the source code becomes an executable program. What started as plain text now runs as an actual process on the operating system.
Closing: Where This “How IT Works” Series Goes Next
This post is the entry point to a bigger question: how software really works beneath abstraction.
In the next post, we will continue with “Memory Management: Why You Need to Understand Pointers Deeply”.