From Source to Machine Code: 5 Compiler Steps


The code we write in an editor is not a language computers can execute directly. It is just human-friendly text. Keywords like if and for, variable names, and function calls are conventions for us, not for the CPU. At the hardware level, a processor only understands machine instructions.

A compiler is the bridge that closes that gap. It translates high-level source code into executable machine-level output through a sequence of structured stages.

In this post, we will walk through that journey in five steps, using language-agnostic examples.

Step 1: Lexical Analysis - Splitting Text into Tokens

The first stage breaks raw source code into meaningful units called tokens. Similar to how humans parse words and punctuation while reading, the compiler scans characters and groups them into categories for later stages.

For example:

result = a << 2

The important symbol here is <<. The lexer should treat it as one operator token, not two separate < characters.

In practice, the lexer will:

  • group << into one token
  • classify it as an operator token like SHIFT_LEFT
  • pass it to the parser for grammar-based interpretation

If tokenization goes wrong, every stage after it inherits the error. Many compiler bugs begin here.
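
To make this concrete, here is a minimal tokenizer sketch in Python. The token names (SHIFT_LEFT, IDENT, and so on) and the tiny pattern table are assumptions chosen for this post, not any real compiler's API; the key detail is that the longer operator pattern is tried first, so << wins over <.

import re

# Ordered patterns: longer operators come first so "<<" wins over "<".
TOKEN_SPEC = [
    ("SHIFT_LEFT", r"<<"),
    ("LESS_THAN",  r"<"),
    ("ASSIGN",     r"="),
    ("NUMBER",     r"\d+"),
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("SKIP",       r"\s+"),
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                if name != "SKIP":
                    tokens.append((name, match.group()))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"unexpected character: {source[pos]!r}")
    return tokens

print(tokenize("result = a << 2"))
# [('IDENT', 'result'), ('ASSIGN', '='), ('IDENT', 'a'), ('SHIFT_LEFT', '<<'), ('NUMBER', '2')]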

Step 2: Syntax Analysis - Building the Grammar Tree

Once tokens are ready, the parser checks whether they match the language grammar and builds an AST (Abstract Syntax Tree).

Consider an array declaration:

int[] nums = {1, 2, 3};

A simplified AST might look like this:

VarDecl
├── name: nums
├── type: ArrayType
│   └── elementType: int
└── init: ArrayLiteral
    ├── 1
    ├── 2
    └── 3

The parser confirms the structure is grammatically valid and converts linear tokens into a tree that captures intent and hierarchy.
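
To connect the diagram to code, here is a minimal sketch in Python that builds the same tree by hand. The node class names (VarDecl, ArrayType, ArrayLiteral) simply mirror the diagram above and are assumptions for illustration; a real parser would construct such nodes while consuming the token stream.

from dataclasses import dataclass

# Minimal AST node definitions mirroring the diagram above.
@dataclass
class ArrayType:
    element_type: str

@dataclass
class ArrayLiteral:
    elements: list

@dataclass
class VarDecl:
    name: str
    var_type: ArrayType
    init: ArrayLiteral

# What a parser might produce for: int[] nums = {1, 2, 3};
tree = VarDecl(
    name="nums",
    var_type=ArrayType(element_type="int"),
    init=ArrayLiteral(elements=[1, 2, 3]),
)
print(tree)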

Step 3: Semantic Analysis - Verifying Meaning

Syntactically valid code is not always semantically valid. An expression like 1 + "hello" can be parsed, yet it may still violate the language's type rules.

Semantic analysis checks things such as:

  • type consistency
  • scope validity for variables and symbols
  • function call arguments against signatures

Example:

try {
  risky()
} catch (err) {
  print(err)
}

At this stage, the compiler verifies:

  • err is registered in the catch block scope
  • references to err inside the block are valid
  • using err outside that scope correctly raises an error

This is the gate where code moves from “looks correct” to “is logically valid.”
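
Here is a minimal sketch of the scope rule in Python, assuming a symbol table implemented as a plain stack of scopes. Real compilers track far more per symbol (types, mutability, lifetimes), but the visibility check works the same way.

# A tiny symbol table: a stack of scopes, each scope a set of names.
class SymbolTable:
    def __init__(self):
        self.scopes = [set()]          # start with the global scope

    def enter_scope(self):
        self.scopes.append(set())

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name):
        self.scopes[-1].add(name)

    def resolve(self, name):
        if not any(name in scope for scope in self.scopes):
            raise NameError(f"'{name}' is not visible in this scope")

table = SymbolTable()

# Entering the catch block: err is declared in a fresh scope.
table.enter_scope()
table.declare("err")
table.resolve("err")                   # ok: err is visible inside the block

# Leaving the catch block: the scope, and err with it, disappears.
table.exit_scope()
try:
    table.resolve("err")               # err no longer exists here
except NameError as e:
    print("semantic error:", e)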

Step 4: Intermediate Representation - Converting to a Common Form

After semantic checks, the AST is lowered into an intermediate form such as LLVM IR. This representation is not yet tied to any single CPU architecture.

Why not generate machine code immediately?

  • one frontend can support multiple target architectures
  • optimization passes can run once at the IR level
  • backend complexity is reduced significantly

IR is a strategic layer for portability and optimization efficiency.
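
As a rough sketch of what lowering can look like, here is a Python example that turns the earlier assignment into a flat, three-address-code style instruction list. The tuple format and instruction names are assumptions for this post; the comment shows roughly how the shift would read in actual LLVM IR.

# Lower "result = a << 2" (already parsed into a tiny AST of nested tuples)
# into a flat, architecture-neutral instruction list.
def lower(ast):
    ir = []
    temp_count = 0

    def emit(node):
        nonlocal temp_count
        kind = node[0]
        if kind == "const":
            return str(node[1])
        if kind == "var":
            return node[1]
        if kind == "shl":
            lhs, rhs = emit(node[1]), emit(node[2])
            tmp = f"%t{temp_count}"
            temp_count += 1
            ir.append(("shl", tmp, lhs, rhs))   # roughly "%t0 = shl i32 %a, 2" in LLVM IR
            return tmp
        if kind == "assign":
            value = emit(node[2])
            ir.append(("store", node[1], value))
            return node[1]
        raise ValueError(f"unknown node kind: {kind}")

    emit(ast)
    return ir

ast = ("assign", "result", ("shl", ("var", "a"), ("const", 2)))
print(lower(ast))
# [('shl', '%t0', 'a', '2'), ('store', 'result', '%t0')]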

Step 5: Code Generation - Producing Machine Code

In the final stage, the backend turns IR into target-specific machine instructions (for x86, ARM, and others). The exact binary differs by architecture, but the language-level semantics are preserved.
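
As a minimal sketch of that last hop, assuming the tiny IR from the previous step, the same neutral instruction is mapped to different mnemonics depending on the selected target. Registers are hard-coded and only the shift is translated; real backends also handle register allocation, calling conventions, and instruction encoding.

# Map the neutral shift instruction onto two instruction sets
# (heavily simplified: registers are hard-coded, the store is omitted).
IR = [("shl", "%t0", "a", "2")]

MNEMONICS = {
    "x86-64": lambda amount: f"shl eax, {amount}",      # shift eax left
    "arm64":  lambda amount: f"lsl w0, w0, #{amount}",  # shift w0 left
}

def codegen(ir, target):
    emit = MNEMONICS[target]
    return [emit(args[-1]) for op, *args in ir if op == "shl"]

print(codegen(IR, "x86-64"))   # ['shl eax, 2']
print(codegen(IR, "arm64"))    # ['lsl w0, w0, #2']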

After linking, the source code becomes an executable program. What started as plain text now runs as an actual process on the operating system.

Closing: Where This “How IT Works” Series Goes Next

This post is the entry point to a bigger question: how software really works beneath abstraction.

In the next post, we will continue with “Memory Management: Why You Need to Understand Pointers Deeply”.