Parser Implementation Guide¶
This guide covers best practices for implementing a plain-text accounting (PTA) parser with error recovery and source-location tracking.
Error Recovery Philosophy¶
- Parse as much as possible - Don't stop at the first error
- Produce a useful AST - Partial results are valuable
- Accurate locations - Errors point to exact source positions
- Cascading prevention - Avoid spurious follow-on errors triggered by earlier failures
Source Locations¶
Span Type¶
Track byte offsets for precise source mapping:
Span {
    start: usize, // Byte offset of start
    end: usize,   // Byte offset of end (exclusive)
}
Operations:
- len() - Length in bytes
- merge(other) - Combine two spans
- text(source) - Extract text from source
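A minimal Rust sketch of this type, with the method names following the list above (everything else is illustrative, not a fixed API):

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize, // byte offset of start
    pub end: usize,   // byte offset of end (exclusive)
}

impl Span {
    /// Length of the span in bytes.
    pub fn len(&self) -> usize {
        self.end - self.start
    }

    /// Smallest span covering both spans, e.g. for building a
    /// transaction span from its date and its last posting.
    pub fn merge(&self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }

    /// The source text the span covers.
    pub fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}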
Source Location¶
Convert byte offsets to human-readable locations:
SourceLocation {
    file: Path,  // File path
    line: u32,   // 1-based line number
    column: u32, // 1-based column number
    length: u32, // Length in characters
    span: Span,  // Original byte span
}
Spanned AST Nodes¶
Every AST node should carry its span:
Spanned<T> {
    value: T,
    span: Span,
}
This enables precise error reporting for any element.
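In Rust this is a small generic wrapper; the map helper below is an optional convenience (an assumption, not a required part of the type) for lowering one node kind into another without losing the span:

#[derive(Debug, Clone)]
pub struct Spanned<T> {
    pub value: T,
    pub span: Span,
}

impl<T> Spanned<T> {
    /// Transform the value while keeping the original span.
    pub fn map<U>(self, f: impl FnOnce(T) -> U) -> Spanned<U> {
        Spanned { value: f(self.value), span: self.span }
    }
}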
Recovery Strategies¶
1. Synchronization Points¶
Recover at well-defined syntax boundaries:
parse_directive():
    try:
        return parse_directive_inner()
    catch error:
        errors.push(error)
        synchronize_to_next_directive()
        return RECOVERED

synchronize_to_next_directive():
    while not at_end():
        skip_to_newline()
        advance_newline()
        if peek_is_date():
            return
2. Error Productions¶
Include error cases in the parser:
parse_posting():
    account = parse_account()
    try:
        units = parse_amount()
    catch recoverable_error:
        errors.push(warning("Invalid amount, treating as missing"))
        units = None  // Treat as an interpolated posting
    // Continue parsing cost, price, etc.
    return Posting { account, units, ... }
3. Insertion Recovery¶
Insert missing tokens when safe:
parse_transaction():
    date = parse_date()
    if check(STAR) or check(BANG):
        flag = parse_flag()
    else:
        errors.push(warning("Missing transaction flag, assuming '*'"))
        flag = Spanned(Flag.Complete, current_span())
    // Continue...
4. Deletion Recovery¶
Skip unexpected tokens:
parse_postings():
    postings = []
    while check_indent():
        try:
            postings.push(parse_posting())
        catch error:
            errors.push(error)
            skip_to_newline()  // Skip the bad posting
    return postings
Error Message Quality¶
Quality Criteria¶
- Specific - "Expected ')' to close '(' at line 42", not "Syntax error"
- Actionable - Suggest fixes when possible
- Located - Point to the exact character
- Contextual - Show the surrounding code
Error Structure¶
ParseError {
    kind: ParseErrorKind,
    message: String,
    span: Span,
    notes: Vec<Note>,
    suggestions: Vec<Suggestion>,
}

Note {
    message: String,
    span: Option<Span>,
}

Suggestion {
    message: String,
    replacement: String,
    span: Span,
}
Example Error Output¶
error[E0001]: Unexpected token
  --> ledger.beancount:42:15
   |
42 | Assets:Cash 100 $USD
   |                 ^^^^
   |                 expected amount, found '$'
   |
   = note: currency names cannot start with '$'
   = suggestion: remove the '$' prefix
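A sketch of a renderer that produces output in this shape, reusing the line_starts/line_col helpers from earlier and assuming a Rust version of the ParseError structure above (the exact formatting is illustrative):

fn render(err: &ParseError, file: &str, source: &str) -> String {
    let starts = line_starts(source);
    let (line, col) = line_col(&starts, err.span.start);
    let line_text = source.lines().nth(line as usize - 1).unwrap_or("");
    let mut out = format!("error: {}\n --> {}:{}:{}\n   |\n",
                          err.message, file, line, col);
    // Show the offending line, then underline the span with carets.
    out.push_str(&format!("{:2} | {}\n", line, line_text));
    out.push_str(&format!("   | {}{}\n",
                          " ".repeat(col as usize - 1),
                          "^".repeat(err.span.len().max(1))));
    for note in &err.notes {
        out.push_str(&format!("   = note: {}\n", note.message));
    }
    for s in &err.suggestions {
        out.push_str(&format!("   = suggestion: {}\n", s.message));
    }
    out
}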
Source Location Through Transformations¶
The Challenge¶
Directives are transformed through multiple phases:
- Parse → AST with spans
- Include expansion → Multiple files merged
- Interpolation → New amounts added
- Pad expansion → Synthetic transactions
- Plugin processing → Arbitrary transformations
Errors in later phases must point to original source.
Approach 1: Carry Original Spans¶
Amount {
    number: Decimal,
    currency: Currency,
    span: Option<Span>,     // None if synthesized
    origin: Option<Origin>, // For synthesized values
}

Origin =
    | Interpolated { from_transaction: Span }
    | Padded { from_pad: Span, from_balance: Span }
    | Plugin { plugin_name: String }
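When reporting against a synthesized value, fall back from the missing direct span to the origin's span. A sketch, assuming Rust versions of the shapes above:

/// Best span to point at: the amount's own span when it came from
/// source text, otherwise the span of whatever produced it.
fn report_span(amount: &Amount) -> Option<Span> {
    amount.span.or_else(|| match &amount.origin {
        Some(Origin::Interpolated { from_transaction }) => Some(*from_transaction),
        Some(Origin::Padded { from_pad, .. }) => Some(*from_pad),
        _ => None, // plugin-created values may have no source span
    })
}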
Approach 2: Transformation Log¶
Track all transformations:
TransformEntry =
    | Interpolated {
          transaction_span: Span,
          posting_index: usize,
          computed_amount: Amount,
      }
    | PadExpanded {
          pad_span: Span,
          balance_span: Span,
          generated_transaction: Transaction,
      }
    | PluginModified {
          plugin: String,
          original_span: Span,
          description: String,
      }
Include File Tracking¶
Maintain a source map for merged files:
SourceMap {
    files: Vec<SourceFile>,
    offset_map: Vec<(merged_offset, file_id, local_offset)>,
}

SourceFile {
    path: Path,
    content: String,
    start_offset: usize, // Offset in merged source
}
Convert merged offsets to file locations via binary search.
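A sketch of that lookup, assuming `files` is sorted by `start_offset` and the first file starts at offset 0:

use std::path::PathBuf;

pub struct SourceFile {
    pub path: PathBuf,
    pub content: String,
    pub start_offset: usize, // offset in merged source
}

pub struct SourceMap {
    pub files: Vec<SourceFile>, // sorted by start_offset
}

impl SourceMap {
    /// Map an offset in the merged source back to the file it came
    /// from and the local offset within that file.
    pub fn resolve(&self, merged: usize) -> (&SourceFile, usize) {
        // Binary search for the last file starting at or before `merged`.
        let idx = self.files.partition_point(|f| f.start_offset <= merged) - 1;
        let file = &self.files[idx];
        (file, merged - file.start_offset)
    }
}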
Error Aggregation¶
Collecting Errors¶
ErrorCollector {
    errors: Vec<Error>,
    warnings: Vec<Error>,
    max_errors: usize,
}

methods:
    error(e)       - Add error if under limit
    warning(w)     - Add warning (no limit)
    should_abort() - True if max_errors reached
    finish()       - Return all errors and warnings
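A direct Rust translation, generic over the error type (a sketch):

pub struct ErrorCollector<E> {
    errors: Vec<E>,
    warnings: Vec<E>,
    max_errors: usize,
}

impl<E> ErrorCollector<E> {
    pub fn new(max_errors: usize) -> Self {
        Self { errors: Vec::new(), warnings: Vec::new(), max_errors }
    }

    /// Record an error unless the cap is reached.
    pub fn error(&mut self, e: E) {
        if self.errors.len() < self.max_errors {
            self.errors.push(e);
        }
    }

    /// Warnings are never capped.
    pub fn warning(&mut self, w: E) {
        self.warnings.push(w);
    }

    /// The parser polls this to bail out of a hopeless file.
    pub fn should_abort(&self) -> bool {
        self.errors.len() >= self.max_errors
    }

    pub fn finish(self) -> (Vec<E>, Vec<E>) {
        (self.errors, self.warnings)
    }
}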
Error Deduplication¶
Avoid duplicates from cascading failures:
error_if_new(error):
    if not errors.any(e => e.span.overlaps(error.span)):
        errors.push(error)
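The overlaps test used above is a one-liner on Span:

impl Span {
    /// True if the two half-open spans share at least one byte.
    pub fn overlaps(&self, other: Span) -> bool {
        self.start < other.end && other.start < self.end
    }
}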
Cascading Prevention¶
Mark tokens as synthetic during recovery:
Token {
    kind: TokenKind,
    span: Span,
    is_synthetic: bool, // True if inserted during recovery
}
Skip error reporting for synthetic tokens to prevent cascading.
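In practice this is a zero-width constructor plus a guard at every reporting site (a sketch):

impl Token {
    /// Zero-width token invented during recovery.
    fn synthetic(kind: TokenKind, at: usize) -> Token {
        Token { kind, span: Span { start: at, end: at }, is_synthetic: true }
    }
}

/// The user never wrote a synthetic token, so complaining about it
/// would only echo the error that triggered recovery.
fn should_report(token: &Token) -> bool {
    !token.is_synthetic
}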
Testing Error Recovery¶
Recovery Tests¶
test_recovery_missing_flag():
    source = """
2024-01-01 "Deposit"
  Assets:Cash 100 USD
  Income:Salary
"""
    (ledger, errors) = parse_with_recovery(source)
    assert ledger.transactions().count() == 1  // Recovered
    assert errors.any(e => e.message.contains("flag"))
test_recovery_continues_after_error():
    source = """
2024-01-01 * "First"
  Assets:Cash 100 USD
  Income:Salary
2024-01-02 * "Invalid"
  Assets:Cash not_a_number USD
  Expenses:Food
2024-01-03 * "Third"
  Assets:Cash 50 USD
  Expenses:Food
"""
    (ledger, errors) = parse_with_recovery(source)
    assert ledger.transactions().count() == 2  // First and third recovered
    assert errors.len() >= 1
Location Accuracy Tests¶
test_error_location():
    source = "2024-01-01 * \"Test\"\n  Invalid:Account 100 USD\n"
    (_, errors) = parse_with_recovery(source)
    assert errors[0].location.line == 2
    assert errors[0].location.column == 3  // Two leading spaces, 1-based column
LSP Integration¶
For editor support, convert errors to LSP diagnostics:
Diagnostic {
    range: Range, // 0-based line/column
    severity: Severity,
    code: String,
    message: String,
    related: Vec<RelatedInformation>,
}

Range {
    start: Position,
    end: Position,
}

Position {
    line: u32,      // 0-based for LSP
    character: u32, // UTF-16 code units
}
Note: LSP uses 0-based line numbers and UTF-16 character offsets.
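Converting a byte offset into an LSP position therefore means locating the line as before, then counting UTF-16 code units from the line start. A sketch reusing line_starts from earlier (the offset is assumed to lie on a character boundary):

fn lsp_position(source: &str, line_starts: &[usize], offset: usize) -> (u32, u32) {
    // 0-based line containing the offset.
    let line = line_starts.partition_point(|&s| s <= offset) - 1;
    // Character offset in UTF-16 code units, as LSP requires; this
    // differs from both byte and char counts for non-BMP text.
    let character: usize = source[line_starts[line]..offset]
        .chars()
        .map(char::len_utf16)
        .sum();
    (line as u32, character as u32)
}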