
String Type

This document specifies the string type used for text values in plain text accounting.

Definition

A String is a sequence of Unicode code points representing text. Strings are used for narrations, payees, metadata values, account names, and other textual data.

Encoding

Source File Encoding

Source files MUST be encoded in UTF-8:

Content-Type: text/plain; charset=utf-8

Implementations MUST reject files with invalid UTF-8 sequences.
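
The rejection rule can be sketched with Python's strict UTF-8 decoder. This is only an illustration; the function name `decode_source` and the choice of `ValueError` are assumptions, not part of this specification.

```python
def decode_source(data: bytes) -> str:
    """Decode ledger source bytes, rejecting any invalid UTF-8 sequence."""
    try:
        # errors="strict" raises instead of silently substituting U+FFFD
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first offending byte
        raise ValueError(
            f"Invalid UTF-8 sequence at byte offset {exc.start}"
        ) from exc
```

A lenient mode such as `errors="replace"` would accept malformed input, which the MUST above rules out.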

BOM Handling

A UTF-8 Byte Order Mark (BOM) at the start of a file SHOULD be ignored:

EF BB BF ... (UTF-8 BOM)

Implementations MUST NOT require a BOM and MUST NOT include a BOM in output.
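
A minimal sketch of both BOM rules, assuming the input is handled as raw bytes before decoding. The helper names `strip_bom` and `encode_output` are hypothetical.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; never require one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def encode_output(text: str) -> bytes:
    """Encode output as UTF-8; unlike "utf-8-sig", this codec prepends no BOM."""
    return text.encode("utf-8")
```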

String Literals

Quoted Strings

String literals are enclosed in double quotes:

"Hello, World"
"Monthly salary"
"Café au lait"

Escape Sequences

Standard escape sequences:

Escape   Meaning
\"       Double quote
\\       Backslash
\n       Newline
\t       Tab
\r       Carriage return

Unicode Escapes

Unicode code points can be escaped:

Format       Example      Character
\uXXXX       \u00E9       é (U+00E9)
\UXXXXXXXX   \U0001F4B0   💰 (U+1F4B0)
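
The standard and Unicode escapes above could be expanded roughly as follows. This is a sketch, not a reference implementation: the function name `unescape` is hypothetical, it operates on the text between the quotes, and it assumes that any unlisted escape is an error. It does not address surrogate code points, which this section leaves unspecified.

```python
import re

SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "n": "\n", "t": "\t", "r": "\r"}

# One escape: \uXXXX, \UXXXXXXXX, or a backslash followed by at most one character.
_ESCAPE_RE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.?))", re.DOTALL
)


def unescape(body: str) -> str:
    """Expand escape sequences in the body of a quoted string literal."""

    def _replace(match: "re.Match") -> str:
        hex4, hex8, simple = match.groups()
        if hex4 or hex8:
            code = int(hex4 or hex8, 16)
            if code > 0x10FFFF:
                raise ValueError(f"Code point out of range: U+{code:X}")
            return chr(code)
        if simple in SIMPLE_ESCAPES:
            return SIMPLE_ESCAPES[simple]
        # Covers unknown escapes, malformed \u or \U, and a trailing backslash
        raise ValueError(f"Invalid escape sequence: \\{simple}")

    return _ESCAPE_RE.sub(_replace, body)
```

A single regular-expression pass keeps `\\n` (an escaped backslash followed by the letter n) from being misread as a newline, which chained `str.replace` calls would get wrong.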

Raw Strings

Some formats support raw strings without escape processing:

r"C:\Users\name\file.txt"

Character Classes

Allowed Characters

String literals MAY contain:

  • Printable ASCII (0x20-0x7E)
  • Non-ASCII Unicode (U+0080 and above)
  • Escaped control characters

Prohibited Characters

Unescaped control characters (0x00-0x1F, 0x7F) are prohibited; they may appear only as escape sequences:

"Hello\x00World"  ; Invalid: raw NUL byte inside the literal (shown here as \x00)
"Line1\nLine2"    ; Valid: escaped newline

Whitespace

Character               Allowed        Notes
Space (0x20)            Yes
Tab (0x09)              Escaped only   \t
Newline (0x0A)          Escaped only   \n
Carriage return (0x0D)  Escaped only   \r
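
A check for raw control characters might look like the sketch below. It assumes the check runs on the literal's body before escape processing, so `\n` is still the two characters backslash and n at that point. The function name `find_unescaped_control` is hypothetical.

```python
from typing import Optional


def find_unescaped_control(body: str) -> Optional[int]:
    """Return the index of the first raw control character, or None if clean."""
    for index, ch in enumerate(body):
        code = ord(ch)
        # 0x00-0x1F and 0x7F may appear only as escapes such as \n or \t
        if code <= 0x1F or code == 0x7F:
            return index
    return None
```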

Identifiers

Unquoted Identifiers

Some string contexts allow unquoted identifiers:

Assets:Checking      ; Account name (unquoted)
USD                  ; Commodity (unquoted)

Identifier Rules

Context       Allowed Characters   Case
Account       A-Za-z0-9:_-         Sensitive
Commodity     A-Z0-9'._-           Uppercase (Beancount)
Metadata key  a-z0-9_-             Lowercase
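
The character sets in the table translate directly into regular-expression character classes. The sketch below checks characters only; structural rules (for example where `:` may appear in an account name) are not stated here and are not enforced. The context keys and the function name are hypothetical.

```python
import re

# Character classes copied from the table above, one per context.
IDENTIFIER_PATTERNS = {
    "account": re.compile(r"[A-Za-z0-9:_-]+"),
    "commodity": re.compile(r"[A-Z0-9'._-]+"),
    "metadata_key": re.compile(r"[a-z0-9_-]+"),
}


def is_valid_identifier(context: str, text: str) -> bool:
    """True if text is non-empty and uses only characters allowed in context."""
    return IDENTIFIER_PATTERNS[context].fullmatch(text) is not None
```

For instance, `is_valid_identifier("account", "Account Name")` is false because of the space, which is the case where quoting becomes necessary.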

Quoting Requirement

Identifiers with special characters require quoting:

"Account Name"       ; Space requires quotes
"US Dollar"          ; Multi-word
"S&P 500"            ; Special characters

String Operations

Concatenation

Strings can be concatenated:

"Hello" + " " + "World" = "Hello World"

Comparison

Strings are compared byte for byte after both operands have been normalized:

"café" == "café"      # True after normalization, even if one side was decomposed
"CAFÉ" == "café"      # False (case-sensitive)

Length

Length is measured in:

Unit         Description
Code points  Unicode characters
Bytes        UTF-8 encoded bytes
Graphemes    User-perceived characters

Implementations SHOULD use code points for length calculations.
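
To make the difference between the units concrete, the sketch below measures the same visible word in two spellings. The helper names are hypothetical. Grapheme counting is omitted because it needs Unicode text segmentation, which the Python standard library does not provide.

```python
def length_in_code_points(text: str) -> int:
    """Python's len() on str already counts code points."""
    return len(text)


def length_in_bytes(text: str) -> int:
    """Size of the UTF-8 encoding."""
    return len(text.encode("utf-8"))


composed = "caf\u00E9"     # é as one precomposed code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

print(length_in_code_points(composed), length_in_bytes(composed))      # 4 5
print(length_in_code_points(decomposed), length_in_bytes(decomposed))  # 5 6
```

Both spellings are four user-perceived characters, so the code point count depends on whether the text was normalized first.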

Normalization

Unicode Normalization

Implementations SHOULD normalize strings to NFC (Canonical Decomposition, followed by Canonical Composition):

é (U+00E9)           ; Precomposed
e + ́ (U+0065 U+0301) ; Decomposed

After NFC: é (U+00E9)
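
In Python the same normalization is available through the standard `unicodedata` module. The wrapper name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00E9"       # é
decomposed = "\u0065\u0301"  # e + combining acute accent

# The raw strings differ, but their NFC forms are identical.
assert precomposed != decomposed
assert to_nfc(decomposed) == precomposed
```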

Case Normalization

Case handling depends on context:

Context          Case Handling
Account names    Preserved (case-sensitive)
Commodities      Uppercase (Beancount) or preserved
Metadata keys    Lowercase
Metadata values  Preserved
Narrations       Preserved
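
One possible reading of the table is as a folding rule applied per context. The sketch below makes that assumption explicit. An implementation could equally treat "Uppercase" and "Lowercase" as validation rules and reject non-conforming input instead of converting it. The context names and the function name are hypothetical.

```python
def normalize_case(context: str, text: str) -> str:
    """Apply the case handling for a context, read here as case folding."""
    if context == "commodity":
        # Beancount-style handling; other formats may preserve case
        return text.upper()
    if context == "metadata_key":
        return text.lower()
    # Account names, metadata values, and narrations are preserved
    return text
```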

Empty Strings

Empty vs. Null

An empty string is a valid value, distinct from a null or missing value:

""                   ; Empty string (present but empty)
; vs.
; (field omitted)    ; Null/missing

Empty Payee

An empty payee is distinct from no payee:

2024-01-15 * "" "Narration"     ; Empty payee
2024-01-15 * "Narration"        ; No payee
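
In a typed model the distinction is commonly carried by an optional field. The sketch below assumes a transaction header yields one or two quoted strings, with the payee first when two are present, as in the examples above. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TransactionText:
    narration: str
    payee: Optional[str] = None  # None = no payee; "" = present but empty


def from_strings(strings: List[str]) -> TransactionText:
    """Map the quoted strings of a transaction header to payee and narration."""
    if len(strings) == 2:
        payee, narration = strings
        return TransactionText(narration=narration, payee=payee)
    if len(strings) == 1:
        return TransactionText(narration=strings[0])
    raise ValueError("expected one or two strings")
```

Code that tests `if txn.payee:` would conflate the two cases; `txn.payee is None` keeps them apart.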

Strings in Different Contexts

Narration

Free-form transaction description:

2024-01-15 * "Weekly grocery shopping at Whole Foods"

Payee

Transaction counterparty:

2024-01-15 * "Whole Foods" "Weekly groceries"

Metadata Value

Arbitrary string values:

  receipt: "receipts/2024/01/15-grocery.pdf"
  notes: "Bought items for office party"

Tag

Hash-prefixed identifier:

#project-2024
#tax-deductible

Link

Caret-prefixed identifier:

^invoice-001
^receipt-abc123
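
Hash-prefixed and caret-prefixed identifiers can be pulled out of a line with two patterns. This section does not state which characters may follow the prefix, so the set used below (letters, digits, `-`, `_`, `/`, `.`) is an assumption, as is the function name.

```python
import re
from typing import List, Tuple

# Assumed character set; not defined by this section.
_HASH_RE = re.compile(r"#([A-Za-z0-9_/.-]+)")
_CARET_RE = re.compile(r"\^([A-Za-z0-9_/.-]+)")


def extract_prefixed(text: str) -> Tuple[List[str], List[str]]:
    """Return (hash-prefixed names, caret-prefixed names) without the prefix."""
    return _HASH_RE.findall(text), _CARET_RE.findall(text)
```

This sketch scans the whole line, so it should be applied only to the part of a directive that follows the quoted strings; a `#` inside a narration would otherwise be picked up.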

Validation

Encoding Error

ERROR: Invalid UTF-8 sequence
  --> ledger.beancount:42:19
   |
42 |   payee: "Invalid \xFF byte"
   |                   ^^^^
   |
   = not a valid UTF-8 sequence
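
The line and column for such a diagnostic can be derived from the byte offset that Python's decoder reports. This is a sketch under two assumptions: positions are 1-based, and columns count code points rather than bytes. The function name `locate_invalid_utf8` is hypothetical.

```python
from typing import Optional, Tuple


def locate_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (line, column) of the first invalid byte, or None if data is valid."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        # Everything before exc.start decoded cleanly, so it is valid UTF-8
        prefix = data[:exc.start]
        line = prefix.count(b"\n") + 1
        line_start = prefix.rfind(b"\n") + 1  # 0 when no newline precedes
        column = len(prefix[line_start:].decode("utf-8")) + 1
        return line, column
```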

Unterminated String

ERROR: Unterminated string literal
  --> ledger.beancount:42:14
   |
42 | 2024-01-15 * "Missing end quote
   |              ^^^^^^^^^^^^^^^^^^
   |
   = expected closing '"'

Invalid Escape

ERROR: Invalid escape sequence
  --> ledger.beancount:42:18
   |
42 |   note: "Invalid \q escape"
   |                  ^^
   |
   = valid escapes: \\ \" \n \t \r \uXXXX \UXXXXXXXX

Implementation

Memory Model

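import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)  # eq=False: equality and hash are defined below
class String:
    value: str  # Unicode string

    def _nfc(self) -> str:
        return unicodedata.normalize('NFC', self.value)

    def __eq__(self, other: object) -> bool:
        # Normalize before comparison
        if not isinstance(other, String):
            return NotImplemented
        return self._nfc() == other._nfc()

    def __hash__(self) -> int:
        # Hash the normalized form so that equal strings hash alike
        return hash(self._nfc())

    def __len__(self) -> int:
        # Length in code points
        return len(self.value)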

Interning

Frequently-used strings SHOULD be interned:

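from typing import Dict


class StringInterner:
    def __init__(self) -> None:
        # Per-instance cache; a class-level dict would be shared by every interner
        self._cache: Dict[str, str] = {}

    def intern(self, s: str) -> str:
        # Return the canonical object for s, storing it on first sight
        return self._cache.setdefault(s, s)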

Cross-Format Notes

Feature           Beancount   Ledger        hledger
Encoding          UTF-8 only  Multiple      UTF-8
Quote char        "           "             "
Escapes           Yes         Yes           Yes
Raw strings       No          No            No
Case in accounts  Sensitive   Configurable  Sensitive