String Type¶

This document specifies the string type used for text values in plain text accounting.

Definition¶

A String is a sequence of Unicode code points representing text. Strings are used for narrations, payees, metadata values, account names, and other textual data.

Encoding¶

Source File Encoding¶

Source files MUST be encoded in UTF-8:

Content-Type: text/plain; charset=utf-8

Implementations MUST reject files with invalid UTF-8 sequences.

BOM Handling¶

UTF-8 Byte Order Mark (BOM) at file start SHOULD be ignored:

EF BB BF ... (UTF-8 BOM)

Implementations MUST NOT require BOM and MUST NOT include BOM in output.

String Literals¶

Quoted Strings¶

String literals are enclosed in double quotes:

"Hello, World"
"Monthly salary"
"Café au lait"

Escape Sequences¶

Standard escape sequences:

Escape	Meaning
`\"`	Double quote
`\\`	Backslash
`\n`	Newline
`\t`	Tab
`\r`	Carriage return

Unicode Escapes¶

Unicode code points can be escaped:

Format	Example	Character
`\uXXXX`	`\u00E9`	é (U+00E9)
`\UXXXXXXXX`	`\U0001F4B0`	💰 (U+1F4B0)

Raw Strings¶

Some formats support raw strings without escape processing:

r"C:\Users\name\file.txt"

Character Classes¶

Allowed Characters¶

String literals MAY contain:

Printable ASCII (0x20-0x7E)
Non-ASCII Unicode (U+0080 and above)
Escaped control characters

Prohibited Characters¶

Unescaped control characters (0x00-0x1F, 0x7F) are prohibited:

"Hello\x00World"  // Invalid: embedded NUL
"Line1\nLine2"    // Valid: escaped newline

Whitespace¶

Character	Allowed	Notes
Space (0x20)	Yes
Tab (0x09)	Escaped only	`\t`
Newline (0x0A)	Escaped only	`\n`
Carriage return (0x0D)	Escaped only	`\r`

Identifiers¶

Unquoted Identifiers¶

Some string contexts allow unquoted identifiers:

Assets:Checking      ; Account name (unquoted)
USD                  ; Commodity (unquoted)

Identifier Rules¶

Context	Allowed Characters	Case
Account	`A-Za-z0-9:_-`	Sensitive
Commodity	`A-Z0-9'._-`	Uppercase (Beancount)
Metadata key	`a-z0-9_-`	Lowercase

Quoting Requirement¶

Identifiers with special characters require quoting:

"Account Name"       ; Space requires quotes
"US Dollar"          ; Multi-word
"S&P 500"            ; Special characters

String Operations¶

Concatenation¶

Strings can be concatenated:

"Hello" + " " + "World" = "Hello World"

Comparison¶

String comparison is byte-for-byte after normalization:

"café" == "café"      # Depends on normalization
"CAFÉ" == "café"      # False (case-sensitive)

Length¶

Length is measured in:

Unit	Description
Code points	Unicode characters
Bytes	UTF-8 encoded bytes
Graphemes	User-perceived characters

Implementations SHOULD use code points for length calculations.

Normalization¶

Unicode Normalization¶

Implementations SHOULD normalize strings to NFC (Canonical Decomposition, followed by Canonical Composition):

é (U+00E9)           ; Precomposed
e + ́ (U+0065 U+0301) ; Decomposed

After NFC: é (U+00E9)

Case Normalization¶

Case handling depends on context:

Context	Case Handling
Account names	Preserved (case-sensitive)
Commodities	Uppercase (Beancount) or preserved
Metadata keys	Lowercase
Metadata values	Preserved
Narrations	Preserved

Empty Strings¶

Empty vs. Null¶

Empty string is a valid value, distinct from null/missing:

""                   ; Empty string (present but empty)
; vs.
; (field omitted)    ; Null/missing

Empty Payee¶

Empty payee is distinct from no payee:

2024-01-15 * "" "Narration"     ; Empty payee
2024-01-15 * "Narration"        ; No payee

String in Different Contexts¶

Narration¶

Free-form transaction description:

2024-01-15 * "Weekly grocery shopping at Whole Foods"

Payee¶

Transaction counterparty:

2024-01-15 * "Whole Foods" "Weekly groceries"

Metadata Value¶

Arbitrary string values:

  receipt: "receipts/2024/01/15-grocery.pdf"
  notes: "Bought items for office party"

Tag¶

Hash-prefixed identifier:

#project-2024
#tax-deductible

Link¶

Caret-prefixed identifier:

^invoice-001
^receipt-abc123

Validation¶

Encoding Error¶

ERROR: Invalid UTF-8 sequence
  --> ledger.beancount:42:15
   |
42 |   payee: "Invalid \xFF byte"
   |                    ^^^^
   |
   = not a valid UTF-8 sequence

Unterminated String¶

ERROR: Unterminated string literal
  --> ledger.beancount:42:8
   |
42 | 2024-01-15 * "Missing end quote
   |              ^^^^^^^^^^^^^^^^^^
   |
   = expected closing '"'

Invalid Escape¶

ERROR: Invalid escape sequence
  --> ledger.beancount:42:15
   |
42 |   note: "Invalid \q escape"
   |                   ^^
   |
   = valid escapes: \\ \" \n \t \r \uXXXX

Implementation¶

Memory Model¶

@dataclass(frozen=True)
class String:
    value: str  # Unicode string

    def __eq__(self, other: 'String') -> bool:
        # Normalize before comparison
        return unicodedata.normalize('NFC', self.value) == \
               unicodedata.normalize('NFC', other.value)

    def __len__(self) -> int:
        # Length in code points
        return len(self.value)

Interning¶

Frequently-used strings SHOULD be interned:

class StringInterner:
    _cache: Dict[str, str] = {}

    def intern(self, s: str) -> str:
        if s not in self._cache:
            self._cache[s] = s
        return self._cache[s]

Cross-Format Notes¶

Feature	Beancount	Ledger	hledger
Encoding	UTF-8 only	Multiple	UTF-8
Quote char	`"`	`"`	`"`
Escapes	Yes	Yes	Yes
Raw strings	No	No	No
Case in accounts	Sensitive	Configurable	Sensitive