Unicode Support¶

This document specifies Unicode handling in plain text accounting systems.

Overview¶

Plain text accounting files are Unicode text documents. This specification defines encoding requirements, normalization rules, and character handling.

Encoding¶

UTF-8 Requirement¶

Source files MUST be encoded in UTF-8:

UTF-8 is the only supported encoding
Other encodings (UTF-16, ISO-8859-1, etc.) MUST be rejected
Implementations MUST validate UTF-8 sequences

UTF-8 Validation¶

Invalid UTF-8 sequences MUST produce errors:

ERROR: Invalid UTF-8 sequence
  --> ledger.beancount:42:15
   |
42 |   payee: "Invalid <0xFF> byte"
   |                   ^^^^^^
   |
   = byte 0xFF is not valid in UTF-8
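
A diagnostic like the one above needs a line and column for the offending byte. The following is a minimal sketch of one way to derive them in Python; the function name `locate_invalid_utf8` and the choice to count columns in code points are assumptions for illustration, not requirements of this specification.

```python
def locate_invalid_utf8(data: bytes):
    """Return (line, column, byte_value) for the first invalid byte, or None.

    Line and column are 1-based. The column counts code points in the valid
    part of the offending line (an assumption made for this sketch).
    """
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        prefix = data[:exc.start]
        line = prefix.count(b"\n") + 1
        line_start = prefix.rfind(b"\n") + 1  # 0 when there is no earlier newline
        # Everything before exc.start decoded cleanly, so this decode is safe.
        column = len(prefix[line_start:].decode("utf-8")) + 1
        return line, column, data[exc.start]
```

For example, `locate_invalid_utf8(b'ok\n  payee: "\xff"')` reports line 2, column 11, and the byte value `0xFF`.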

Byte Order Mark¶

The UTF-8 BOM (EF BB BF) at file start:

MUST be accepted (ignored)
MUST NOT be required
SHOULD NOT be generated in output
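
As an illustration of the first two rules, a reader can drop a leading BOM before decoding and leave BOM-free input unchanged. This is a sketch only; `strip_bom` is a hypothetical helper name.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Drop a single leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Python's built-in `utf-8-sig` codec does the same thing while decoding: `data.decode("utf-8-sig")` skips a leading BOM and otherwise behaves like `utf-8`.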

Character Categories¶

Allowed Characters¶

Category	Code Points	Example
Basic Latin	U+0020-U+007E	A-Z, 0-9, punctuation
Latin-1 Supplement and Latin Extended-A/B	U+00A0-U+024F	é, ñ, ü
Greek	U+0370-U+03FF	α, β, γ
Cyrillic	U+0400-U+04FF	А, Б, В
CJK	U+4E00-U+9FFF	日, 本, 語
Currency Symbols	U+20A0-U+20CF	€, ₹, ₽
Emoji	Various	💰, 📊

Prohibited Characters¶

Category	Code Points	Reason
Control (C0)	U+0000-U+001F	Non-printable (except Tab, LF, and CR; see Whitespace)
Delete	U+007F	Control character
Control (C1)	U+0080-U+009F	Non-printable
Surrogates	U+D800-U+DFFF	UTF-16 only
Noncharacters	U+FDD0-U+FDEF	Reserved
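
The ranges in this table translate directly into a code point check. The sketch below assumes that Tab, LF, and CR stay permitted, as the Whitespace table describes; the names `is_prohibited` and `find_prohibited` are illustrative.

```python
# Tab, LF, and CR are C0 controls that the Whitespace table permits.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_prohibited(ch: str) -> bool:
    """Return True if the code point falls in a prohibited range."""
    cp = ord(ch)
    if cp in ALLOWED_CONTROLS:
        return False
    return (
        cp <= 0x1F                    # C0 controls
        or cp == 0x7F                 # Delete
        or 0x80 <= cp <= 0x9F         # C1 controls
        or 0xD800 <= cp <= 0xDFFF     # surrogates
        or 0xFDD0 <= cp <= 0xFDEF     # noncharacters
    )


def find_prohibited(text: str):
    """List (index, 'U+XXXX') for every prohibited code point in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_prohibited(ch)]
```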

Whitespace¶

Character	Code Point	Handling
Space	U+0020	Standard separator
Tab	U+0009	Indent/alignment
Newline	U+000A	Line terminator
Carriage Return	U+000D	Ignored before LF; a lone CR is converted to LF
No-Break Space	U+00A0	Treated as space

Normalization¶

NFC Normalization¶

Implementations SHOULD normalize text to NFC (Canonical Decomposition, followed by Canonical Composition):

Input: "café" (e + combining acute)
       U+0063 U+0061 U+0066 U+0065 U+0301

NFC:   "café" (precomposed é)
       U+0063 U+0061 U+0066 U+00E9
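
The same transformation can be reproduced with Python's standard `unicodedata` module. This is only a demonstration of the example above; the helper `code_points` is introduced here for display purposes.

```python
import unicodedata

decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)


def code_points(text: str):
    """Render each code point in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]
```

Here `code_points(decomposed)` has five entries ending in `U+0065`, `U+0301`, while `code_points(composed)` has four entries ending in `U+00E9`.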

When to Normalize¶

Context	Normalization
Comparison	MUST normalize before comparing
Storage	SHOULD store normalized form
Display	Preserve user input
Hash/Index	MUST use normalized form
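
One way to satisfy the Hash/Index and Display rows at the same time is to key a lookup table on the NFC form while keeping the spelling the user typed. The class below is a sketch under that assumption; `NormalizedIndex` is a hypothetical name.

```python
import unicodedata


class NormalizedIndex:
    """Maps NFC-normalized keys to the spelling first seen in the input."""

    def __init__(self) -> None:
        self._entries = {}

    def add(self, name: str) -> str:
        """Register a name and return the spelling stored for its NFC key."""
        key = unicodedata.normalize("NFC", name)
        return self._entries.setdefault(key, name)

    def __contains__(self, name: str) -> bool:
        return unicodedata.normalize("NFC", name) in self._entries

    def __len__(self) -> int:
        return len(self._entries)
```

Adding `"cafe\u0301"` and then `"caf\u00e9"` yields a single entry, and lookups succeed for either spelling.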

Normalization Forms¶

Form	Description	Use
NFC	Composed	Preferred for storage
NFD	Decomposed	Not recommended
NFKC	Compatibility composed	For search
NFKD	Compatibility decomposed	For search
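
The difference between the canonical and compatibility forms shows up with characters such as ligatures and fullwidth letters. The sketch below builds a search key from NFKC plus case folding; `search_key` is a hypothetical helper, and the result is meant for matching only, never for storage.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold compatibility variants and case for matching (an approximation)."""
    return unicodedata.normalize("NFKC", text).casefold()


ligature = "\ufb01nance"  # begins with U+FB01 LATIN SMALL LIGATURE FI
nfc_form = unicodedata.normalize("NFC", ligature)    # ligature is preserved
nfkc_form = unicodedata.normalize("NFKC", ligature)  # ligature becomes "fi"
```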

Case Handling¶

Case Sensitivity¶

Case handling depends on context:

Context	Case Handling
Account names	Case-sensitive
Commodity names	Uppercase only (Beancount) or case-sensitive
Metadata keys	Case-sensitive (typically lowercase)
Metadata values	Case-preserved
Tags	Case-sensitive

Case Folding¶

For case-insensitive comparison, use Unicode case folding:

# lower() leaves the German sharp s alone; casefold() expands it to "ss"
"Straße".lower() == "straße"      # True
"Straße".casefold() == "strasse"  # True

Grapheme Clusters¶

Extended Grapheme Clusters¶

A user-perceived "character" may be multiple code points:

👨‍👩‍👧 (Family emoji)
= U+1F468 U+200D U+1F469 U+200D U+1F467
= 5 code points, 1 grapheme cluster

Length Calculation¶

Method	"café" (NFC)	"👨‍👩‍👧"
Code points	4	5
Grapheme clusters	4	1
UTF-8 bytes	5	18

Implementations SHOULD use code points for internal length calculations.
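
The code point and byte rows of the table can be checked with the Python standard library alone. Grapheme cluster counts need UAX #29 segmentation, which the standard library does not provide, so they are left out of this sketch; `lengths` is an illustrative name.

```python
def lengths(text: str) -> dict:
    """Return the code point count and UTF-8 byte count of text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


cafe = "caf\u00e9"  # NFC form
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man, ZWJ, woman, ZWJ, girl
```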

Identifiers¶

Account Name Characters¶

Valid characters in account names:

Letter:     A-Z a-z (Latin)
            Plus letters from other scripts (implementation-defined)
Digit:      0-9
Special:    - _ : (separator)

Commodity Name Characters¶

Beancount commodities (uppercase requirement):

Letter:     A-Z
Digit:      0-9
Special:    ' . _ -
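
The character set above can be expressed as a regular expression. This sketch checks only which characters appear; any positional rules (for example, what a name may start or end with) or length limits are not stated here and are deliberately omitted. `has_commodity_charset` is a hypothetical name.

```python
import re

# Uppercase ASCII letters, ASCII digits, and the four special characters.
_COMMODITY_CHARS = re.compile(r"[A-Z0-9'._-]+")


def has_commodity_charset(name: str) -> bool:
    """Return True if name is non-empty and uses only the listed characters."""
    return _COMMODITY_CHARS.fullmatch(name) is not None
```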

Unicode Letters¶

For "letter" matching, use Unicode category:

import unicodedata

def is_letter(char):
    return unicodedata.category(char).startswith('L')

is_letter('A')  # True (Latin)
is_letter('日') # True (CJK)
is_letter('1')  # False (digit)
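
Combining the category test with the account name character set gives a simple validator. The sketch below assumes that letters from every script are accepted (the specification leaves this implementation-defined) and that components between `:` separators must be non-empty; it does not enforce format-specific rules such as capitalization.

```python
import unicodedata

ASCII_DIGITS = "0123456789"
SPECIALS = "-_"


def is_valid_account_name(name: str) -> bool:
    """Check each ':'-separated component against the allowed characters."""
    for component in name.split(":"):
        if not component:
            return False  # empty component, e.g. "Assets::Bank"
        for ch in component:
            is_letter = unicodedata.category(ch).startswith("L")
            if not (is_letter or ch in ASCII_DIGITS or ch in SPECIALS):
                return False
    return True
```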

Line Handling¶

Line Terminators¶

Accepted line terminators:

Sequence	Name	Handling
U+000A	LF	Standard
U+000D U+000A	CRLF	Treated as single LF
U+000D	CR alone	Converted to LF
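
The table can be implemented as two replacements applied in order, so that CRLF is collapsed before any remaining lone CR is converted. `normalize_newlines` is an illustrative name.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR to LF; the order of the two calls matters."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Note that Python's `str.splitlines()` is not an equivalent shortcut, because it also splits on other separators such as U+000B, U+000C, and U+2028.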

Line Continuation¶

No implicit line continuation. Multi-line constructs use explicit syntax.

Maximum Line Length¶

Implementations SHOULD support lines up to 10,000 code points. Lines exceeding this MAY be rejected or truncated.

String Literals¶

Escape Sequences¶

Escape	Code Point	Character
`\n`	U+000A	Newline
`\t`	U+0009	Tab
`\r`	U+000D	Carriage return
`\\`	U+005C	Backslash
`\"`	U+0022	Double quote
`\uXXXX`	U+XXXX	BMP character
`\UXXXXXXXX`	U+XXXXXXXX	Any character (up to U+10FFFF)
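
A decoder for this table can be sketched with a single regular expression. The version below assumes that unknown escapes, a trailing backslash, and escapes naming surrogates or values above U+10FFFF are errors; the specification does not state this, and `unescape` is a hypothetical name.

```python
import re

_SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", '"': '"'}
# A backslash followed by \uXXXX, \UXXXXXXXX, any single character, or nothing.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.?)", re.DOTALL)


def unescape(body: str) -> str:
    """Decode the escape sequences in the body of a string literal."""

    def replace(match):
        token = match.group(1)
        if len(token) > 1:  # \uXXXX or \UXXXXXXXX
            cp = int(token[1:], 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"escape names an invalid code point: \\{token}")
            return chr(cp)
        if token in _SIMPLE:
            return _SIMPLE[token]
        raise ValueError(f"invalid escape sequence: \\{token}")

    return _ESCAPE.sub(replace, body)
```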

Non-BMP Characters¶

Characters outside the Basic Multilingual Plane (U+10000+):

"💰" = "\U0001F4B0"  ; Money bag emoji

; In UTF-16 (for reference, not used in PTA):
; U+D83D U+DCB0 (surrogate pair)
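
For reference, the surrogate pair shown above follows from the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The helper name `to_surrogate_pair` is illustrative.

```python
def to_surrogate_pair(code_point: int):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


money_bag_utf8 = "\U0001F4B0".encode("utf-8")  # four bytes: F0 9F 92 B0
```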

Bidirectional Text¶

Right-to-Left Scripts¶

Hebrew, Arabic, and other RTL scripts are supported:

2024-01-15 * "קניה"  ; Hebrew
  הוצאות:מזון  50 ILS
  נכסים:בנק

Bidirectional Algorithm¶

Display follows the Unicode Bidirectional Algorithm (UAX #9). Storage is always in logical order.

Explicit Direction¶

Directional formatting characters (LRM, RLM, etc.) MAY be used but are not recommended.
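
A tool that wants to warn about directional formatting characters can scan for them explicitly. The set below covers the marks, embeddings, overrides, and isolates defined by UAX #9; the warning policy itself and the name `find_bidi_controls` are assumptions for illustration.

```python
import unicodedata

BIDI_CONTROLS = {
    0x061C,                    # ARABIC LETTER MARK
    0x200E, 0x200F,            # LRM, RLM
    *range(0x202A, 0x202F),    # LRE, RLE, PDF, LRO, RLO
    *range(0x2066, 0x206A),    # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text: str):
    """List (index, character name) for each directional formatting character."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]
```

Ordinary right-to-left letters are not reported; only the invisible formatting characters are.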

Implementation Notes¶

String Storage¶

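from dataclasses import dataclass
import unicodedata


@dataclass(frozen=True)
class UnicodeString:
    # Python str is a sequence of code points (CPython stores Latin-1, UCS-2, or UCS-4)
    _data: str

    def normalize(self) -> 'UnicodeString':
        """Return a copy normalized to NFC."""
        return UnicodeString(unicodedata.normalize('NFC', self._data))

    def code_points(self) -> int:
        """Number of Unicode code points."""
        return len(self._data)

    def graphemes(self) -> int:
        """Number of extended grapheme clusters."""
        import grapheme  # third-party package, imported lazily
        return grapheme.length(self._data)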

Comparison¶

def equals(a: str, b: str) -> bool:
    import unicodedata
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

Validation¶

class ParseError(Exception):
    """Raised when a source file cannot be parsed."""


def validate_utf8(data: bytes) -> str:
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError as e:
        raise ParseError(f"Invalid UTF-8 at byte {e.start}") from e

Error Messages¶

Invalid Encoding¶

ERROR: Invalid UTF-8 encoding
  --> ledger.beancount:1:1
   |
   = file is not valid UTF-8
   = hint: convert using 'iconv -f <encoding> -t utf-8'

Prohibited Character¶

ERROR: Prohibited control character
  --> ledger.beancount:42:15
   |
42 |   note: "Text with <NUL> inside"
   |                     ^^^^^
   |
   = U+0000 (NULL) is not allowed

Cross-Format Notes¶

Feature	Beancount	Ledger	hledger
Encoding	UTF-8 only	Multiple	UTF-8
Normalization	NFC	None	None
Case in accounts	Sensitive	Configurable	Sensitive
Unicode escapes	Yes	Limited	Yes
RTL support	Limited	Limited	Limited