Skip to content

Unicode Support

This document specifies Unicode handling in plain text accounting systems.

Overview

Plain text accounting files are Unicode text documents. This specification defines encoding requirements, normalization rules, and character handling.

Encoding

UTF-8 Requirement

Source files MUST be encoded in UTF-8:

  • UTF-8 is the only supported encoding
  • Other encodings (UTF-16, ISO-8859-1, etc.) MUST be rejected
  • Implementations MUST validate UTF-8 sequences

UTF-8 Validation

Invalid UTF-8 sequences MUST produce errors:

ERROR: Invalid UTF-8 sequence
  --> ledger.beancount:42:15
   |
42 |   payee: "Invalid <0xFF> byte"
   |                   ^^^^^^
   |
   = byte 0xFF is not valid in UTF-8

Byte Order Mark

The UTF-8 BOM (EF BB BF) at file start:

  • MUST be accepted (ignored)
  • MUST NOT be required
  • SHOULD NOT be generated in output

Character Categories

Allowed Characters

Category Code Points Example
Basic Latin U+0020-U+007E A-Z, 0-9, punctuation
Latin Extended U+00A0-U+024F é, ñ, ü
Greek U+0370-U+03FF α, β, γ
Cyrillic U+0400-U+04FF А, Б, В
CJK U+4E00-U+9FFF 日, 本, 語
Currency Symbols U+20A0-U+20CF €, £, ¥
Emoji Various 💰, 📊

Prohibited Characters

Category Code Points Reason
Control (C0) U+0000-U+001F Non-printable
Delete U+007F Control character
Control (C1) U+0080-U+009F Non-printable
Surrogates U+D800-U+DFFF UTF-16 only
Noncharacters U+FDD0-U+FDEF Reserved

Whitespace

Character Code Point Handling
Space U+0020 Standard separator
Tab U+0009 Indent/alignment
Newline U+000A Line terminator
Carriage Return U+000D Ignored (with LF)
No-Break Space U+00A0 Treated as space

Normalization

NFC Normalization

Implementations SHOULD normalize text to NFC (Canonical Decomposition, followed by Canonical Composition):

Input: "café" (e + combining acute)
       U+0063 U+0061 U+0066 U+0065 U+0301

NFC:   "café" (precomposed é)
       U+0063 U+0061 U+0066 U+00E9

When to Normalize

Context Normalization
Comparison MUST normalize before comparing
Storage SHOULD store normalized form
Display Preserve user input
Hash/Index MUST use normalized form

Normalization Forms

Form Description Use
NFC Composed Preferred for storage
NFD Decomposed Not recommended
NFKC Compatibility composed For search
NFKD Compatibility decomposed For search

Case Handling

Case Sensitivity

Context-dependent case handling:

Context Case Handling
Account names Case-sensitive
Commodity names Uppercase (Beancount) or sensitive
Metadata keys Case-sensitive (typically lowercase)
Metadata values Case-preserved
Tags Case-sensitive

Case Folding

For case-insensitive comparison, use Unicode case folding:

# Simple lowercase is insufficient for some scripts
"Straße".casefold() == "strasse"  # German sharp s

Grapheme Clusters

Extended Grapheme Clusters

A user-perceived "character" may be multiple code points:

👨‍👩‍👧 (Family emoji)
= U+1F468 U+200D U+1F469 U+200D U+1F467
= 5 code points, 1 grapheme cluster

Length Calculation

Method "café" "👨‍👩‍👧"
Code points 4 5
Grapheme clusters 4 1
UTF-8 bytes 5 18

Implementations SHOULD use code points for internal length calculations.

Identifiers

Account Name Characters

Valid characters in account names:

Letter:     A-Z a-z (Latin)
            Plus letters from other scripts (implementation-defined)
Digit:      0-9
Special:    - _ : (separator)

Commodity Name Characters

Beancount commodities (uppercase requirement):

Letter:     A-Z
Digit:      0-9
Special:    ' . _ -

Unicode Letters

For "letter" matching, use Unicode category:

import unicodedata

def is_letter(char):
    return unicodedata.category(char).startswith('L')

is_letter('A')  # True (Latin)
is_letter('日') # True (CJK)
is_letter('1')  # False (digit)

Line Handling

Line Terminators

Accepted line terminators:

Sequence Name Handling
U+000A LF Standard
U+000D U+000A CRLF Treated as single LF
U+000D CR alone Converted to LF

Line Continuation

No implicit line continuation. Multi-line constructs use explicit syntax.

Maximum Line Length

Implementations SHOULD support lines up to 10,000 characters. Lines exceeding this MAY be rejected or truncated.

String Literals

Escape Sequences

Escape Code Point Character
\n U+000A Newline
\t U+0009 Tab
\r U+000D Carriage return
\\ U+005C Backslash
\" U+0022 Double quote
\uXXXX U+XXXX BMP character
\UXXXXXXXX U+XXXXXXXX Any character

Non-BMP Characters

Characters outside the Basic Multilingual Plane (U+10000+):

"💰" = "\U0001F4B0"  ; Money bag emoji

; In UTF-16 (for reference, not used in PTA):
; U+D83D U+DCB0 (surrogate pair)

Bidirectional Text

Right-to-Left Scripts

Hebrew, Arabic, and other RTL scripts are supported:

2024-01-15 * "קניה"  ; Hebrew
  הוצאות:מזון  50 ILS
  נכסים:בנק

Bidirectional Algorithm

Display follows Unicode Bidirectional Algorithm (UAX #9). Storage is always in logical order.

Explicit Direction

Directional formatting characters (LRM, RLM, etc.) MAY be used but are not recommended.

Implementation Notes

String Storage

class UnicodeString:
    _data: str  # Python str (UTF-8 or UTF-32 internal)

    def normalize(self) -> 'UnicodeString':
        import unicodedata
        return UnicodeString(unicodedata.normalize('NFC', self._data))

    def code_points(self) -> int:
        return len(self._data)

    def graphemes(self) -> int:
        import grapheme
        return grapheme.length(self._data)

Comparison

def equals(a: str, b: str) -> bool:
    import unicodedata
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

Validation

def validate_utf8(data: bytes) -> str:
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError as e:
        raise ParseError(f"Invalid UTF-8 at byte {e.start}")

Error Messages

Invalid Encoding

ERROR: Invalid UTF-8 encoding
  --> ledger.beancount:1:1
   |
   = file is not valid UTF-8
   = hint: convert using 'iconv -f <encoding> -t utf-8'

Prohibited Character

ERROR: Prohibited control character
  --> ledger.beancount:42:15
   |
42 |   note: "Text with <NUL> inside"
   |                     ^^^^^
   |
   = U+0000 (NULL) is not allowed

Cross-Format Notes

Feature Beancount Ledger hledger
Encoding UTF-8 only Multiple UTF-8
Normalization NFC None None
Case in accounts Sensitive Configurable Sensitive
Unicode escapes Yes Limited Yes
RTL support Limited Limited Limited