Encoding Definitions
Plain¶
No compression is applied to the data. Depending on the data type, values are stored in the following formats:
- Boolean: Least-significant bit ordering within bytes
- Integer: Little-Endian byte order
- Long: Little-Endian byte order
- Float: IEEE754 floating-point numbers in Little-Endian byte order
- Double: IEEE754 floating-point numbers in Little-Endian byte order
- String: Length and data streams
Boolean-RLE¶
This encoding compresses boolean columns using least-significant bit numbering (bit-endianness). Refer to the ORC specification for implementation details.
Byte-RLE¶
This encoding compresses byte streams, such as the GeometryType
stream in Geometry columns.
Refer to the ORC specification for implementation details.
Dictionary Encoding¶
Dictionary encoding compactly represents repeated values and can be applied to String
and Geometry
columns.
Distinct values are stored in a dictionary
stream, while a separate data
stream stores indices into the dictionary.
We allow the following encodings of this concept:
String Dictionary Encoding¶
The dictionary stream contains distinct UTF-8 encoded string values.
The data stream contains dictionary indices encoded as UInt32
.
A dictionary-encoded nullable string column consists of the following streams in order:
Present
, Length
, Dictionary
, Data
Note
TODO: Is "can" (=> MAY) here the right terminoligy? This implies that streams might be encoded by one of the options below, but also other options.
If yes, what is the alternative options? If no, clarify this misunderstanding.
All streams can be further compressed using lightweight encodings:
- Present Stream: Boolean-RLE
- Length and Data Stream: See Integer encodings
- Dictionary: See FSST Dictionary
FSST Dictionary Encoding¶
Dictionary encoding requires fully repeating strings to reduce size. However, geospatial attributes often contain strings with common prefixes that are not identical (e.g., localized country names).
FSST replaces frequently occurring substrings while supporting efficient scans and random lookups. It further compresses UTF-8 encoded dictionary values in MLT.
An FSST dictionary-encoded nullable string column consists of the following streams in order:
Present
, SymbolLength
, SymbolTable
, String Length
, Dictionary
(Compressed Corpus)
For implementation details, refer to this paper.
Available implementations:
- C++: https://github.com/cwida/fsst
- Java: Work in progress
- JS/WebAssembly decoder: Work in progress (simple to implement)
Shared Dictionary Encoding¶
Shared dictionary encoding allows multiple columns to share a common dictionary.
Nested fields can also use a shared dictionary (e.g., for localized values like name:*
columns in an OSM dataset).
If used for nested fields, all fields sharing the dictionary must be grouped together in the file and prefixed with the dictionary.
A shared dictionary-encoded nullable string column consists of the following streams in order:
Length
, Dictionary
, Present1
, Data1
, Present2
, Data2
A shared FSST dictionary-encoded nullable string column consists of the following streams in order:
SymbolLength
, SymbolTable
, String Length
, Dictionary
(Compressed Corpus), Present1
, Data1
, Present2
, Data2
Vertex Dictionary Encoding¶
Uses an additional VertexOffsets
stream to store indices for vertex coordinates in the VertexBuffer
stream.
Vertices in the VertexBuffer are sorted using a Hilbert curve and delta-encoded with null suppression.
Morton Vertex Dictionary Encoding¶
VertexBuffer
coordinates are transformed into a 1D integer using Morton code.
The data is sorted by Morton code and further compressed using integer compression techniques.
Integer Encodings¶
Most data in MLT is stored as integer arrays, making efficient integer compression crucial.
Logical-Level Techniques¶
Integers are encoded using delta encoding, run-length encoding (RLE), or a combination of both (delta-RLE).
Delta Encoding¶
Computes differences between consecutive elements (e.g., x₂ - x₁, x₃ - x₂, ...). Used with physical-level techniques to reduce the number of bits required for storing delta values.
Run-Length Encoding (RLE)¶
Refer to Wikipedia for a basic explanation. Runs and values are stored in separate buffers. For unsigned integers, ZigZag encoding is applied to the values buffer. Both buffers are further compressed using null suppression.
Delta-RLE¶
Applies delta encoding followed by RLE.
Efficient for ascending sequences like id
fields.
Physical-Level Techniques¶
Null suppression techniques are used to compress integer arrays by reducing the number of bits required to store each integer.
ZigZag Encoding¶
Used for encoding unsigned integers in null suppression techniques. ZigZag encoding uses the least significant bit to represent the sign.
Varint Encoding¶
A byte-aligned null suppression technique that compresses integers using a minimal number of bytes. For implementation details, refer to Protobuf.
SIMD-FastPFOR¶
A bit-aligned null suppression technique that compresses integers using a minimal number of bits. Uses a patched approach to store exceptions (outliers) separately, keeping the overall bit width small.
Refer to: - https://arxiv.org/pdf/1209.2137.pdf - https://ayende.com/blog/199524-C/integer-compression-the-fastpfor-code
Available implementations:
- C++: https://github.com/fast-pack/FastPFOR
- Java: https://github.com/fast-pack/JavaFastPFOR
- C#: https://github.com/Genbox/CSharpFastPFOR
- JS/WebAssembly: Work in progress (higher implementation complexity)
Float Encodings¶
Adaptive Lossless Floating-Point Compression (ALP)¶
A lossless floating-point compression technique used in MLT. For a detailed explanation, refer to this paper.
Available implementations:
- C++: https://github.com/duckdb/duckdb/tree/16830ded9b4882b90f35c2d7e2d740547ae7d3fc/src/include/duckdb/storage/compression/alp
- Java: Work in progress
- JS/WebAssembly: Work in progress