arrow-commits mailing list archives

Subject arrow git commit: ARROW-986: [Format] Add brief explanation of dictionary batches in
Date Mon, 05 Jun 2017 10:20:39 GMT
Repository: arrow
Updated Branches:
  refs/heads/master 8f2b44b89 -> a44155d6e

ARROW-986: [Format] Add brief explanation of dictionary batches in

Author: Wes McKinney <>

Closes #732 from wesm/ARROW-986 and squashes the following commits:

4321106 [Wes McKinney] Add brief explanation of dictionary batches in


Branch: refs/heads/master
Commit: a44155d6ec5d0c6c255d3305a494f51a6b1d2316
Parents: 8f2b44b
Author: Wes McKinney <>
Authored: Mon Jun 5 12:20:35 2017 +0200
Committer: Uwe L. Korn <>
Committed: Mon Jun 5 12:20:35 2017 +0200

 format/ | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/format/ b/format/
index bf2aaa7..7d68921 100644
--- a/format/
+++ b/format/
@@ -157,9 +157,24 @@ Some notes about this
 ### Dictionary Batches
-Dictionary batches have not yet been implemented, while they are provided for
-in the metadata. For the time being, the `DICTIONARY` segments shown above in
-the file do not appear in any of the file implementations.
+Dictionaries are written in the stream and file formats as a sequence of record
+batches, each having a single field. The complete semantic schema for a
+sequence of record batches, therefore, consists of the schema along with all of
+the dictionaries. The dictionary types are found in the schema, so it is
+necessary to read the schema to first determine the dictionary types so that
+the dictionaries can be properly interpreted.
+table DictionaryBatch {
+  id: long;
+  data: RecordBatch;
+}
+The dictionary `id` in the message metadata can be referenced one or more times
+in the schema, so that dictionaries can even be used for multiple fields. See
+the [Physical Layout][4] document for more about the semantics of
+dictionary-encoded data.
 ### Tensor (Multi-dimensional Array) Message Format
@@ -182,3 +197,4 @@ shared memory region) to be a multiple of 8:
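
The dictionary semantics described in the patch can be illustrated with a small sketch. This is not Arrow library code: `Field`, `DictionaryBatch`, and `decode` are hypothetical stand-ins written in plain Python, and a list of strings stands in for the single-field record batch that holds the dictionary values. It only shows the idea that the schema is read first, and that one dictionary `id` may be referenced by several fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Field:
    """Hypothetical schema field; not the Arrow Field type."""
    name: str
    # None means the field is not dictionary-encoded.
    dictionary_id: Optional[int] = None


@dataclass
class DictionaryBatch:
    """Hypothetical mirror of the DictionaryBatch table in the patch."""
    id: int
    # Stand-in for a single-field RecordBatch holding the dictionary values.
    data: List[str]


def decode(schema: List[Field],
           dictionary_batches: List[DictionaryBatch],
           columns: Dict[str, list]) -> Dict[str, list]:
    """Replace dictionary indices with values, using the schema to find ids."""
    # Index the dictionaries by id so that any field can look one up.
    dictionaries = {batch.id: batch.data for batch in dictionary_batches}
    decoded = {}
    for field in schema:
        values = columns[field.name]
        if field.dictionary_id is None:
            # Plain field: values are stored directly.
            decoded[field.name] = list(values)
        else:
            # Dictionary-encoded field: stored values are indices.
            lookup = dictionaries[field.dictionary_id]
            decoded[field.name] = [lookup[i] for i in values]
    return decoded


# Two fields share dictionary id 0; the third is not dictionary-encoded.
schema = [Field("origin", 0), Field("destination", 0), Field("count")]
batches = [DictionaryBatch(0, ["AMS", "JFK", "SFO"])]
columns = {
    "origin": [0, 1, 1],
    "destination": [2, 2, 0],
    "count": [10, 20, 30],
}
result = decode(schema, batches, columns)
```

Keying the dictionaries by `id` rather than by field name is what lets `origin` and `destination` share a single dictionary here. Null handling and type checking are left out to keep the sketch short.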
