arrow-commits mailing list archives

From u..@apache.org
Subject arrow git commit: ARROW-986: [Format] Add brief explanation of dictionary batches in IPC.md
Date Mon, 05 Jun 2017 10:20:39 GMT
Repository: arrow
Updated Branches:
  refs/heads/master 8f2b44b89 -> a44155d6e


ARROW-986: [Format] Add brief explanation of dictionary batches in IPC.md

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #732 from wesm/ARROW-986 and squashes the following commits:

4321106 [Wes McKinney] Add brief explanation of dictionary batches in IPC.md


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/a44155d6
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/a44155d6
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/a44155d6

Branch: refs/heads/master
Commit: a44155d6ec5d0c6c255d3305a494f51a6b1d2316
Parents: 8f2b44b
Author: Wes McKinney <wes.mckinney@twosigma.com>
Authored: Mon Jun 5 12:20:35 2017 +0200
Committer: Uwe L. Korn <uwelk@xhochy.com>
Committed: Mon Jun 5 12:20:35 2017 +0200

----------------------------------------------------------------------
 format/IPC.md | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/a44155d6/format/IPC.md
----------------------------------------------------------------------
diff --git a/format/IPC.md b/format/IPC.md
index bf2aaa7..7d68921 100644
--- a/format/IPC.md
+++ b/format/IPC.md
@@ -157,9 +157,24 @@ Some notes about this
 
 ### Dictionary Batches
 
-Dictionary batches have not yet been implemented, while they are provided for
-in the metadata. For the time being, the `DICTIONARY` segments shown above in
-the file do not appear in any of the file implementations.
+Dictionaries are written in the stream and file formats as a sequence of record
+batches, each having a single field. The complete semantic schema for a
+sequence of record batches, therefore, consists of the schema along with all of
+the dictionaries. The dictionary types are found in the schema, so it is
+necessary to read the schema to first determine the dictionary types so that
+the dictionaries can be properly interpreted.
+
+```
+table DictionaryBatch {
+  id: long;
+  data: RecordBatch;
+}
+```
+
+The dictionary `id` in the message metadata can be referenced one or more times
+in the schema, so that dictionaries can even be used for multiple fields. See
+the [Physical Layout][4] document for more about the semantics of
+dictionary-encoded data.
 
 ### Tensor (Multi-dimensional Array) Message Format
 
@@ -182,3 +197,4 @@ shared memory region) to be a multiple of 8:
 [1]: https://github.com/apache/arrow/blob/master/format/File.fbs
 [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs
 [3]: https://github.com/google/flatbuffers
+[4]: https://github.com/apache/arrow/blob/master/format/Layout.md
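
As an illustration of the paragraph added above, here is a minimal sketch of how a reader might resolve dictionaries by `id` and share one dictionary across several fields. It is plain Python, not the Arrow library API. The `Field` class, its `dictionary_id` attribute, the `values` list, and the `decode` helper are hypothetical names chosen for this example. Only the `id` and the single-field payload mirror the `DictionaryBatch` table in the diff.

```python
from dataclasses import dataclass


@dataclass
class DictionaryBatch:
    """Toy stand-in for the DictionaryBatch message: an id plus one column."""
    id: int
    values: list  # hypothetical: the single field of the wrapped record batch


@dataclass
class Field:
    """Hypothetical schema field that points at a dictionary by id."""
    name: str
    dictionary_id: int


def decode(field, indices, dictionaries):
    """Map integer indices to dictionary values; None marks a null slot."""
    values = dictionaries[field.dictionary_id].values
    return [None if i is None else values[i] for i in indices]


# The schema is read first, so the reader knows which fields use which id.
schema = [Field("shirt_color", 0), Field("hat_color", 0)]

# Dictionary batches are then collected and keyed by their id.
batches = [DictionaryBatch(id=0, values=["red", "green", "blue"])]
dictionaries = {batch.id: batch for batch in batches}

# Record batch columns carry only indices into the dictionary.
columns = {"shirt_color": [0, 2, None], "hat_color": [1, 1, 0]}

# Both fields resolve against the same dictionary because they share id 0.
decoded = {f.name: decode(f, columns[f.name], dictionaries) for f in schema}
```

Keying the lookup on `id` rather than on the field name is what lets a single dictionary serve more than one field in this sketch.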

