From: blue@apache.org
To: commits@parquet.incubator.apache.org
Message-Id: <5e9eaf2b3cb84cc8930dbe23107fa5bf@git.apache.org>
Subject: incubator-parquet-format git commit: PARQUET-113: Add specs for LIST and MAP annotations.
Date: Wed, 4 Mar 2015 20:08:58 +0000 (UTC)

Repository: incubator-parquet-format
Updated Branches:
  refs/heads/master e0e4ce153 -> 0e2e0a469

PARQUET-113: Add specs for LIST and MAP annotations.

Draft specs for using `MAP` and `LIST` annotations. Please help verify that this can read all existing map and list data correctly!

Author: Ryan Blue

Closes #17 from rdblue/PARQUET-113-add-list-and-map-spec and squashes the following commits:

7c50699 [Ryan Blue] PARQUET-113: Clarify LIST and MAP annotations.
eb627c7 [Ryan Blue] PARQUET-113: Add rules for maps written with Hive.
2515ffc [Ryan Blue] PARQUET-113: Clarify rules after working on implementations.
969a71e [Ryan Blue] PARQUET-113: Remove requirement for annotated repeated types.
3135c61 [Ryan Blue] PARQUET-113: Add specs for LIST and MAP annotations.

Project: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/commit/0e2e0a46
Tree: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/tree/0e2e0a46
Diff: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/diff/0e2e0a46

Branch: refs/heads/master
Commit: 0e2e0a469f3b2dd6b53210a89c851cbbf663fd6f
Parents: e0e4ce1
Author: Ryan Blue
Authored: Wed Mar 4 12:08:49 2015 -0800
Committer: Ryan Blue
Committed: Wed Mar 4 12:08:49 2015 -0800

----------------------------------------------------------------------
 LogicalTypes.md | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/blob/0e2e0a46/LogicalTypes.md
----------------------------------------------------------------------
diff --git a/LogicalTypes.md b/LogicalTypes.md
index e686a27..6bbd27a 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -148,3 +148,208 @@ primitive type.
The `binary` data is interpreted as an encoded BSON document as
 defined by the [BSON specification][bson-spec].

 [bson-spec]: http://bsonspec.org/spec.html
+
+## Nested Types
+
+This section specifies how `LIST` and `MAP` can be used to encode nested types
+by adding group levels around repeated fields that are not present in the data.
+
+This does not affect repeated fields that are not annotated: A repeated field
+that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated
+by `LIST` or `MAP` should be interpreted as a required list of required
+elements where the element type is the type of the field.
+
+Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
+repeated fields, but not both. When using the annotations, no unannotated
+repeated types are allowed.
+
+### Lists
+
+`LIST` is used to annotate types that should be interpreted as lists.
+
+`LIST` must always annotate a 3-level structure:
+
+```
+<list-repetition> group <name> (LIST) {
+  repeated group list {
+    <element-repetition> <element-type> element;
+  }
+}
+```
+
+* The outer-most level must be a group annotated with `LIST` that contains a
+  single field named `list`. The repetition of this level must be either
+  `optional` or `required` and determines whether the list is nullable.
+* The middle level, named `list`, must be a repeated group with a single
+  field named `element`.
+* The `element` field encodes the list's element type and repetition. Element
+  repetition must be `required` or `optional`.
+
+The following examples demonstrate two of the possible lists of string values.
+
+```
+// List<String> (list non-null, elements nullable)
+required group my_list (LIST) {
+  repeated group list {
+    optional binary element (UTF8);
+  }
+}
+
+// List<String> (list nullable, elements non-null)
+optional group my_list (LIST) {
+  repeated group list {
+    required binary element (UTF8);
+  }
+}
+```
+
+Element types can be nested structures. For example, a list of lists:
+
+```
+// List<List<Integer>>
+optional group array_of_arrays (LIST) {
+  repeated group list {
+    required group element (LIST) {
+      repeated group list {
+        required int32 element;
+      }
+    }
+  }
+}
+```
+
+#### Backward-compatibility rules
+
+It is required that the repeated group of elements is named `list` and that
+its element field is named `element`. However, these names may not be used in
+existing data and should not be enforced as errors when reading. For example,
+the following field schema should produce a nullable list of non-null strings,
+even though the repeated group is named `element`.
+
+```
+optional group my_list (LIST) {
+  repeated group element {
+    required binary str (UTF8);
+  };
+}
+```
+
+Some existing data did not include the inner element layer. For
+backward-compatibility, the type of elements in `LIST`-annotated structures
+should always be determined by the following rules based on the repeated field:
+
+1. If the repeated field is not a group, then its type is the element type and
+   elements are required.
+2. If the repeated field is a group with multiple fields, then its type is the
+   element type and elements are required.
+3. If the repeated field is a group with one field and is named either "array"
+   or uses the `LIST`-annotated group's name with "_tuple" appended then the
+   repeated type is the element type and elements are required.
+4. Otherwise, the repeated field's type is the element type with the repeated
+   field's repetition.
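The four rules above can be sketched as a small resolver. This is a minimal Python sketch under stated assumptions, not the Parquet implementation: the `Field` class and the `resolve_list_element` function are hypothetical names introduced for illustration. It assumes the legacy suffix in rule 3 is `_tuple` (as in the `my_list_tuple` example), and it reads rule 4 as "the single field inside the repeated group is the element, with that field's repetition", which is the reading that matches the `repeated group element { required binary str (UTF8); }` example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Field:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    primitive: Optional[str] = None    # e.g. "int32"; None means a group
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.primitive is None


def resolve_list_element(list_group: Field) -> Tuple[Field, str]:
    """Return (element node, element repetition) for a LIST-annotated group."""
    if len(list_group.children) != 1:
        raise ValueError("LIST group must contain exactly one field")
    repeated = list_group.children[0]
    if repeated.repetition != "repeated":
        raise ValueError("the field inside a LIST group must be repeated")

    # Rule 1: a repeated primitive is itself the element.
    if not repeated.is_group:
        return repeated, "required"
    if not repeated.children:
        raise ValueError("repeated group has no fields")
    # Rule 2: a repeated group with several fields is itself the element.
    if len(repeated.children) > 1:
        return repeated, "required"
    # Rule 3: legacy names mark a single-field group as the element.
    # Assumption: the suffix is "_tuple", following the my_list_tuple example.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated, "required"
    # Rule 4 (assumed reading): the repeated group is only a wrapper layer,
    # so its single field is the element and carries the repetition.
    child = repeated.children[0]
    return child, child.repetition
```

For instance, a standard 3-level list with an `optional binary element` resolves to that `element` field with repetition `optional`, while a group named `array` with a single field resolves to the `array` group itself with repetition `required`.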
+
+Examples that can be interpreted using these rules:
+
+```
+// List<Integer> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated int32 element;
+}
+
+// List<Tuple<String, Integer>> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated group element {
+    required binary str (UTF8);
+    required int32 num;
+  };
+}
+
+// List<OneTuple<String>> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated group array {
+    required binary str (UTF8);
+  };
+}
+
+// List<OneTuple<String>> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated group my_list_tuple {
+    required binary str (UTF8);
+  };
+}
+```
+
+### Maps
+
+`MAP` is used to annotate types that should be interpreted as a map from keys
+to values. `MAP` must annotate a 3-level structure:
+
+```
+<map-repetition> group <name> (MAP) {
+  repeated group key_value {
+    required <key-type> key;
+    <value-repetition> <value-type> value;
+  }
+}
+```
+
+* The outer-most level must be a group annotated with `MAP` that contains a
+  single field named `key_value`. The repetition of this level must be either
+  `optional` or `required` and determines whether the map is nullable.
+* The middle level, named `key_value`, must be a repeated group with a `key`
+  field for map keys and, optionally, a `value` field for map values.
+* The `key` field encodes the map's key type. This field must have
+  repetition `required` and must always be present.
+* The `value` field encodes the map's value type and repetition. This field can
+  be `required`, `optional`, or omitted.
+
+The following example demonstrates the type for a non-null map from strings to
+nullable integers:
+
+```
+// Map<String, Integer>
+required group my_map (MAP) {
+  repeated group key_value {
+    required binary key (UTF8);
+    optional int32 value;
+  }
+}
+```
+
+If there are multiple key-value pairs for the same key, then the final value
+for that key must be the last value. Other values may be ignored or may be
+added with replacement to the map container in the order that they are encoded.
+The `MAP` annotation should not be used to encode multi-maps using duplicate
+keys.
+
+#### Backward-compatibility rules
+
+It is required that the repeated group of key-value pairs is named `key_value`
+and that its fields are named `key` and `value`. However, these names may not
+be used in existing data and should not be enforced as errors when reading.
+
+Some existing data incorrectly used `MAP_KEY_VALUE` in place of `MAP`. For
+backward-compatibility, a group annotated with `MAP_KEY_VALUE` that is not
+contained by a `MAP`-annotated group should be handled as a `MAP`-annotated
+group.
+
+Examples that can be interpreted using these rules:
+
+```
+// Map<String, Integer> (nullable map, non-null values)
+optional group my_map (MAP) {
+  repeated group map {
+    required binary str (UTF8);
+    required int32 num;
+  }
+}
+
+// Map<String, Integer> (nullable map, nullable values)
+optional group my_map (MAP_KEY_VALUE) {
+  repeated group map {
+    required binary key (UTF8);
+    optional int32 value;
+  }
+}
+```
+
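The duplicate-key rule and the `MAP_KEY_VALUE` fallback from the Maps section can be sketched as follows. This is a minimal Python sketch, not a reader implementation: `build_map` and `is_map_group` are hypothetical helper names, and `is_map_group` makes the simplifying assumption that "contained by" can be checked against the direct parent's annotation only.

```python
from typing import Any, Dict, Hashable, Iterable, Optional, Tuple


def build_map(pairs: Iterable[Tuple[Hashable, Any]]) -> Dict[Hashable, Any]:
    """Add key-value pairs in encoded order, replacing earlier values.

    The final value for a duplicated key is therefore the last one encoded.
    """
    result: Dict[Hashable, Any] = {}
    for key, value in pairs:
        result[key] = value  # a later pair replaces an earlier one
    return result


def is_map_group(annotation: Optional[str],
                 parent_annotation: Optional[str]) -> bool:
    """Decide whether a group should be read as a map.

    Assumption: only the direct parent's annotation is inspected.
    """
    if annotation == "MAP":
        return True
    # Legacy data: MAP_KEY_VALUE used in place of MAP on the outer group.
    return annotation == "MAP_KEY_VALUE" and parent_annotation != "MAP"


print(build_map([("a", 1), ("b", 2), ("a", 3)]))  # {'a': 3, 'b': 2}
```

Using a plain dictionary follows the "added with replacement" option in the text; a reader could equally skip earlier duplicates, as long as the last value is the one that survives.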