hive-dev mailing list archives

From "Mickael Lacour (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8359) Map containing null values are not correctly written in Parquet files
Date Tue, 18 Nov 2014 09:36:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215985#comment-14215985 ]

Mickael Lacour commented on HIVE-8359:
--------------------------------------

[~brocknoland], as far as I know I picked up the patch that [~rdblue] pointed me to (the one
under review on Review Board), though maybe not its latest version.

[~rdblue] wanted me to update this patch to also handle HIVE-6994, instead of having two patches
with the same behavior/code. And I like the way [~spena] wrote the solution (better
than mine, in my opinion).

[~spena], basically I modified the ArrayWritableGroupConverter to reset the 'current value' in
start(). If you don't do that, a null element inside an array is never read back as null; it
comes back as the previous value instead.
{code}
diff --git ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java
index 582a5df..052b36d 100644
--- ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java
+++ ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java
@@ -54,6 +54,7 @@ public void start() {
     if (isMap) {
       mapPairContainer = new Writable[2];
     }
+    currentValue = null;
   }
 
   @Override
{code}
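
To illustrate why the reset in start() matters, here is a minimal, self-contained sketch. It does not use the Hive or Parquet classes: ElementConverter, set() and read() are hypothetical names, and the sketch assumes that the reader invokes no value callback for a null element, so a converter that keeps state between elements would re-emit the previous value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a group converter; this is NOT the Hive class.
class StaleValueDemo {

    /** start() opens an element, set() is called only for non-null values, end() emits. */
    static final class ElementConverter {
        private final boolean resetOnStart;
        private String currentValue;
        private final List<String> out = new ArrayList<>();

        ElementConverter(boolean resetOnStart) {
            this.resetOnStart = resetOnStart;
        }

        void start() {
            if (resetOnStart) {
                currentValue = null; // the one-line fix: forget the previous element
            }
        }

        void set(String value) {
            currentValue = value;
        }

        void end() {
            out.add(currentValue);
        }

        List<String> result() {
            return out;
        }
    }

    /** Feeds the elements through a converter, skipping set() for nulls (assumed reader behavior). */
    static List<String> read(String[] elements, boolean resetOnStart) {
        ElementConverter converter = new ElementConverter(resetOnStart);
        for (String element : elements) {
            converter.start();
            if (element != null) {
                converter.set(element);
            }
            converter.end();
        }
        return converter.result();
    }

    public static void main(String[] args) {
        String[] input = {"a", null, "b"};
        System.out.println(read(input, false)); // [a, a, b]    (stale value leaks)
        System.out.println(read(input, true));  // [a, null, b] (null preserved)
    }
}
```

Without the reset, the null in the middle is replaced by the preceding "a", which is the symptom described above.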

The second part was to stop dropping null values in ParquetHiveSerDe (I was skipping them
before for no valid reason).

{code}
diff --git ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
index b689336..4b36767 100644
--- ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
+++ ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
@@ -202,13 +202,11 @@ private ArrayWritable createArray(final Object obj, final ListObjectInspector in
     if (sourceArray != null) {
       for (final Object curObj : sourceArray) {
-        final Writable newObj = createObject(curObj, subInspector);
-        if (newObj != null) {
-          array.add(newObj);
-        }
+        array.add(createObject(curObj, subInspector));
       }
     }
     if (array.size() > 0) {
-      final ArrayWritable subArray = new ArrayWritable(array.get(0).getClass(),
+      final ArrayWritable subArray = new ArrayWritable(Writable.class,
           array.toArray(new Writable[array.size()]));
       return new ArrayWritable(Writable.class, new Writable[] {subArray});
     } else {
{code}
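
As a rough illustration of the behavior change in createArray, the sketch below contrasts the old loop (which drops null elements) with the new one (which keeps them). It uses plain Java lists rather than Hadoop Writables; convert(), dropNulls() and keepNulls() are hypothetical names, with convert() standing in for createObject under the assumption that a null input yields a null output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; these helpers are hypothetical, not part of Hive.
class NullElementDemo {

    /** Stand-in for createObject: assumed to return null for a null input. */
    static String convert(Object value) {
        return value == null ? null : value.toString();
    }

    /** Old behavior: null elements are silently skipped, so positions shift. */
    static List<String> dropNulls(List<?> source) {
        List<String> out = new ArrayList<>();
        for (Object value : source) {
            String converted = convert(value);
            if (converted != null) {
                out.add(converted);
            }
        }
        return out;
    }

    /** New behavior: every element is added, null or not. */
    static List<String> keepNulls(List<?> source) {
        List<String> out = new ArrayList<>();
        for (Object value : source) {
            out.add(convert(value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(null, 7);
        System.out.println(dropNulls(source)); // [7]
        System.out.println(keepNulls(source)); // [null, 7]
    }
}
```

Once nulls are kept, the first element of the list may itself be null, so calling getClass() on array.get(0) could throw a NullPointerException; passing Writable.class as the component type, as the diff does, avoids that.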

And the qtest just makes sure we handle an empty array, a null array, and an array containing
nulls, and the same cases for maps.

{code}
+++ data/files/parquet_array_null_element.txt
@@ -0,0 +1,3 @@
+1|,7|CARRELAGE,MOQUETTE|key11:value11,key12:value12,key13:value13
+2|,|CAILLEBOTIS,|
+3|,42,||key11:value11,key12:,key13:
{code}


If you want to integrate these changes into your patch, feel free to do so; otherwise I might
duplicate your patch (:p) and add this fix.

> Map containing null values are not correctly written in Parquet files
> ---------------------------------------------------------------------
>
>                 Key: HIVE-8359
>                 URL: https://issues.apache.org/jira/browse/HIVE-8359
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>    Affects Versions: 0.13.1
>            Reporter: Frédéric TERRAZZONI
>            Assignee: Sergio Peña
>         Attachments: HIVE-8359.1.patch, HIVE-8359.2.patch, HIVE-8359.4.patch, map_null_val.avro
>
>
> Tried to write a map<string,string> column in a Parquet file. The table should contain:
> {code}
> {"key3":"val3","key4":null}
> {"key3":"val3","key4":null}
> {"key1":null,"key2":"val2"}
> {"key3":"val3","key4":null}
> {"key3":"val3","key4":null}
> {code}
> ... and when you do a query like {code}SELECT * from mytable{code}
> We can see that the table is corrupted :
> {code}
> {"key3":"val3"}
> {"key4":"val3"}
> {"key3":"val2"}
> {"key4":"val3"}
> {"key1":"val3"}
> {code}
> I've not been able to read the Parquet file in our software afterwards, and consequently I suspect it to be corrupted.
> For those who are interested, I generated this Parquet table from an Avro file. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
