hive-dev mailing list archives

From Ryan Blue <b...@cloudera.com>
Subject Re: Hive Parquet Reader and "repeated" field
Date Tue, 11 Nov 2014 23:14:30 GMT
On 11/11/2014 01:07 PM, Jean-Pascal Billaud wrote:
> While running "select * from parquet_requests", the whole thing crashes
> with the following exception:
>
>    > public ArrayWritableGroupConverter(final GroupType groupType,
>    >     final HiveGroupConverter parent, final int index) {
>    >   this.parent = parent;
>    >   this.index = index;
>    >   int count = groupType.getFieldCount();
>    >   if (count < 1 || count > 2) {
>    >     throw new IllegalStateException(
>    >         "Field count must be either 1 or 2: " + count);
>    >   }
>    >
>
> What this means is that requests_tuple is not considered a valid list
> because it has more than one field. It basically expects the "repeated"
> keyword on the "requests (LIST)" as opposed to "requests_tuple". The
> actual code also does not seem to handle repeated on primitives since
> the ETypeConverters always call parent.set() hence always replacing the
> previous stored instance.
>
> I cooked up a patch which as far as I can tell would fix the issues here
> and I would like to have some comments to see if that patch is in the
> right direction before submitting a more formal pull request. Things
> need to be polished so please don't spend too much time on the form but
> more on the approach.
>
> https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d
>
> Moreover, I have a feeling that I should probably not pass the thrift
> class for the parquet table given that at this point it is totally
> irrelevant and the parquet schema is stored in the parquet files. I also
> expect some ObjectInspector issue due to the extra grouping provided by
> the requests_tuple entry. Thoughts?
>
> Thanks,
>

Hi Jean-Pascal,

This is a known issue that we're going to be fixing shortly. The problem
is that Hive and Thrift (or Avro) represent lists differently.
PARQUET-113 [1] is an effort to define what is currently being written
and what we need to do to add compatibility. It also specifies what
should be written.

Hive is one of the first object models that will be updated with the 
backward-compatibility rules so that it can read parquet-avro and 
parquet-thrift structures correctly.
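
For illustration only, here is a minimal sketch of what such a
backward-compatibility rule could look like. It is loosely modelled on
the LIST rules later documented in the Parquet format specification and
is not the Hive patch or the parquet-mr API. The Field class, the
elementOf helper, and the field_a..field_c names are hypothetical, and
the "bag" / "array_element" wrapper names are assumed to match what
Hive's writer produces.

```java
import java.util.Arrays;
import java.util.List;

public class ListElementRule {

    /** Hypothetical, minimal stand-in for a Parquet schema node (not the parquet-mr API). */
    static final class Field {
        final String name;
        final boolean repeated;
        final List<Field> children; // null for a primitive

        Field(String name, boolean repeated, List<Field> children) {
            this.name = name;
            this.repeated = repeated;
            this.children = children;
        }

        boolean isGroup() {
            return children != null;
        }
    }

    static Field primitive(String name) {
        return new Field(name, false, null);
    }

    static Field group(String name, boolean repeated, Field... children) {
        return new Field(name, repeated, Arrays.asList(children));
    }

    /** Returns the node that a reader should treat as the list's element. */
    static Field elementOf(Field list) {
        if (!list.isGroup() || list.children.size() != 1 || !list.children.get(0).repeated) {
            throw new IllegalArgumentException(
                "LIST group must contain exactly one repeated field: " + list.name);
        }
        Field repeated = list.children.get(0);
        if (!repeated.isGroup()) {
            return repeated; // repeated primitive: the primitive is the element
        }
        if (repeated.children.size() != 1) {
            return repeated; // e.g. a multi-field struct: the group itself is the element
        }
        if (repeated.name.equals("array") || repeated.name.equals(list.name + "_tuple")) {
            return repeated; // single-field group named by a legacy writer convention
        }
        return repeated.children.get(0); // synthetic wrapper layer: unwrap it
    }

    public static void main(String[] args) {
        // Thrift-style layout from the report; field_a..field_c are placeholder names.
        Field thriftStyle = group("requests", false,
            group("requests_tuple", true,
                primitive("field_a"), primitive("field_b"), primitive("field_c")));

        // Layout with an extra wrapper layer; "bag"/"array_element" are assumed names.
        Field hiveStyle = group("requests", false,
            group("bag", true,
                group("array_element", false,
                    primitive("field_a"), primitive("field_b"), primitive("field_c"))));

        System.out.println(elementOf(thriftStyle).name);
        System.out.println(elementOf(hiveStyle).name);
    }
}
```

With a rule like this, the multi-field requests_tuple group is accepted
as the element type instead of failing a field-count check, while a
single wrapper layer is still unwrapped.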

rb

[1]: https://issues.apache.org/jira/browse/PARQUET-113

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.
