hive-dev mailing list archives

From Brock Noland <>
Subject Re: Parquet support (HIVE-5783)
Date Fri, 21 Feb 2014 14:27:02 GMT

Storage handlers muddy the waters a bit, IMO. That interface was
written for storage that is not file-based, e.g. HBase, whereas Avro,
Parquet, SequenceFile, etc. are all file-based.
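For readers skimming the archive, the distinction might look like this in DDL. This is an illustrative sketch, not from the thread; the table names and HBase column mapping are invented, though the handler class and SERDEPROPERTIES key are the standard Hive/HBase integration names:

```sql
-- Non-native table: requires a storage handler, declared with STORED BY.
-- (Illustrative; table name and column mapping are hypothetical.)
CREATE TABLE hbase_backed (key STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value");

-- Native, file-based table: no handler involved, declared with STORED AS.
CREATE TABLE parquet_backed (key STRING, value STRING)
STORED AS PARQUET;
```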

I think we have to be practical about confusion. There are so many
Hadoop newbies out there, almost all of them new to Apache as well,
that there is going to be some confusion. For example, one person who
had been using Hadoop and Hive for a few months said to me "Hive moved
*from* Apache to Hortonworks". At the end of the day, regardless of
what we do, some level of confusion is going to persist amongst those
new to the ecosystem.

With that said, I do think that an overview of "Hive Storage" would be
a great addition to our documentation.
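As a concrete seed for such an overview, the two Parquet DDL forms this thread discusses could be sketched roughly as below. The pre-0.13 class names are my recollection of the external parquet-hive bindings and should be verified before going into any documentation:

```sql
-- Hive 0.13.0 and later: native keyword support.
CREATE TABLE events (id BIGINT, payload STRING)
STORED AS PARQUET;

-- Hive 0.10-0.12 required spelling out the SerDe and input/output
-- formats from the parquet-hive bindings (class names from memory;
-- verify against the release you are documenting):
CREATE TABLE events_legacy (id BIGINT, payload STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
```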


On Fri, Feb 21, 2014 at 1:27 AM, Lefty Leverenz <> wrote:
> This is in the Terminology section<> of the Storage Handlers doc:
> Storage handlers introduce a distinction between *native* and
> *non-native* tables.
>> A native table is one which Hive knows how to manage and access without a
>> storage handler; a non-native table is one which requires a storage handler.
> It goes on to say that non-native tables are created with a STORED BY
> clause (as opposed to a STORED AS clause).
> Does that clarify or muddy the waters?
> -- Lefty
> On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz <> wrote:
>> Some of these issues can be addressed in the documentation.  The "File
>> Formats" section of the Language Manual needs an overview, and that might
>> be a good place to explain the differences between Hive-owned formats and
>> external formats.  Or the SerDe doc could be beefed up: Built-In SerDes<>.
>> In the meantime, I've added a link to the Avro doc in the "File Formats"
>> list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section<>:
>> Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet<>
>>> storage format in Hive 0.13.0 and later<>;
>>> 0.10, 0.11, or 0.12<>
>>> .
>> Does that work?
>> -- Lefty
>> On Tue, Feb 18, 2014 at 1:31 PM, Brock Noland <> wrote:
>>> Hi Alan,
>>> Response is inline, below:
>>> On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates <> wrote:
>>> > Gunther, is it the case that there is anything extra that needs to be
>>> done to ship Parquet code with Hive right now?  If I read the patch
>>> correctly the Parquet jars were added to the pom and thus will be shipped
>>> as part of Hive.  As long as it works out of the box when a user says
>>> "create table ... stored as parquet", why do we care whether the parquet jar
>>> is owned by Hive or another project?
>>> >
>>> > The concern about feature mismatch in Parquet versus Hive is valid, but
>>> I'm not sure what to do about it other than assure that there are good
>>> error messages.  Users will often want to use non-Hive based storage
>>> formats (Parquet, Avro, etc.).  This means we need a good way to detect at
>>> SQL compile time that the underlying storage doesn't support the indicated
>>> data type and throw a good error.
>>> Agreed, the error messages should absolutely be good. I will ensure
>>> this is the case via
>>> >
>>> > Also, it's important to be clear going forward about what Hive as a
>>> project is signing up for.  If tomorrow someone decides to add a new
>>> datatype or feature, we need to be clear that we expect the contributor to
>>> make this work for Hive-owned formats (text, RC, sequence, ORC) but not
>>> necessarily for external formats.
>>> This makes sense to me.
>>> I'd just like to add that I have a patch available to improve the
>>> hive-exec uber jar and general query speed:
>>> Additionally, I have a
>>> patch available to finish the generic STORED AS functionality:
>>> Brock
