hive-dev mailing list archives

From Alan Gates <ga...@hortonworks.com>
Subject Re: Parquet support (HIVE-5783)
Date Tue, 18 Feb 2014 17:49:20 GMT
Gunther, does anything extra need to be done to ship Parquet code with Hive
right now?  If I read the patch correctly, the Parquet jars were added to the
pom and thus will be shipped as part of Hive.  As long as it works out of the box when a user
says “create table … stored as parquet” why do we care whether the parquet jar is owned
by Hive or another project?

The concern about feature mismatch in Parquet versus Hive is valid, but I’m not sure what
to do about it other than ensure that there are good error messages.  Users will often want
to use non-Hive based storage formats (Parquet, Avro, etc.).  This means we need a good way
to detect at SQL compile time that the underlying storage doesn’t support the indicated
data type and throw a good error.
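
A rough sketch of what such a compile-time check could look like follows. Everything
in it is hypothetical: the class name, the method names, and especially the capability
table, which is made up for illustration and does not reflect what Parquet, ORC, or any
other format really supports. None of it is an existing Hive API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names are Hive APIs. */
public class StorageTypeCheck {

    // Illustrative capability table only; NOT the real type support of any format.
    private static final Map<String, Set<String>> SUPPORTED = new HashMap<>();
    static {
        SUPPORTED.put("parquet",
            new HashSet<>(Arrays.asList("int", "bigint", "double", "string")));
        SUPPORTED.put("orc",
            new HashSet<>(Arrays.asList("int", "bigint", "double", "string", "date", "decimal")));
    }

    /** Returns the column types the given format cannot store, in declaration order. */
    public static List<String> unsupportedTypes(String format, List<String> columnTypes) {
        Set<String> ok = SUPPORTED.get(format.toLowerCase(Locale.ROOT));
        if (ok == null) {
            throw new IllegalArgumentException("Unknown storage format: " + format);
        }
        List<String> bad = new ArrayList<>();
        for (String type : columnTypes) {
            if (!ok.contains(type.toLowerCase(Locale.ROOT))) {
                bad.add(type);
            }
        }
        return bad;
    }

    /** Rejects the table definition before any data is written, naming the offending types. */
    public static void validate(String table, String format, List<String> columnTypes) {
        List<String> bad = unsupportedTypes(format, columnTypes);
        if (!bad.isEmpty()) {
            throw new IllegalStateException("Cannot create table " + table
                + ": storage format " + format
                + " does not support column type(s) " + bad);
        }
    }
}
```

With this made-up table, validate("t", "parquet", Arrays.asList("int", "date")) would
fail with a message naming the date column type, while the same call with "orc" would
pass. The point is only that the error names the format and the type up front, rather
than surfacing as a SerDe failure at run time.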

Also, it’s important to be clear going forward about what Hive as a project is signing up
for.  If tomorrow someone decides to add a new datatype or feature we need to be clear that
we expect the contributor to make this work for Hive owned formats (text, RC, sequence, ORC)
but not necessarily for external formats (Parquet, Avro).  

Alan.

On Feb 17, 2014, at 7:03 PM, Gunther Hagleitner <ghagleitner@hortonworks.com> wrote:

> Brock,
> 
> I'm not trying to "pick winners", I'm merely trying to say that the
> documentation/code should match what's actually there, so folks can make
> informed decisions.
> 
> The issue I have with the word "native" is that people have expectations
> when they hear it and I think these are not met.
> 
> I've had folks ask me why we're switching the default of hive to Parquet.
> This isn't the case obviously, but "native" to most people means just that:
> Hive's primary format. That's why I was asking for a title of "Add Parquet
> SerDe" for the jira. That's the exact same thing that was done for Avro
> under the exact same circumstances:
> https://issues.apache.org/jira/browse/HIVE-895.
> 
> Native also has other associations a) it supports the full data
> model/feature set and b) it's part of hive. Neither is the case and I don't
> think that's just a superficial difference. Support and usability will be
> different. That's why I think the documentation should delineate between
> RC/ORC/etc on one side and Parquet/Avro/etc on the other.
> 
> As mentioned in the jira "STORED AS" was reserved for what's actually part
> of hive (or hadoop core in the case of sequence file as you point out). I
> think there are reasons for that: a) being part of the grammar implies
> native as above b) you need to ship the code bundled in hive-exec for this
> to work (which is *broken* right now) and c) like you said we shouldn't
> pick winners by letting some of them become a keyword and others not. For
> these reasons I think Parquet should use the old syntax at this point. If
> you have a pluggable/configurable way great, but right now we don't have
> that.
> 
> Finally, yes, I am late to this party and I apologize for that. I'm happy
> to make the suggested changes myself, if that's the concern.
> 
> Thanks,
> Gunther.
> 
> 
> 
> On Sun, Feb 16, 2014 at 7:40 PM, Brock Noland <brock@cloudera.com> wrote:
> 
>> Hi Gunther,
>> 
>> Please find my response inline.
>> 
>> On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner <gunther@apache.org>
>> wrote:
>>> I read through the ticket, patch and documentation
>> 
>> Thank you very much for reading through these items!
>> 
>>> and would like to
>>> suggest some changes.
>> 
>> There was ample time to suggest these changes prior to commit. The
>> JIRA was created three months ago, and the title you object to and the
>> patch were both up there over two months ago.
>> 
>>> As far as I can tell this basically adds parquet SerDes to hive, but the
>>> file format remains external to hive. There is no way for hive devs to
>>> make changes, fix bugs, add or change datatypes, or add features to parquet
>>> itself.
>> 
>> As stated in many locations including the JIRA discussed here, we
>> shouldn't be picking winner/loser file formats. We use many external
>> libraries that not all Hive developers have the ability to
>> modify. For example, most Hive developers do not have the ability to
>> modify Sequence File. Tez is also an external library which few Hive
>> developers can change.
>> 
>>> So:
>>> 
>>> - I suggest we document it as one of the built-in SerDes and not as a
>>> native format like here:
>>> https://cwiki.apache.org/confluence/display/Hive/Parquet (and here:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
>>> - I vote for the jira to say "Add parquet SerDes to Hive" and not "Native
>>> support"
>> 
>> The change provides the ability to create a parquet table with Hive,
>> natively. Therefore I don't see the issue you have with the word
>> native.
>> 
>>> - I think we should revert the change to the grammar to allow "STORED AS
>>> PARQUET" until we have a mechanism to do that for all SerDes, i.e.: someone
>>> picks up: HIVE-5976. (I also don't think this actually works properly
>>> unless we bundle parquet in hive-exec, which I don't think we want.)
>> 
>> Again, you could have provided this feedback many moons ago. I am
>> personally interested in HIVE-5976 but it's orthogonal to this issue.
>> That change just makes it easier and cleaner to add STORED AS
>> keywords. The contributors of the Parquet integration are not required
>> to fix Hive. That is our job.
>> 
>>> - We should revert the deprecated classes (At least I don't understand how
>>> a first drop needs to add deprecated stuff)
>> 
>> The deprecated classes are shells (no actual code) to support existing
>> users of Parquet, of which there are many. I see no justification for
>> impacting existing users when the workaround is trivial and
>> non-impacting to any other user.
>> 
>>> In general though, I'm also confused on why adding this SerDe to the hive
>>> code base is beneficial. Seems to me that that just makes upgrading
>>> Parquet, bug fixing, etc more difficult by tying a SerDe release to a Hive
>>> release. To me that outweighs the benefit of a slightly more involved setup
>>> of Hive + serde in the cluster.
>> 
>> The Hive APIs, which are not clearly defined, have changed often in
>> the past few releases making maintaining a file format extremely
>> difficult. For example, 0.12 and 0.13 break most if not all external
>> code bases.
>> 
>> However, beyond that, the community felt it was beneficial to make
>> Parquet easier to use. If you are not interested in Parquet then
>> ignore it as this change does not impact you. Tez integration is
>> something which does not interest me or many other Hive
>> developers. Indeed other than a few cursory reviews and a few times
>> where I championed the refactoring you guys were doing in order to
>> support Tez, I have ignored the Tez work.
>> 
>> Sincerely,
>> Brock
>> 
> 
> -- 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, you are hereby notified that 
> any printing, copying, dissemination, distribution, disclosure or 
> forwarding of this communication is strictly prohibited. If you have 
> received this communication in error, please contact the sender immediately 
> and delete it from your system. Thank You.


