hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-333) Add TFileTransport deserializer
Date Wed, 14 Apr 2010 23:22:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857143#action_12857143 ]

Joydeep Sen Sarma commented on HIVE-333:
----------------------------------------

i think a lot of stuff has changed in the hive code base since this patch was posted. for
sure - i think hive now uses the ASF namespace of thrift (org.apache.thrift) - which was a
big part of this patch (i think i bundled in a separate jar based on the asf distribution).


the other question is how the input thrift files are generated. previously the request was
for reading 'tfiletransport' formatted files. there's a lot of code in the patch - as well
as dependency on an uncommitted thrift patch for this reason. 

however - tfiletransport (despite its early use at Facebook) is not widely used. it suffers
from numerous performance problems (single threaded performance sucks) - and it also bloats
the data. it has the useful property that the data is chunked - but my understanding is that
TFramedTransport and its ilk may have similar properties.
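
To illustrate the kind of record boundary a framed container gives a reader, here is a minimal sketch of length-prefixed framing in Python. It assumes the TFramedTransport convention of a 4-byte big-endian length followed by the payload; the helper names `write_frame` and `read_frames` are hypothetical, and this is neither the Thrift library's API nor the code in the patch. It does not model TFileTransport's chunking.

```python
import io
import struct


def write_frame(out, payload: bytes) -> None:
    """Write one frame: a 4-byte big-endian length, then the payload (assumed layout)."""
    out.write(struct.pack(">i", len(payload)))
    out.write(payload)


def read_frames(inp):
    """Yield payloads one at a time from a stream of length-prefixed frames."""
    while True:
        header = inp.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise EOFError("truncated frame header")
        (size,) = struct.unpack(">i", header)
        if size < 0:
            raise ValueError("negative frame size")
        payload = inp.read(size)
        if len(payload) < size:
            raise EOFError("truncated frame body")
        yield payload


# Round trip: three serialized records (opaque bytes here) through an in-memory stream.
buf = io.BytesIO()
for record in (b"record-1", b"rec-2", b""):
    write_frame(buf, record)
buf.seek(0)
frames = list(read_frames(buf))
```

Because each frame carries its own length, a reader can recover record boundaries without understanding the serialized struct inside, which is the property a Hive-side reader would rely on.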

so i think the first step may be to identify the starting container for Thrift files - since
the integration into Hive depends a lot on that.

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, hive-333.patch.2, libthrift_asf.jar
>
>
> I've been googling around all night and haven't really found what I am looking for. Basically,
> I want to transfer some data from my web servers to hive in a format that's a little more
> verbose than plain CSV files. It seems like JSON or thrift would be perfect for this. I am
> planning on sending this serialized json or thrift data through scribe and loading it into
> Hive. I just can't figure out how to tell hive that the input data is a bunch of serialized
> thrift records (all of the records are the "struct" type) in a TFileTransport. Hopefully
> this makes sense...
> Reply from Joydeep Sen Sarma (jssarma@facebook.com)
> Unfortunately the open source code base does not have the loaders we run to convert thrift
> records in a tfiletransport into a sequencefile that hadoop/hive can work with. One option
> is that we add this to Hive code base (should be straightforward).
> No process required. Please file a jira - I will try to upload a patch this weekend (just
> cut'n'paste for most part). Would appreciate some help in finessing it out .. (the internal
> code is hardwired to some assumptions etc. )

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
