hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HIVE-333) Add TFileTransport deserializer
Date Sun, 12 Apr 2009 07:02:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698174#action_12698174 ]

Joydeep Sen Sarma edited comment on HIVE-333 at 4/12/09 12:00 AM:
------------------------------------------------------------------

This turned out to be way more complicated than I had thought. Here's the rundown:

- THRIFT-377 - I have attached the TFileTransport Java ports there. More on this later.

- HIVE-333 - contains a new contrib/thrift module that has:
  * lib/libthrift_asf.jar - a thrift jar built from thrift trunk + THRIFT-377 (so it includes TFileTransport).
     I had to introduce a new libthrift into Hive because the current one uses the com.facebook namespace, which is not compatible with thrift trunk. All of contrib/thrift uses the latest thrift trunk version.

     Note that contrib/thrift/lib/libthrift_asf.jar is submitted as a separate attachment from the patch.

  * a trivial rewrite of the existing thrift SerDe in Hive (the new one is called org.apache.hadoop.hive.serde.asfthrift.ThriftBytesWritabledeserializer) that uses the thrift trunk library (instead of the old one). This is required to read thrift objects embedded inside BytesWritable objects in Hive.

  * contrib/thrift also has a TFileTransportInputFormat and TFileTransportRecordReader - these allow processing of TFileTransport files as inputs to Hadoop map-reduce. The input format will split files so that the splits are aligned with TFileTransport chunk boundaries.

  * it also has an example map-reduce program (TConverter/TMapper) that shows how to convert a TFileTransport into a SequenceFile with thrift objects embedded inside BytesWritable objects (a rough sketch of this pattern appears after this list). This example does not do any reduction - but you can extend it to hash/reduce on a specific key (which is what we do at Facebook). Output compression can also be controlled by command-line options (it extends Tool - more on usage later).

  * aside from libthrift_asf.jar, the rest of the code is produced as a single jar file by contrib/thrift (see build/contrib-thrift/hive_contrib-thrift.jar - it should be produced by ant jar or ant package).
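
For readers who just want the general shape of the conversion without digging into the patch, here is a minimal sketch of the map-only idea - it is NOT the code from hive-333.patch.1. It assumes the TFileTransport record reader hands each serialized thrift record to the mapper as a BytesWritable value keyed by a LongWritable offset; those types, and the commented-out TFileTransportInputFormat reference, are assumptions about the contrib classes, while everything else is stock Hadoop (old mapred API).

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TConverterSketch extends Configured implements Tool {

  // pass-through mapper: forward each serialized thrift record unchanged
  public static class PassThroughMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesWritable, NullWritable, BytesWritable> {
    public void map(LongWritable offset, BytesWritable record,
                    OutputCollector<NullWritable, BytesWritable> out,
                    Reporter reporter) throws IOException {
      out.collect(NullWritable.get(), record);
    }
  }

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), TConverterSketch.class);
    job.setJobName("tfiletransport-to-sequencefile");
    // job.setInputFormat(TFileTransportInputFormat.class);  // the contrib input format (assumed name)
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                                 // map-only, as in the example
    job.setOutputFormat(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(BytesWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new TConverterSketch(), args));
  }
}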

i.e., the work done so far allows conversion of files in TFileTransport format into SequenceFile + BytesWritable format (and also provides the SerDe to read these files), which is Hive friendly. Example run of TConverter:

hadoop jar -libjars contrib/thrift/lib/libthrift_asf.jar,build/ql/hive_exec.jar \
    build/contrib-thrift/hive_contrib-thrift.jar \
    org.apache.hadoop.hive.thrift.TConverter \
    -Dthrift.filetransport.classname=org.apache.hadoop.thrift.TestClass \
    -inputpath /tmp/tfiletransportfile -output /tmp/sequencefile

// More options (including those to get compressed SequenceFiles) can be added with additional -Dkey=value options (a sketch of the corresponding configuration keys follows below).
// You will need to add the jar file for TestClass in this example to the -libjars switch as well.
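
Since TConverter extends Tool, these -Dkey=value options land in the job's Configuration. For illustration only (not code from the patch), block-compressed SequenceFile output corresponds to the standard Hadoop output-compression keys of this era, set here programmatically; the same keys can equally be passed as -D options on the command line.

import org.apache.hadoop.mapred.JobConf;

public class CompressedOutputExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    job.setBoolean("mapred.output.compress", true);          // turn on output compression
    job.set("mapred.output.compression.type", "BLOCK");      // block compression for SequenceFiles
    job.set("mapred.output.compression.codec",
            "org.apache.hadoop.io.compress.DefaultCodec");   // or e.g. GzipCodec
  }
}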

Once the files are converted, it's trivial to create a Hive table with the right properties so that these files can be queried. A few points about Hive integration:
- I need to ask Prasad about the exact CLI statements to create these tables - I will post instructions once I have them.
- the jar files hive_contrib-thrift.jar and libthrift_asf.jar will need to be in the Hive execution environment. This can be arranged by copying them into auxlib/ under the Hive distribution directory. I haven't integrated this into ant yet.
- the jar files for the classes that are serialized into the SequenceFile and need to be queried by Hive have to be deposited into auxlib/ as well.

Two more options exist:
- convert thrift files into text using TConverter-type programs
- alternatively, we can arrange for Hive to query TFileTransport directly. It's not that hard (since the input format is now done) - but it needs some more work, testing, and new code.


CAVEAT regarding THRIFT-377 - I am finding a few (1-5) spurious empty records at the beginning of each TFileTransport chunk when trying to read TFileTransport files produced in C++ land from Java land (and only when seeking to split boundaries). I just don't have the time to debug this any more. The simple workaround is to disable splitting of TFileTransport files by setting mapred.min.split.size to an effectively infinite value. If the files are not split, there's no problem.
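
A minimal sketch of that workaround, assuming the job is driven through a JobConf (Long.MAX_VALUE stands in for the "infinite" value; the same setting can be passed on the command line as -Dmapred.min.split.size=9223372036854775807):

import org.apache.hadoop.mapred.JobConf;

public class DisableSplittingWorkaround {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // a minimum split size larger than any input file means each
    // TFileTransport file is handed to a single mapper unsplit
    job.setLong("mapred.min.split.size", Long.MAX_VALUE);
  }
}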

I am hoping you can take things from here. If we really, really need Hive to query TFileTransport directly, it's probably another couple of hours' worth of work - but I will wait for your input and see whether this is required (it seems to me that SequenceFiles are a better long-term data container in Hadoop since they allow compression).

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, libthrift_asf.jar
>
>
> I've been googling around all night and haven't really found what I am looking for. Basically,
I want to transfer some data from my web servers to Hive in a format that's a little more
verbose than plain CSV files. It seems like JSON or thrift would be perfect for this. I am
planning on sending this serialized JSON or thrift data through scribe and loading it into
Hive. I just can't figure out how to tell Hive that the input data is a bunch of serialized
thrift records (all of the records are the "struct" type) in a TFileTransport. Hopefully
this makes sense...
> Reply from Joydeep Sen Sarma (jssarma@facebook.com)
> Unfortunately the open source code base does not have the loaders we run to convert thrift
records in a TFileTransport into a SequenceFile that Hadoop/Hive can work with. One option
is that we add this to the Hive code base (should be straightforward).
> No process required. Please file a JIRA - I will try to upload a patch this weekend (it's just
cut'n'paste for the most part). Would appreciate some help in finessing it out .. (the internal
code is hardwired to some assumptions, etc.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

