hive-user mailing list archives

From Kjell Tore Fossbakk <>
Subject Re: Parsing and moving data to ORC from HDFS
Date Wed, 22 Apr 2015 18:49:33 GMT
Hey Gopal.
Thanks for your answers. Some follow-ups below:

On Wed, Apr 22, 2015 at 3:46 PM, Gopal Vijayaraghavan <> wrote:

> > I have about 100 TB of data, approximately 180 billion events, in my
> >HDFS cluster. It is my raw data stored as GZIP files. At the time of
> >setup this was due to "saving the data" until we figured out what to do
> >with it.
> >
> > After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> >last week I was amazed by the results presented.
> I run at the very cutting edge of the builds all the time :)
> The bloom filters are there in hive-1.2.0 which is currently sitting in
> svn/git today.
In production we run HDP 2.2.4. Any thoughts on when crazy stuff like bloom
filters might move to GA?
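
(For when they do land: going by the ORC docs, I'm assuming the table-level
knobs look roughly like the below. The table and column names are just
placeholders of mine, untested on our side:

    -- Bloom filters are built per ORC stripe for the listed columns;
    -- column names here are placeholders.
    CREATE TABLE events_orc (
      ts STRING,
      host STRING,
      msg STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'='ZLIB',
      'orc.bloom.filter.columns'='host',
      'orc.bloom.filter.fpp'='0.05'
    );
)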

> > I have decided I will move my raw-data into HIVE using ORC and zlib. How
> >would you guys recommend I do that?
> The best mechanism is always to write it via a Hive SQL ETL query.
> The real question is how the events are exactly organized. Is it a flat
> structure with something like a single line of JSON for each data item?

The data is single-line text events - nothing fancy, no multiline records
or binary. Each event is 200-800 bytes long. The events come in 5 formats
(depending on which application produces them), and none of them are JSON.
I wrote a small lib with 5 Java classes, each implementing parse(String raw)
and returning a JSONObject - it is used in my Storm bolts.

> That is much easier to process than other data formats - the gzipped
> data can be natively read by Hive without any trouble.
> The Hive-JSON-Serde is very useful for that, because it allows you to read
> random data out of the system - each "view" would be an external table
> enforcing a schema onto a fixed data set (including maps/arrays).
> You would create maybe 3-4 of these schema-on-read tables, then insert
> into your ORC structures from those tables.
> If you had binary data, it would be much easier to write a converter
> to JSON and then follow the same process, rather than attempting a
> direct ORC writer - especially if you want more than one view out of
> the same data via external tables.

So I need to write my own format reader - a custom SerDe, specifically the
Deserializer part? Then 5 schema-on-read external tables using my custom
SerDe, and then 5 new tables created via select statements from the
external tables, stored as ORC with e.g. zlib?

That doesn't sound too bad! I expect bugs :)
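
To make sure I've understood, here's a minimal sketch of what I think that
flow looks like for one of the five event types (the SerDe class, table
names and columns are all made up by me):

    -- Schema-on-read over the raw gzipped files; the gzip text is read
    -- natively, and my (hypothetical) custom SerDe parses each raw line
    -- into typed columns.
    CREATE EXTERNAL TABLE events_type1_raw (
      ts STRING,
      host STRING,
      payload MAP<STRING,STRING>
    )
    ROW FORMAT SERDE 'com.example.hive.EventType1SerDe'
    LOCATION '/data/raw/type1/';

    -- Target table: ORC with zlib compression.
    CREATE TABLE events_type1 (
      ts STRING,
      host STRING,
      payload MAP<STRING,STRING>
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB');

    -- The ETL itself is then a plain insert-select; repeat per type.
    INSERT OVERWRITE TABLE events_type1
    SELECT ts, host, payload FROM events_type1_raw;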

> > 2) write a storm-topology to read the parsed_topic and stream them to
> >Hive/ORC.
> You need to effectively do that to keep a live system running.
> We've had some hiccups with the ORC feeder bolt earlier with the <2s ETL
> speeds (see
> That needs some metastore tweaking to work perfectly (tables to be marked
> transactional etc), but nothing beyond config params.

Ok, thanks for the tip. We put all the data into Elasticsearch, where it is
searchable for N days, so getting the data into Hive as quickly as possible
is not a big priority - but Storm does seem like a nice place to put my
code to do this.
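
For reference, this is what I take the metastore tweaking to mean, going by
the Hive wiki on ACID/streaming tables (table and column names are
placeholders of mine):

    -- Streaming ingest needs an ORC table that is bucketed and marked
    -- transactional; the txn manager must be enabled too (plus the
    -- compactor settings in hive-site.xml).
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    CREATE TABLE events_stream (
      ts STRING,
      host STRING,
      payload MAP<STRING,STRING>
    )
    CLUSTERED BY (host) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');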

When we're able to stream parsed JSON data from Kafka into Hive/ORC, we'll
stop putting raw gzip data onto HDFS.

All of this is just to catch up on and clean out our historical garbage bin
of data, which piled up while we got Kafka - Storm - Elasticsearch
running :-)

> > 3) use spark instead of map-reduce. Only, I don't see any benefits in
> >doing so with this scenario.
> The ORC writers in Spark (even if you merge the PR SPARK-2883) are really
>slow because they are built against hive-13.x (which was my "before"
> comparison in all my slides).

Yes, I did play with Hive older than 13.x, without Tez. I liked your
performance data better :-)

> I really wish they'd merge those changes into a release, so that I could
> make ORC+Spark fast.
> Cheers,
> Gopal
Kjell Tore
