hive-user mailing list archives

From Gopal Vijayaraghavan <>
Subject Re: Parsing and moving data to ORC from HDFS
Date Wed, 22 Apr 2015 19:59:14 GMT

> In production we run HDP 2.2.4. Any thought when crazy stuff like bloom
>filters might move to GA?

I'd say it will be in the next release, considering it is already
checked into hive-trunk.

Bloom filters aren't too crazy today. They are written into the ORC file
right next to the row-index data, so there are no staleness issues; beyond
that, they're fairly well-understood structures.
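As a rough illustration of why they're well understood, here is a minimal Bloom filter sketch in plain Java. This is illustrative only; ORC's actual filter implementation differs and is serialized beside the row-index stream, as described above.

```java
import java.util.BitSet;

// Toy Bloom filter: k bit positions per value, derived from two base hashes.
// Illustrative only -- not ORC's real implementation.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th probe position from two base hashes.
    private int position(long h1, long h2, int i) {
        return (int) Math.floorMod(h1 + i * h2, (long) size);
    }

    public void add(String value) {
        long h1 = value.hashCode();
        long h2 = h1 * 31 + 17;
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(h1, h2, i));
        }
    }

    // May return a false positive, but never a false negative.
    public boolean mightContain(String value) {
        long h1 = value.hashCode();
        long h2 = h1 * 31 + 17;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter bf = new SimpleBloomFilter(1024, 5);
        bf.add("hello");
        System.out.println(bf.mightContain("hello")); // true: no false negatives
    }
}
```

The no-false-negative property is what makes the structure safe for predicate pushdown: a "maybe" keeps the row group, and only a definite "no" prunes it.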

I'm working through "bad use" safety scenarios, like someone searching for
"11" (as a string) in a data-set which contains doubles.

Hive's FilterOperator casts this dynamically, but the ORC PPD has to do
those type promotions exactly as Hive would in FilterOperator throughout
the bloom filter checks.
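To see why the promotion matters: a filter built over a DOUBLE column stores hashes of doubles, so probing it with the raw string "11" would report "not found" and wrongly prune row groups that do match once Hive casts the literal at runtime. A toy sketch, with a HashSet standing in for the bloom filter (all names here are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a Set<Object> stands in for a bloom filter built over
// a DOUBLE column. Probing with the raw string misses; promoting the
// literal to double first (as Hive's FilterOperator would) hits.
public class PromotionDemo {
    public static boolean probeRaw(Set<Object> filter, String literal) {
        // Wrong: a String never equals a Double, so this always misses.
        return filter.contains(literal);
    }

    public static boolean probePromoted(Set<Object> filter, String literal) {
        // Right: promote the literal the same way Hive's cast would.
        return filter.contains(Double.parseDouble(literal));
    }

    public static void main(String[] args) {
        Set<Object> filter = new HashSet<>();
        filter.add(11.0); // the column contains the double 11.0
        System.out.println(probeRaw(filter, "11"));      // false: would wrongly prune
        System.out.println(probePromoted(filter, "11")); // true: row group kept
    }
}
```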

Calling something production-ready needs that sort of work, not just the
feature's happy path of best performance.

> The data is single-line text events. Nothing fancy, no multiline or any
>binary. Each event is 200 - 800 bytes long.
> The format of these events are in 5 types (from which application
>produce them) and none are JSON. I wrote a small lib with 5 Java classes
> which interface parse(String raw) and return a JSONObject - utilized in
>my Storm bolts.

You could define that as a regular 1-column TEXTFILE and use a character
that never appears in the data as the delimiter (like ^A), which means you
should be able to do something like

select x.a, x.b, x.c from (select parse_my_format(line) as x from

A UDF is massively easier to write than a SerDe.
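For illustration, the parsing core of such a hypothetical parse_my_format UDF could be as small as this. The field layout and names are invented here, and the Hive GenericUDF/ObjectInspector wrapping (which would turn the array into a named Struct) is omitted:

```java
// Hedged sketch: the parsing core a struct-returning UDF would wrap.
// The pipe-delimited (timestamp, level, message) layout is made up for
// illustration; real event formats would each get their own parser class.
public class EventParser {
    public static String[] parse(String raw) {
        // Split a raw event line into exactly three fields.
        String[] parts = raw.split("\\|", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("unparseable event: " + raw);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] fields = parse("2015-04-22|INFO|started");
        System.out.println(fields[1]); // INFO
    }
}
```

One such class per event type maps cleanly onto the five parser classes described in the quoted message.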

I effectively do something similar with get_json_object() to extract 1
column out (FWIW, Tez SimpleHistoryLogging writes out a Hive table).

> So I need to write my own format reader, a custom SerDe - specifically
>the Deserializer part? Then 5 schema-on-read external tables using my
>custom SerDe.
> That doesn't sound too bad! I expect bugs :)

Well, the UDF returning a Struct is an alternative to writing a SerDe.

> This all is just to catch up and clean our historical, garbage bin of
>data which piled up while we got Kafka - Storm - Elasticsearch running :-)

One problem at a time, I guess.

If any of this needs help, that's the sort of thing this list exists for.

