hive-user mailing list archives

From "Moore, Douglas" <Douglas.Mo...@thinkbiganalytics.com>
Subject Re: custom binary format
Date Fri, 12 Dec 2014 16:02:43 GMT
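You want to look into ADD JAR and CREATE FUNCTION (for UDFs), ROW FORMAT SERDE
'full.class.name' for the SerDe, and STORED AS INPUTFORMAT 'full.class.name'
OUTPUTFORMAT 'full.class.name' for a custom input format.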

For tutorials, search for "adding custom serde". I found one from
Cloudera:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

Depending on your numbers (rows per file, bytes per file, files per time
interval, number of containers or map slots, memory per slot or container),
splitting your files may not be necessary to obtain good performance.
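
If you go without splits, the core of the reader is just "skip the header,
then hand out blobs of k bytes". Below is a minimal plain-Java sketch of that
loop, not a finished InputFormat. The header layout is purely an assumption
for illustration (a 4-byte header length followed by a 4-byte row length,
both big-endian); the real proprietary format will differ. The class name
FixedRowReader and the method readRows are hypothetical as well.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: reads one variable-length header followed by n rows of k bytes,
 * sequentially and without splitting the file.
 *
 * ASSUMED header layout (hypothetical, not the real proprietary format):
 *   bytes 0-3: big-endian int, total header length in bytes (at least 8)
 *   bytes 4-7: big-endian int, row length k in bytes
 *   any remaining header bytes are skipped
 */
final class FixedRowReader {

    /** Returns every row of the stream as a k-byte blob. */
    static List<byte[]> readRows(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int headerLength = in.readInt();
        int rowLength = in.readInt();
        if (headerLength < 8 || rowLength <= 0) {
            throw new IOException("Invalid header: headerLength="
                    + headerLength + ", rowLength=" + rowLength);
        }
        // Skip whatever is left of the header after the two ints.
        in.readFully(new byte[headerLength - 8]);

        List<byte[]> rows = new ArrayList<>();
        while (true) {
            int first = in.read();
            if (first < 0) {
                break; // clean end of file, no partial row pending
            }
            byte[] row = new byte[rowLength];
            row[0] = (byte) first;
            // Throws EOFException if the last row is truncated.
            in.readFully(row, 1, rowLength - 1);
            rows.add(row);
        }
        return rows;
    }
}
```

In an actual Hive setup this loop would presumably live in the next() method
of your RecordReader, passing each blob (for example as a BytesWritable) to
the SerDe, which then parses the fields.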

- Douglas




On 12/12/14 2:17 AM, "Ingo Thon" <isthon@gmx.de> wrote:

>Dear List,
>
>
>I want to set up a DW based on Hive. However, my data does not come as
>handy csv files but as binary files in a proprietary format.
>
>The binary file consists of:
>- 1 header with a dynamic number of bytes; its length can be read from the
>contents of the header.
>   The header tells me how to parse the rows and how many bytes each row
>has.
>- n rows of k bytes, where k is defined within the header
>
>
>The solution I have in mind looks as follows:
>- Write a custom InputFormat which chunks the data into blobs of length k
>but skips the bytes of the header. So I'd have two parameters for the
>InputFormat (bytes to skip, bytes per row).
>  Do I really have to build this myself, or does something like this already
>exist? Worst case, I could also remove the header prior to pushing the
>data into HDFS.
>- Write a custom SerDe to parse the blobs. At least in theory, easy.
>
>The coding part does not look too complicated; however, I'm kind of
>struggling with how to compile and install such a SerDe. I installed Hive
>from source and imported it into Eclipse.
>I guess I now have to build my own project... Still, I'm a little bit lost.
>Is there any tutorial which describes the process?
>And also, is my general idea OK?
>
>thanks in advance

