hive-user mailing list archives

From Ingo Thon <ist...@gmx.de>
Subject custom binary format
Date Fri, 12 Dec 2014 07:17:12 GMT
Dear List,


I want to set up a data warehouse (DW) based on Hive. However, my data does not come as handy
CSV files but as binary files in a proprietary format.

The binary file consists of
- 1 header of variable length; the length can be read from the contents of the header itself.
   The header tells me how to parse the rows and how many bytes each row has.
- n rows of k bytes each, where k is defined within the header


The solution I have in mind looks as follows:
- Write a custom InputFormat which skips the bytes of the header and chunks the rest of the
data into blobs of length k. So I’d have two parameters for the InputFormat (bytes to skip,
bytes per row).
  Do I really have to build this myself, or does something like this already exist? Worst case,
I could also remove the header prior to pushing the data into HDFS.
- Write a custom SerDe to parse the blobs. At least in theory, this is easy.
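
To make the chunking step concrete, here is a minimal sketch in plain Java, without the
Hadoop API. It only shows the core loop a custom record reader would need: skip a given
number of header bytes, then cut the remainder into records of a fixed length. The class
name FixedRecordReader and the parameters headerBytes and recordBytes are hypothetical
names chosen for illustration; they are not part of Hive or Hadoop.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Hive or Hadoop class.
class FixedRecordReader {

    // Fills buf completely. Returns false on a clean end of stream (no bytes read),
    // and throws if the stream ends partway through a record.
    private static boolean readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                if (off == 0) {
                    return false;
                }
                throw new EOFException(
                        "truncated record: got " + off + " of " + buf.length + " bytes");
            }
            off += n;
        }
        return true;
    }

    // Skips headerBytes bytes, then returns the rest of the stream as
    // records of recordBytes bytes each.
    static List<byte[]> readRecords(InputStream in, long headerBytes, int recordBytes)
            throws IOException {
        if (headerBytes < 0 || recordBytes <= 0) {
            throw new IllegalArgumentException("headerBytes must be >= 0, recordBytes > 0");
        }
        // InputStream.skip may skip fewer bytes than requested, so loop.
        long remaining = headerBytes;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // Nothing skipped: read one byte to tell "no progress" from end of stream.
                if (in.read() < 0) {
                    throw new EOFException("stream is shorter than the header");
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        List<byte[]> rows = new ArrayList<>();
        while (true) {
            byte[] row = new byte[recordBytes];
            if (!readFully(in, row)) {
                break;
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Calling FixedRecordReader.readRecords(stream, 3, 2) on a 9-byte stream would skip 3 bytes
and return 3 records of 2 bytes each. This sketch reads a whole file sequentially; a real
InputFormat would additionally have to make sure that each input split starts on a record
boundary, that is, at an offset of the form header length + i * k.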

The coding part does not look too complicated; however, I’m struggling with how to
compile and install such a SerDe. I installed Hive from source and imported it into Eclipse.
I guess I now have to build my own project, but I’m still a little lost. Is there any
tutorial which describes the process?
Also, is my general idea OK?

Thanks in advance