hive-user mailing list archives

From Andrew Mains <andrew.ma...@kontagent.com>
Subject Re: custom binary format
Date Thu, 18 Dec 2014 23:18:27 GMT
So in Hive you can actually do that via the SET command (documented here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) as
follows:

hive> SET fixedlengthinputformat.record.length=<length>;

This value will be passed through to the JobConf, and the input format 
ought to pick it up from there.
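
For example (the table name is just for illustration; 9 matches the row size 
you describe below), a session might look like:

hive> SET fixedlengthinputformat.record.length=9;
hive> SELECT * FROM my_fixed_table LIMIT 10;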

This of course only works if you know the length of your records prior 
to reading the files, which it sounds like may not always be the case. 
In that case, you would probably need to subclass the input format, and 
override getRecordReader, with something like this (half pseudocode, sorry):

public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
   // you would have to implement this to read the length from the file's header
   int recordLength = getRecordLengthFromHeader(genericSplit, job);
   FixedLengthInputFormat.setRecordLength(job, recordLength);
   return super.getRecordReader(genericSplit, job, reporter);
}
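
Once that subclass is compiled into a jar, wiring it into Hive would look 
roughly like this (jar path, class name, columns and table name are all made 
up; swap in your own):

hive> ADD JAR /path/to/my-inputformat.jar;
hive> CREATE EXTERNAL TABLE fixed_records (a TINYINT, b BIGINT)
        STORED AS
          INPUTFORMAT 'com.example.MyFixedLengthInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
        LOCATION '/data/fixed/';

Whether the default serde then interprets each 9-byte row the way you want is 
a separate question (that's the LazySimpleSerDe discussion further down the 
thread).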

There might be some other logic around skipping the header bytes as well.
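
If the header length itself varies per file, another option is to skip 
FixedLengthInputFormat's own reader entirely and write a small input format 
plus record reader yourself. A rough, untested sketch (class names and the two 
header-parsing helpers are placeholders for your proprietary format; it keeps 
each file in a single split so the header offset can't break record alignment):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class HeaderAwareFixedLengthInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // one split per file, so the header is always at the start of the split
    return false;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new HeaderAwareRecordReader((FileSplit) split, job);
  }

  private static class HeaderAwareRecordReader
      implements RecordReader<LongWritable, BytesWritable> {

    private final FSDataInputStream in;
    private final int recordLength;
    private final long end;   // first byte past the split
    private long pos;         // current position in the file

    HeaderAwareRecordReader(FileSplit split, JobConf job) throws IOException {
      Path file = split.getPath();
      in = file.getFileSystem(job).open(file);
      // placeholders: parse your header to learn its size and the row size
      long headerLength = readHeaderLength(in);
      recordLength = readRecordLength(in);
      in.seek(headerLength);              // skip past the header
      pos = headerLength;
      end = split.getStart() + split.getLength();
    }

    @Override
    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (pos + recordLength > end) {
        return false;                     // no complete record left
      }
      byte[] record = new byte[recordLength];
      IOUtils.readFully(in, record, 0, recordLength);
      key.set(pos);
      value.set(record, 0, recordLength);
      pos += recordLength;
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return pos; }
    @Override public float getProgress() { return end == 0 ? 0.0f : pos / (float) end; }
    @Override public void close() throws IOException { in.close(); }

    // you would implement these against your header layout
    private static long readHeaderLength(FSDataInputStream in) throws IOException {
      throw new UnsupportedOperationException("parse header length here");
    }
    private static int readRecordLength(FSDataInputStream in) throws IOException {
      throw new UnsupportedOperationException("parse record length here");
    }
  }
}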

Hope this helps!

Andrew


On 12/18/14, 2:59 PM, Ingo Thon wrote:
> Hello Andrew,
>
> this one does indeed look like a good idea.
> However, there is another problem here. This InputFormat expects that
> conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, recordLength);
> has been called, but I haven’t found any way to specify a parameter for an
> InputFormat. Do you have any hints on how to do it?
>
> Ingo
>
> On 18 Dec 2014, at 23:40, Andrew Mains <andrew.mains@kontagent.com> wrote:
>
>> Hi Ingo,
>>
>> Take a look at https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FixedLengthInputFormat.html
>> -- it seems to be designed for use cases very similar to yours. You may need to
>> subclass it to make things work precisely the way you need (in particular, to
>> deal with the header properly), but I think it ought to be a good place to start.
>>
>> Andrew
>>
>> On 12/18/14, 2:25 PM, Ingo Thon wrote:
>>> Hi, thanks for the answers so far; however, I still think there must be an
>>> easy way.
>>> The file format I’m looking at is pretty simple.
>>> First there is a header of n bytes, which can be ignored. After that comes
>>> the data.
>>> The data consists of rows, where each row has 9 bytes:
>>> first a 1-byte int (0..256), then an 8-byte int (0….)
>>>
>>> If I understand correctly, lazy.LazySimpleSerDe should do the SerDe part.
>>> Is that right? So if I say the schema is TinyInt,Int64, will a row consisting
>>> of 9 bytes be correctly parsed?
>>>
>>> The only thing missing would then be a proper input format.
>>> Ignoring the header, org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat
>>> would actually do the output part.
>>> Any hints on how to do the input part?
>>>
>>> thanks in advance!
>>>
>>>
>>>
>>> On 12 Dec 2014, at 17:02, Moore, Douglas <Douglas.Moore@thinkbiganalytics.com> wrote:
>>>
>>>> You want to look into ADD JAR and CREATE FUNCTION (for UDFs), and ROW FORMAT
>>>> SERDE 'full.class.name' for the serde.
>>>>
>>>> For tutorials, google for "adding custom serde", I found one from
>>>> Cloudera:
>>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>>
>>>> Depending on your numbers (rows / file, bytes / file, files per time
>>>> interval, #containers || map slots, mem size per slot or container)
>>>> creating a split of your file may not be necessary to obtain good
>>>> performance.
>>>>
>>>> - Douglas
>>>>
>>>>
>>>>
>>>>
>>>> On 12/12/14 2:17 AM, "Ingo Thon" <isthon@gmx.de> wrote:
>>>>
>>>>> Dear List,
>>>>>
>>>>>
>>>>> I want to set up a DW based on Hive. However, my data does not come as
>>>>> handy csv files but as binary files in a proprietary format.
>>>>>
>>>>> The binary file  consists of
>>>>> - 1 header of a dynamic number of bytes, which can be read from the
>>>>> contents of the header
>>>>>   The header tells me how to parse the rows and how many bytes each row
>>>>> has.
>>>>> - n rows of k bytes, where k is defined within the header
>>>>>
>>>>>
>>>>> The solution I have in mind looks as follows
>>>>> - Write a custom InputFormat which chunks the data into blobs of length k
>>>>> but skips the bytes of the header. So I’d have two parameters for the
>>>>> InputFormat (bytes to skip, bytes per row).
>>>>> Do I really have to build this myself, or does something like this already
>>>>> exist? Worst case, I could also remove the header prior to pushing the
>>>>> data into HDFS.
>>>>> - Write a custom SerDe to parse the blobs. At least in theory this is easy.
>>>>>
>>>>> The coding part does not look too complicated; however, I’m kind of
>>>>> struggling with how to compile and install such a serde. I installed Hive
>>>>> from source and imported it into Eclipse.
>>>>> I guess I now have to build my own project… Still, I’m a little bit lost.
>>>>> Is there any tutorial which describes the process?
>>>>> And also, is my general idea OK?
>>>>>
>>>>> thanks in advance

