hive-user mailing list archives

From: David Kulp <>
Subject: Re: using the key from a SequenceFile
Date: Thu, 19 Apr 2012 18:12:54 GMT
To answer my own question -- so that someone else may benefit some day -- I've found that there is nothing special about key or value formats in a SequenceFile.  As has been noted, keys are ignored.  Each new key/value pair is seen as a new row from Hive's perspective.  There is no concept of using Writables, such as ArrayWritable, to create nested structures in a value field that Hive would automatically parse.  SequenceFile knows nothing about record delimiters.  There is just an ignored key and a value that is an opaque byte stream.

Thus, the simplest approach is just to use the default Lazy SerDe text format to create a multi-column row in an MR program whose output will be read by Hive.  For example, your MR program would set its output format to SequenceFileOutputFormat with Text values.
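As a rough sketch of that setup, using the old org.apache.hadoop.mapred API that was current at the time (HiveFriendlyJob and the RowReducer it references are hypothetical names, and the mapper is omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class HiveFriendlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(HiveFriendlyJob.class);
    conf.setJobName("hive-friendly-sequencefile");

    // The mapper (omitted here) is assumed to emit Text/Text pairs.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setReducerClass(RowReducer.class);  // hypothetical; sketched below

    // Hive ignores the key, so NullWritable will do; each value is one
    // Control-A delimited row as plain Text.
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}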


The reducer (or the mapper, if there is no reducer) would send values to the collector with Control-A delimiters between column values.  Numbers, for instance, need no special format in this approach; every column is plain text.  For example,

output.collect(dummy, new Text(col1 + "\001" + col2));
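Fleshed out, such a reducer might look like the following (a minimal sketch on the old org.apache.hadoop.mapred API; RowReducer, the column choices, and the count logic are hypothetical, not from the original post):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class RowReducer extends MapReduceBase
    implements Reducer<Text, Text, NullWritable, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<NullWritable, Text> output,
                     Reporter reporter) throws IOException {
    // Hypothetical columns: col1 is the grouping key, col2 counts its values.
    long count = 0;
    while (values.hasNext()) {
      values.next();
      count++;
    }
    // Plain-text columns separated by Control-A; Hive's default
    // LazySimpleSerDe splits the value on \001, so numbers need no
    // special binary encoding.
    String col1 = key.toString();
    String col2 = Long.toString(count);
    output.collect(NullWritable.get(), new Text(col1 + "\001" + col2));
  }
}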

In Hive, create your table with "STORED AS SEQUENCEFILE" and you should be golden.
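For concreteness, a matching DDL sketch for the two-column example above (table and column names are hypothetical; \001 is already the default field delimiter for Hive's LazySimpleSerDe, so no ROW FORMAT clause is needed):

CREATE TABLE my_rows (col1 STRING, col2 BIGINT)
STORED AS SEQUENCEFILE;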

You could presumably use one of the alternative SerDes in your MR program instead, but I haven't tried that yet.


On Apr 19, 2012, at 8:52 AM, David Kulp wrote:

> But I'm not clear on how to write a single row of multiple values in my MR program, since my only way to output data is to send values to the collector.  Are you saying that there's no row delimiter and I simply make repeated calls to the collector, e.g.
> output.collect(null, row1col1)
> output.collect(null, row1col2)
> ...
> output.collect(null, row2col1)
> output.collect(null, row2col2)
> If that's the case, then there's no explicit row boundary in the data, which also implies that there's no reliable way to split such a file later when Hive runs its own MR job.
> Or is it along the lines of the following?
> ArrayList<Object> row = new ArrayList<Object>();
> row.add(row1col1);
> row.add(row1col2);
> output.collect(null, row);
> Thanks in advance!
> On Apr 19, 2012, at 8:21 AM, Ruben de Vries wrote:
>> Hive can handle a sequence file just like a text file, except that it omits the key completely and uses only the value part.  Other than that, you won't notice the difference between a sequence file and a plain text file.
>> From: David Kulp [] 
>> Sent: Thursday, April 19, 2012 2:13 PM
>> To:
>> Subject: Re: using the key from a SequenceFile
>> I'm trying to achieve something very similar.  I want to write an MR program that writes results in a record-based SequenceFile that would be directly readable from Hive as though it were created using "STORED AS SEQUENCEFILE" with, say, BinarySortableSerDe.
>> From this discussion it seems that Hive does not / cannot take advantage of the key/values in a SequenceFile, but rather it requires a value that is serialized using a SerDe.  Is that right?
>> If so, does that mean that the right approach is to use the BinarySortableSerDe to pass the collector a row's worth of data as the Writable value?  And would Hive "just work" on such data?
>> If SequenceFileOutputFormat is used, will it automatically place sync markers in the file to allow for file splitting?
>> Thanks!
>> (ps. As an aside, Avro would be better.  Wouldn't it be a huge win for MapReduce to have an AvroOutputFileFormat and for Hive to have a SerDe that reads such files?  It seems like there's a natural correspondence between the richer data representations of an SQL schema and an Avro schema, and there's already code for working with Avro in MR as input.)
>> On Apr 19, 2012, at 6:15 AM, madhu phatak wrote:
>> A SerDe will allow you to create custom data from your sequence file.

>> On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <> wrote:
>> I'm trying to migrate a part of our current Hadoop jobs from normal MapReduce jobs to Hive.
>> Previously the data was stored in sequence files, with the keys containing valuable data.
>> However, if I load the data into a table I lose that key data (or at least I can't access it with Hive); I want to somehow use the key from the sequence file in Hive.
>> I know this has come up before, since I can find hints of people needing it, but I can't seem to find a working solution, and since I'm not very good with Java I really can't get it done myself :(
>> Does anyone have a snippet of something like this working?
>> I get errors like:
>> ../hive/mapred/ cannot find symbol
>>     [javac] symbol  : constructor SequenceFileRecordReader()
>>     [javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader<K,V>
>>     [javac] public class CustomSeqRecordReader<K, V> extends SequenceFileRecordReader<K, V> implements RecordReader<K, V> {
>> Hope someone has a snippet or can help me out; I would really love to be able to switch part of our jobs to Hive.
>> Ruben de Vries
>> -- 
