hadoop-mapreduce-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: SequenceFile as map input
Date Thu, 08 Jul 2010 21:22:33 GMT
Hi Alan,

Is the content of the original file ASCII text?  Then you should be using
the <Object, Text, Text, Text> signature.  By default, 'hadoop fs -text ...'
just calls toString() on the object.  You get the object itself in the map()
method and can do whatever you want with it.  If Text or BytesWritable does
not work for you, you can always write your own class implementing the
Writable interface:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html

Let me know if you need more details on how to do this.

Alex K

On Thu, Jul 8, 2010 at 1:59 PM, Alan Miller <somebody@squareplanet.de> wrote:

>  Hi Alex,
>
> I'm not sure what you mean. I already set my mapper's signature to:
>
>   public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
>      ...
>      public void map(Text key, BytesWritable value, Context context) {
>         ...
>      }
>   }
>
> In my map() loop, the contents of value is the text from the original file,
> and value.toString() returns a String of bytes as hex pairs separated by
> spaces.
> But I'd like the original tab-separated list of strings (i.e. the lines in
> my original files).
>
> I see BytesWritable.getBytes() returns a byte[]. I guess I could write my
> own RecordReader to convert the byte[] back to text strings, but I thought
> this is something the framework would provide.
>
> Alan
>
>
> On 07/08/2010 08:42 PM, Alex Loddengaard wrote:
>
> Hi Alan,
>
>  SequenceFiles keep track of the key and value types, so you should be able
> to use the Writables in the signature.  Though it looks like you're using
> the new API, and I admit I'm not an expert with it.  Have you tried using
> the Writables in the signature?
>
> Alex
>
> On Thu, Jul 8, 2010 at 6:44 AM, Some Body <somebody@squareplanet.de> wrote:
>
>> To get around the small-file problem (I have thousands of 2MB log files)
>> I wrote a class to convert all my log files into a single SequenceFile in
>> (Text key, BytesWritable value) format.  That works fine. I can run this:
>>
>>    hadoop fs -text /my.seq |grep peemt114.log | head -1
>>    10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the native-hadoop
>> library
>>    10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully loaded &
>> initialized native-zlib library
>>    10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new decompressor
>>    peemt114.log    70 65 65 6d 74 31 31 34 09 .........[snip].......
>>
>> which shows my file name key (peemt114.log)
>> and the file contents value, which appears to have been converted to hex.
>> The hex values up to the first tab (09) translate to my hostname.
>>
>> I'm trying to adapt my mapper to use the SequenceFile as input.
>>
>> I changed the job's inputFormatClass to:
>>    MyJob.setInputFormatClass(SequenceFileInputFormat.class);
>> and modified my mapper signature to:
>>   public class MyMapper extends Mapper<Object, BytesWritable, Text, Text>
>> {
>>
>> but how do I convert the value back to Text? When I print out the
>> key/values using:
>>        System.out.printf("MAPPER INKEY: [%s]\n", key);
>>        System.out.printf("MAPPER INVAL: [%s]\n", value.toString());
>> I get:
>>    MAPPER INKEY: [peemt114.log]
>>    MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]
>>
>> Alan
>>
>
>
>
