hadoop-mapreduce-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: SequenceFile as map input
Date Fri, 09 Jul 2010 18:42:02 GMT
Hi Alan,

You don't need to do this complex trickery if you write <Object,Text> to the
SequenceFile.  How do you create the SequenceFile?  In your case it might
make sense to create a <Text,Text> SequenceFile where the first object is
the file name or complete path and the second is the content.
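
For example, roughly like this (an untested sketch, imports omitted; the
input directory, output path, and file-reading details are placeholders for
whatever your converter already does):

    // Merge many small text files into one <Text,Text> SequenceFile.
    // (Assumes the enclosing method declares throws IOException.)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/my.seq"), Text.class, Text.class);
    try {
      for (FileStatus stat : fs.listStatus(new Path("/logs"))) {
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        in.readFully(buf);
        in.close();
        writer.append(new Text(stat.getPath().getName()),   // key: file name
                      new Text(new String(buf, "UTF-8")));  // value: content
      }
    } finally {
      writer.close();
    }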

Then you just call:

process_line(value.toString(), context);

without having to do the StringBuilder thing.
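
With a <Text,Text> file the whole mapper shrinks to something like this
(untested sketch; process_line is your existing helper):

    public class MyMapper extends Mapper<Text, Text, Text, Text> {
      @Override
      public void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        // value holds the whole file; split it if process_line
        // expects a single line at a time.
        for (String line : value.toString().split("\n")) {
          process_line(line, context);
        }
      }
    }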

Alex K

On Fri, Jul 9, 2010 at 10:10 AM, Alan Miller <somebody@squareplanet.de> wrote:

>  Hi Alex,
>
> My original files are ASCII text. I was using <Object, Text, Text, Text>
> and everything worked fine.
> Because my files are small (~2MB on avg.) I get one map task per file.
> For my test I had 2000 files, totalling 5GB, and the whole run took approx.
> 40 minutes.
>
> I read that I could improve performance by merging my original files into
> one big SequenceFile.
>
> I did that, and that's why I'm trying to use <Object, BytesWritable, Text,
> Text>.
> My new SequenceFile is only 444MB, so my m/r job triggered 7 map tasks, but
> apparently my new map() is computationally more intensive and the whole run
> now takes 64 minutes.
>
> In my map(Text key, BytesWritable value, Context context), value contains
> the contents of a whole file. I tried to break it down into line-based
> records which I send to reduce().
>
>    // Note: getBytes() returns the padded backing buffer, so only the
>    // first getLength() bytes are valid.
>    StringBuilder line = new StringBuilder();
>    char linefeed = '\n';
>    byte[] bytes = value.getBytes();
>    for (int i = 0; i < value.getLength(); i++) {
>        byte byt = bytes[i];
>        if ((int) byt == (int) linefeed) {
>            line.append((char) byt);
>            process_line(line.toString(), context);
>            line.delete(0, line.length());
>        } else {
>            line.append((char) byt);
>        }
>    }
>
> Alan
>
>
> On 07/08/2010 11:22 PM, Alex Kozlov wrote:
>
> Hi Alan,
>
> Is the content of the original file ASCII text?  Then you should be using
> the <Object, Text, Text, Text> signature.  By default 'hadoop fs -text ...'
> will just call toString() on the object.  You get the object itself in the
> map() method and can do whatever you want with it.  If Text or BytesWritable
> does not work for you, you can always write your own class implementing the
> Writable interface:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html
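>
> A minimal Writable looks roughly like this (sketch; LogRecord and its
> fields are made-up names for illustration):
>
>    public class LogRecord implements Writable {
>      private Text host = new Text();
>      private Text line = new Text();
>
>      public void write(DataOutput out) throws IOException {
>        host.write(out);
>        line.write(out);
>      }
>
>      public void readFields(DataInput in) throws IOException {
>        host.readFields(in);
>        line.readFields(in);
>      }
>    }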
>
> Let me know if you need more details on how to do this.
>
> Alex K
>
> On Thu, Jul 8, 2010 at 1:59 PM, Alan Miller <somebody@squareplanet.de> wrote:
>
>>  Hi Alex,
>>
>> I'm not sure what you mean. I already set my mapper's signature to:
>>
>>    public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
>>        ...
>>        public void map(Text key, BytesWritable value, Context context) {
>>            ...
>>        }
>>    }
>>
>> In my map() loop the contents of value is the text from the original file,
>> and value.toString() returns a String of bytes as hex pairs separated by
>> spaces.
>> But I'd like the original tab-separated list of strings (i.e. the lines in
>> my original files).
>>
>> I see BytesWritable.getBytes() returns a byte[]. I guess I could write my
>> own RecordReader to convert the byte[] back to text strings, but I thought
>> this is something the framework would provide.
>>
>> Alan
>>
>>
>> On 07/08/2010 08:42 PM, Alex Loddengaard wrote:
>>
>> Hi Alan,
>>
>> SequenceFiles keep track of the key and value type, so you should be able
>> to use the Writables in the signature.  Though it looks like you're using
>> the new API, and I admit that I'm not an expert with the new API.
>> Have you tried using the Writables in the signature?
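>>
>> Something along these lines, perhaps (untested sketch):
>>
>>    public class MyMapper extends Mapper<Text, BytesWritable, Text, Text> {
>>      @Override
>>      public void map(Text key, BytesWritable value, Context context)
>>          throws IOException, InterruptedException {
>>        // Only the first getLength() bytes of the backing array are valid.
>>        String contents =
>>            new String(value.getBytes(), 0, value.getLength(), "UTF-8");
>>        // ... process contents ...
>>      }
>>    }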
>>
>> Alex
>>
>> On Thu, Jul 8, 2010 at 6:44 AM, Some Body <somebody@squareplanet.de> wrote:
>>
>>> To get around the small-file problem (I have thousands of 2MB log files)
>>> I wrote a class to convert all my log files into a single SequenceFile in
>>> (Text key, BytesWritable value) format.  That works fine. I can run this:
>>>
>>>    hadoop fs -text /my.seq | grep peemt114.log | head -1
>>>    10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>> library
>>>    10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully loaded &
>>> initialized native-zlib library
>>>    10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new decompressor
>>>    peemt114.log    70 65 65 6d 74 31 31 34 09 .........[snip].......
>>>
>>> which shows my file-name key (peemt114.log)
>>> and my file-contents value, which appears to be converted to hex.
>>> The hex values up to the first tab (09) translate to my hostname.
>>>
>>> I'm trying to adapt my mapper to use the SequenceFile as input.
>>>
>>> I changed the job's inputFormatClass to:
>>>    MyJob.setInputFormatClass(SequenceFileInputFormat.class);
>>> and modified my mapper signature to:
>>>   public class MyMapper extends Mapper<Object, BytesWritable, Text, Text>
>>> {
>>>
>>> but how do I convert the value back to Text? When I print out the
>>> key/values using:
>>>        System.out.printf("MAPPER INKEY: [%s]\n", key);
>>>        System.out.printf("MAPPER INVAL: [%s]\n", value.toString());
>>> I get:
>>>    MAPPER INKEY: [peemt114.log]
>>>    MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]
>>>
>>> Alan
>>>
>>
>
