hadoop-user mailing list archives

From David LaBarbera <davidlabarb...@localresponse.com>
Subject Re: Loader for small files
Date Mon, 11 Feb 2013 18:29:56 GMT
You could store your data with a smaller block size. Do something like

HADOOP_OPTS="-Ddfs.block.size=1048576 -Dfs.local.block.size=1048576" hadoop fs -cp /org-input /small-block-input

You might only need one of those parameters. You can verify the block size with

hadoop fsck /small-block-input

In your pig script, you'll probably need to set
pig.maxCombinedSplitSize 
to something around the block size
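For example, if the files were copied with the 1 MB block size above, a one-line sketch of that setting in the Pig script would be:

```pig
-- Cap combined splits at roughly one block (1048576 bytes = 1 MB here)
set pig.maxCombinedSplitSize 1048576;
```

Without this, Pig's split combination can merge the small blocks back into a single input split, defeating the smaller block size.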

David

On Feb 11, 2013, at 1:24 PM, Something Something <mailinglists19@gmail.com> wrote:

> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> HBase.  Adding 'hadoop' user group.
> 
> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> mailinglists19@gmail.com> wrote:
> 
>> Hello,
>> 
>> We are running into performance issues with Pig/Hadoop because our input
>> files are small.  Everything goes to only 1 Mapper.  To get around this, we
>> are trying to use our own Loader like this:
>> 
>> 1)  Extend PigStorage:
>> 
>> import org.apache.hadoop.mapreduce.InputFormat;
>> import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
>> import org.apache.pig.builtin.PigStorage;
>> 
>> public class SmallFileStorage extends PigStorage {
>> 
>>    public SmallFileStorage(String delimiter) {
>>        super(delimiter);
>>    }
>> 
>>    @Override
>>    public InputFormat getInputFormat() {
>>        return new NLineInputFormat();
>>    }
>> }
>> 
>> 
>> 
>> 2)  Add command line argument to the Pig command as follows:
>> 
>> -Dmapreduce.input.lineinputformat.linespermap=500000
>> 
>> 
>> 
>> 3)  Use SmallFileStorage in the Pig script as follows:
>> 
>> USING com.xxx.yyy.SmallFileStorage ('\t')
>> 
>> 
>> But this doesn't seem to work.  We still see that everything is going to
>> one mapper.  Before we spend any more time on this, I am wondering if this
>> is a good approach – OR – if there's a better approach?  Please let me
>> know.  Thanks.
>> 
>> 
>> 
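One possible variant of the loader above (a sketch, not the poster's tested code): instead of relying on the -D flag reaching the job configuration, set the lines-per-split value directly in Pig's setLocation hook via NLineInputFormat's own setter. The 500000 figure mirrors the command-line value from the question.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.builtin.PigStorage;

public class SmallFileStorage extends PigStorage {

    public SmallFileStorage(String delimiter) {
        super(delimiter);
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        super.setLocation(location, job);
        // Pin the lines-per-split on the job itself, so the setting does not
        // depend on the -D property being propagated. 500000 matches the
        // value passed on the command line in the question.
        NLineInputFormat.setNumLinesPerSplit(job, 500000);
    }

    @Override
    public InputFormat getInputFormat() {
        return new NLineInputFormat();
    }
}
```

Even with this, pig.maxCombinedSplitSize may still need lowering as suggested above, since Pig combines small splits by default.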

