hadoop-user mailing list archives

From David LaBarbera <davidlabarb...@localresponse.com>
Subject Re: Loader for small files
Date Mon, 11 Feb 2013 20:38:54 GMT
What process creates the data in HDFS? You should be able to set the block size there and avoid
the copy.

I would test the dfs.block.size on the copy and see if you get the mapper split you want before
worrying about optimizing.
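As a rough model of why the block size on the copy drives the mapper count: for a splittable file, FileInputFormat creates approximately one split (and thus one mapper) per block, so recopying with a smaller block size multiplies the splits. A minimal sketch of that arithmetic (plain math, not the actual Hadoop split code, which also handles min/max split sizes and slop):

```python
import math

def estimated_splits(file_size_bytes, block_size_bytes):
    """Rough model: FileInputFormat yields ~one split per block,
    and even an empty/tiny file still gets one split."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 512 MB file at a 64 MB block size -> 8 mappers;
# the same file recopied at a 1 MB block size -> 512 mappers.
print(estimated_splits(512 * 1024**2, 64 * 1024**2))
print(estimated_splits(512 * 1024**2, 1 * 1024**2))
```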

David

On Feb 11, 2013, at 2:10 PM, Something Something <mailinglists19@gmail.com> wrote:

> David:  Your suggestion would add an additional step of copying data from
> one place to another.  Not bad, but not ideal.  Is there no way to avoid
> copying of data?
> 
> BTW, we have tried changing the following options to no avail :(
> 
> set pig.splitCombination false;
> 
> & a few other 'dfs' options given below:
> 
> mapreduce.min.split.size
> mapreduce.max.split.size
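> 
> (Aside: on Hadoop 2.x those split-size knobs were renamed, so the older names
> are silently ignored there. A sketch of the equivalent Pig settings, using the
> Hadoop 2 property names; adjust to your cluster's version:
> 
> ```pig
> set mapreduce.input.fileinputformat.split.minsize 1048576;
> set mapreduce.input.fileinputformat.split.maxsize 1048576;
> set pig.splitCombination false;
> ```
> )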
> 
> Thanks.
> 
> On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
> davidlabarbera@localresponse.com> wrote:
> 
>> You could store your data in smaller block sizes. Do something like
>> HADOOP_OPTS="-Ddfs.block.size=1048576 -Dfs.local.block.size=1048576" \
>>   hadoop fs -cp /org-input /small-block-input
>> You might only need one of those parameters. You can verify the block size
>> with
>> hadoop fsck /small-block-input
>> 
>> In your pig script, you'll probably need to set
>> pig.maxCombinedSplitSize
>> to something around the block size
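>> 
>> In Pig Latin that would presumably look like (value assumed to match the
>> 1 MB block size used in the copy above):
>> 
>> ```pig
>> set pig.maxCombinedSplitSize 1048576;
>> ```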
>> 
>> David
>> 
>> On Feb 11, 2013, at 1:24 PM, Something Something <mailinglists19@gmail.com>
>> wrote:
>> 
>>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
>>> HBase.  Adding 'hadoop' user group.
>>> 
>>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> We are running into performance issues with Pig/Hadoop because our input
>>>> files are small.  Everything goes to only 1 Mapper.  To get around
>> this, we
>>>> are trying to use our own Loader like this:
>>>> 
>>>> 1)  Extend PigStorage:
>>>> 
>>>> import org.apache.hadoop.mapreduce.InputFormat;
>>>> import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
>>>> import org.apache.pig.builtin.PigStorage;
>>>> 
>>>> public class SmallFileStorage extends PigStorage {
>>>> 
>>>>   public SmallFileStorage(String delimiter) {
>>>>       super(delimiter);
>>>>   }
>>>> 
>>>>   // Split by line count instead of by block, so small files
>>>>   // can still fan out across multiple mappers.
>>>>   @Override
>>>>   public InputFormat getInputFormat() {
>>>>       return new NLineInputFormat();
>>>>   }
>>>> }
>>>> 
>>>> 
>>>> 
>>>> 2)  Add command line argument to the Pig command as follows:
>>>> 
>>>> -Dmapreduce.input.lineinputformat.linespermap=500000
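>>>> 
>>>> i.e., something along these lines (a sketch; with the standard pig
>>>> launcher the -D arguments must come before the script name):
>>>> 
>>>> ```shell
>>>> pig -Dmapreduce.input.lineinputformat.linespermap=500000 myscript.pig
>>>> ```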
>>>> 
>>>> 
>>>> 
>>>> 3)  Use SmallFileStorage in the Pig script as follows:
>>>> 
>>>> USING com.xxx.yyy.SmallFileStorage ('\t')
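>>>> 
>>>> i.e., a full load statement would look something like this (path and
>>>> schema are hypothetical):
>>>> 
>>>> ```pig
>>>> logs = LOAD '/small-block-input'
>>>>        USING com.xxx.yyy.SmallFileStorage('\t')
>>>>        AS (user:chararray, ts:long);
>>>> ```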
>>>> 
>>>> 
>>>> But this doesn't seem to work.  We still see that everything is going to
>>>> one mapper.  Before we spend any more time on this, I am wondering if
>> this
>>>> is a good approach – OR – if there's a better approach?  Please let me
>>>> know.  Thanks.
>>>> 
>>>> 
>>>> 
>> 
>> 

