hadoop-common-user mailing list archives

From Something Something <mailinglist...@gmail.com>
Subject Re: Loader for small files
Date Mon, 11 Feb 2013 19:10:01 GMT
David:  Your suggestion would add an extra step of copying the data from
one place to another.  Not bad, but not ideal.  Is there any way to avoid
copying the data?

BTW, we have tried changing the following options to no avail :(

set pig.splitCombination false;

and a few split-size options, given below:

mapreduce.min.split.size
mapreduce.max.split.size

Thanks.
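
For context on the split-size options above, here is a minimal sketch of the arithmetic that Hadoop's FileInputFormat (new mapreduce API) applies when sizing splits: the split size is max(minSize, min(maxSize, blockSize)). The class name, method names, and the 40 MB / 64 MB figures are illustrative assumptions, not values from this thread, and the sketch ignores Pig's own split combination, which can merge small splits again afterwards.

```java
// Sketch only: mirrors the split-size arithmetic of Hadoop's FileInputFormat.
// Class and method names are hypothetical; no Hadoop classes are used.
public class SplitSizeSketch {

    // Split size = max(minSize, min(maxSize, blockSize)).
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate split count for one file: ceil(fileSize / splitSize).
    // The real code lets the last split run slightly over; that is ignored here.
    public static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 40 * mb;   // assumed size of one "small" input file
        long blockSize = 64 * mb;  // assumed HDFS block size

        // No limits set: the split size falls back to the block size.
        long defaults = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("default splits: " + approxSplits(fileSize, defaults));

        // Maximum split size capped at 8 MB.
        long capped = computeSplitSize(blockSize, 1L, 8 * mb);
        System.out.println("capped splits:  " + approxSplits(fileSize, capped));

        // Data rewritten with a 1 MB block size (1048576 bytes).
        long smallBlocks = computeSplitSize(1 * mb, 1L, Long.MAX_VALUE);
        System.out.println("small-block splits: " + approxSplits(fileSize, smallBlocks));
    }
}
```

With these assumed numbers, lowering the maximum split size or the block size is what raises the split count; raising the minimum does not. Property names for the minimum and maximum differ between Hadoop versions, so it may be worth confirming the names against the version in use.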

On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
davidlabarbera@localresponse.com> wrote:

> You could store your data with a smaller block size. Do something like
> hadoop fs -Ddfs.block.size=1048576 -Dfs.local.block.size=1048576 -cp /org-input /small-block-input
> You might only need one of those parameters. You can verify the block size
> with
> hadoop fsck /small-block-input
>
> In your pig script, you'll probably need to set
> pig.maxCombinedSplitSize
> to something around the block size
>
> David
>
> On Feb 11, 2013, at 1:24 PM, Something Something <mailinglists19@gmail.com>
> wrote:
>
> > Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> > HBase.  Adding 'hadoop' user group.
> >
> > On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> > mailinglists19@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> We are running into performance issues with Pig/Hadoop because our input
> >> files are small.  Everything goes to only 1 Mapper.  To get around this, we
> >> are trying to use our own Loader like this:
> >>
> >> 1)  Extend PigStorage:
> >>
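> >> import org.apache.hadoop.mapreduce.InputFormat;
> >> import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
> >> import org.apache.pig.builtin.PigStorage;
> >>
> >> // PigStorage variant that asks for N-line splits instead of block-based ones.
> >> public class SmallFileStorage extends PigStorage {
> >>
> >>    public SmallFileStorage(String delimiter) {
> >>        super(delimiter);
> >>    }
> >>
> >>    // Lines per split come from mapreduce.input.lineinputformat.linespermap.
> >>    @Override
> >>    public InputFormat getInputFormat() {
> >>        return new NLineInputFormat();
> >>    }
> >> }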
> >>
> >>
> >>
> >> 2)  Add command line argument to the Pig command as follows:
> >>
> >> -Dmapreduce.input.lineinputformat.linespermap=500000
> >>
> >>
> >>
> >> 3)  Use SmallFileStorage in the Pig script as follows:
> >>
> >> USING com.xxx.yyy.SmallFileStorage ('\t')
> >>
> >>
> >> But this doesn't seem to work.  We still see that everything is going to
> >> one mapper.  Before we spend any more time on this, I am wondering if this
> >> is a good approach – OR – if there's a better approach?  Please let me
> >> know.  Thanks.
> >>
> >>
> >>
>
>
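
One possible reading of the result described above: NLineInputFormat creates one split per N lines of each file, so a file with fewer lines than linespermap still yields a single split. A rough sketch of that arithmetic follows; the class name, method name, and line counts are hypothetical, and Pig may still combine the resulting splits unless split combination is disabled.

```java
// Sketch only: per-file split count when each split holds at most N lines.
// Names and line counts are hypothetical; no Hadoop classes are used.
public class LinesPerMapSketch {

    // ceil(lines / linesPerMap) for a single input file.
    public static long splitsForFile(long lines, long linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        return (lines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        long linesInFile = 120_000L;  // assumed line count of one small file

        // The value from the thread: 500000 lines per map.
        System.out.println("linespermap=500000: "
                + splitsForFile(linesInFile, 500_000L) + " split(s)");

        // A smaller, purely illustrative value.
        System.out.println("linespermap=10000:  "
                + splitsForFile(linesInFile, 10_000L) + " split(s)");
    }
}
```

If the real files hold fewer than 500000 lines each, a lower linespermap value would be one thing to try before abandoning the custom loader.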
