hadoop-common-user mailing list archives

From Steve Kuo <kuosen...@gmail.com>
Subject Re: Problem with large .lzo files
Date Mon, 15 Feb 2010 16:07:23 GMT
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Steve,
> I'm not sure here whether you mean that the DistributedLzoIndexer job
> is failing, or if the job on the resulting split file is failing. Could
> you clarify?
The DistributedLzoIndexer job did complete successfully.  It was one of the
tasks on the resulting splits that always failed, while the tasks on the
other splits succeeded.

By the way, if all of the files have already been indexed,
DistributedLzoIndexer does not detect that, and Hadoop throws an exception
complaining that the input dir (or file) does not exist.  I work around this
by catching the exception; a cleaner guard is sketched below.
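A minimal sketch of such a guard, using only the standard FileSystem API
(the class and helper names here are mine, and hadoop-lzo names its indexes
<file>.lzo.index as far as I can tell):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LzoIndexGuard {
      /** Returns the .lzo files under dir that do not yet have an index. */
      static List<Path> unindexedLzoFiles(Configuration conf, Path dir)
          throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        List<Path> pending = new ArrayList<Path>();
        for (FileStatus stat : fs.listStatus(dir)) {
          Path p = stat.getPath();
          // Skip files whose <name>.lzo.index sibling already exists.
          if (p.getName().endsWith(".lzo") && !fs.exists(p.suffix(".index"))) {
            pending.add(p);
          }
        }
        return pending;
      }
    }

Running DistributedLzoIndexer only when the returned list is non-empty
avoids the missing-input exception entirely.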

> >   - It's possible to sacrifice parallelism by having Hadoop work on
> >   each .lzo file without indexing.  This worked well until the file
> >   size exceeded 30G, when an array indexing exception got thrown.
> >   Apparently the code processed the file in chunks and stored
> >   references to the chunks in an array.  When the number of chunks
> >   grew beyond a certain limit (around 256, as I recall), the
> >   exception was thrown.
> >   - My current workaround is to increase the number of reducers to
> >   keep the .lzo file size low.
> >
> > I would like advice on how people handle large .lzo files.  Any
> > pointers on the cause of the stack trace below and the best way to
> > resolve it are greatly appreciated.
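
For concreteness, the reducer workaround mentioned above is nothing more
than sizing arithmetic plus setNumReduceTasks.  A minimal sketch with the
old mapred API (the byte counts are made-up numbers, not from my job):

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerSizing {
      public static void main(String[] args) {
        JobConf job = new JobConf(ReducerSizing.class);
        // Made-up numbers: ~600 GB of reduce output, aiming for ~20 GB
        // per .lzo file, comfortably under the ~30 GB trouble mark.
        long outputBytes = 600L << 30;
        long targetBytes = 20L << 30;
        int reducers = (int) ((outputBytes + targetBytes - 1) / targetBytes);
        job.setNumReduceTasks(Math.max(1, reducers));
        // ... set mapper/reducer, formats, paths; then JobClient.runJob(job)
      }
    }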
> Is this reproducible every time? If so, is it always at the same point
> in the LZO file that it occurs?

It's at the same point.  Do you know how to print out the lzo index for the
task?  I only print out the input file now.
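What I have so far is roughly the snippet below, using the per-task
properties that, as I understand it, the old mapred API sets for file-based
splits; the start/length values at least pin the failure to a byte range in
the .lzo file:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class SplitLoggingBase extends MapReduceBase {
      @Override
      public void configure(JobConf job) {
        // The framework fills these in for file-based splits (old API).
        System.err.println("split: " + job.get("map.input.file")
            + " start=" + job.get("map.input.start")
            + " len=" + job.get("map.input.length"));
      }
    }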

> Would it be possible to download that lzo file to your local box and
> use lzop -d to see if it decompresses successfully? That way we can
> isolate whether it's a compression bug or decompression.

Both the Java LzoDecompressor and lzop -d were able to decompress the file
correctly.  As a matter of fact, my job does not index .lzo files now but
processes each file as a whole, and it works.
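The in-Java check was along these lines (a sketch only: it streams the whole
file through hadoop-lzo's LzopCodec and relies on the decompressor throwing
on any corruption; native liblzo must be on the library path):

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import com.hadoop.compression.lzo.LzopCodec;

    public class LzoCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path(args[0]);               // the .lzo file to verify
        FileSystem fs = p.getFileSystem(conf);
        LzopCodec codec = new LzopCodec();
        codec.setConf(conf);                      // picks up native bindings
        InputStream in = codec.createInputStream(fs.open(p));
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        for (int n; (n = in.read(buf)) > 0; ) {   // any corruption throws here
          total += n;
        }
        in.close();
        System.out.println("decompressed " + total + " bytes OK");
      }
    }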
