hadoop-common-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Losing Records with Block Compressed Sequence File
Date Sat, 22 Jan 2011 12:45:20 GMT

2011/1/21 David Sinclair <dsinclair@chariotsolutions.com>:
> Hi, I am seeing an odd problem when writing block compressed sequence files.
> If I write 400,000 records into a sequence file w/o compression, all 400K
> end up in the file. If I write with block, regardless if it is bz2 or
> deflate, I start losing records. Not a ton, but a couple hundred.

How big is the output file?
How many splits are created?

> Here are the exact numbers
> bz2      399,734
> deflate  399,770
> none     400,000
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
> anyone ever see this behavior?

I've been working on HADOOP-7076, which makes Gzip splittable (the
feature is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all the seams are accurate (no missing records and no duplicates).
A few days ago I ran my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.

Perhaps my unit test is buggy, perhaps you and I have independently
found something that should be reported as a bug.
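A minimal repro of what David describes might look like this. This is
only a sketch, assuming the Hadoop 0.20-era SequenceFile API; the local
path and record payload are illustrative, and the conf values mirror the
ones from his mail:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileCountCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 4096);            // 4K
        conf.setInt("io.seqfile.compress.blocksize", 1 << 20); // 1MB
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path("/tmp/block-compressed.seq");

        // Write 400,000 records with BLOCK compression (default codec).
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, path, IntWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK);
        try {
            for (int i = 0; i < 400000; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }

        // Read the file back and count how many records actually made it.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        long count = 0;
        try {
            while (reader.next(key, value)) {
                count++;
            }
        } finally {
            reader.close();
        }
        System.out.println("records read back: " + count);
    }
}
```

If the count printed here is already below 400,000, the records are lost
on the write path; if it is exactly 400,000, the loss would have to
happen later, at the split seams during reading.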
Best regards,

Niels Basjes
