hadoop-mapreduce-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Should splittable Gzip be a "core" hadoop feature?
Date Tue, 28 Feb 2012 15:50:27 GMT
Hi,

Some time ago I had an idea and implemented it.

Normally a gzipped input file can only be processed by a single mapper,
and thus only on a single CPU core.
What I created makes it possible to process a gzipped file in such a way
that it can be handled by several mappers in parallel.
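
To illustrate the general idea, here is a minimal sketch using only the
JDK. It is not the code of the codec itself, and it makes assumptions:
the class and method names (SkipSeekSketch, readRange) are hypothetical,
and the ranges are expressed as uncompressed byte offsets purely to keep
the example short. A gzip stream cannot be entered in the middle, so each
worker decompresses from the beginning, throws away everything before its
own range, and keeps only the bytes inside it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Hadoop or of the codec.
public class SkipSeekSketch {

    // Helper for the demo only: gzip-compress a byte array in memory.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        }
        return buf.toByteArray();
    }

    // Return the uncompressed bytes in [start, end). The stream is always
    // decompressed from its first byte; data before 'start' is discarded.
    static byte[] readRange(byte[] gz, long start, long end) throws IOException {
        byte[] scratch = new byte[4096];
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            // Phase 1: decompress and discard everything before 'start'.
            long toSkip = start;
            while (toSkip > 0) {
                int n = in.read(scratch, 0, (int) Math.min(scratch.length, toSkip));
                if (n < 0) {
                    return new byte[0]; // range starts beyond end of data
                }
                toSkip -= n;
            }
            // Phase 2: decompress and keep the bytes that belong to this range.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long remaining = end - start;
            while (remaining > 0) {
                int n = in.read(scratch, 0, (int) Math.min(scratch.length, remaining));
                if (n < 0) {
                    break; // end of data reached before 'end'
                }
                out.write(scratch, 0, n);
                remaining -= n;
            }
            return out.toByteArray();
        }
    }
}
```

Calling readRange(gz, 0, 10) and readRange(gz, 10, 20) from two workers
would together cover the first 20 uncompressed bytes exactly once. The
price is that later ranges repeat the decompression work of the earlier
ones, so the gain depends on how expensive the per-record processing is
compared to decompression.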

I've put the javadoc I created on my homepage so you can read more about
the details.
http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec

Now the question that was raised by one of the people reviewing this code
was: Should this implementation be part of the core Hadoop feature set?
The main concern raised was that the feature requires a bit more
understanding of what is happening and as such cannot be enabled by default.

I would like to hear from the Hadoop Core/MapReduce users what you think.

Should this be
- a part of the default Hadoop feature set so that anyone can simply enable
it by setting the right configuration?
- a separate library?
- a nice idea I had fun building but that no one needs?
- ... ?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
