hadoop-hdfs-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: hdfs block size cont.
Date Thu, 17 Mar 2011 14:05:04 GMT
On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter <liors@infolinks.com> wrote:
> Hi,
> If I have big gzip files (>> block size), will M/R split a single
> file into multiple blocks and send them to different mappers?
> The behavior I currently see is that a map is still opened per file
> (and not per block).

Yes, this is true. This is the current behavior with GZip files (since
they can't be split and decompressed from an arbitrary offset). I had
somehow managed to ignore the GZIP part of your question in the
previous thread!
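For reference, the one-map-per-gzip behavior comes from the input
format's isSplitable() check: a file whose codec can't be split yields
exactly one split. Below is a minimal sketch of that check against the
mapreduce API; the subclass itself is illustrative, and note the
SplittableCompressionCodec test only exists in 0.21+ (older releases
effectively just return codec == null):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative subclass showing how splitability is decided.
public class SplitCheck extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (codec == null) {
      return true;  // Uncompressed text: split freely at block boundaries.
    }
    // GzipCodec does not implement SplittableCompressionCodec, so a
    // .gz file yields exactly one split -> one mapper per file.
    return codec instanceof SplittableCompressionCodec;
  }
}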

But still, ~60 files worth 15 GB total would mean at least 3 GB per
file. And since they can't really be split right now, it would be good
to have each file occupy only a single block. Perhaps for these files
alone you could use a block size of 3-4 GB, thereby making the reads
more local for your record readers?
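If you try that, the block size can be set per file at write time
rather than cluster-wide, via the blockSize argument to
FileSystem.create(). A minimal sketch (the paths and the 4 GB figure
here are placeholders for your setup, not anything prescribed):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Copy a local gzip file into HDFS with a per-file 4 GB block size,
// so the whole file lands in a single block.
public class PutWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 4L * 1024 * 1024 * 1024;  // 4 GB
    InputStream in = new FileInputStream("/local/data/part-0001.gz");
    OutputStream out = fs.create(
        new Path("/user/lior/input/part-0001.gz"),
        true,                                   // overwrite
        conf.getInt("io.file.buffer.size", 4096),
        fs.getDefaultReplication(),
        blockSize);
    IOUtils.copyBytes(in, out, conf, true);     // closes both streams
  }
}

From the shell, something like
hadoop fs -Ddfs.block.size=4294967296 -put <src> <dst>
should achieve the same, assuming your release still reads the
dfs.block.size property.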

In the future, HADOOP-7076 plans to add a pseudo-splitting approach
for plain GZIP files, though. 'Concatenated' GZIP files could be split
(HADOOP-6835) across mappers as well (as demonstrated in PIG-42).

Harsh J
