hadoop-hdfs-user mailing list archives

From Lior Schachter <li...@infolinks.com>
Subject Re: hdfs block size cont.
Date Thu, 17 Mar 2011 14:21:16 GMT
Currently each gzip file is about 250 MB (60 files ≈ 15 GB total), so we
use a 256 MB block size.

However, I understand that smaller files/blocks let M/R parallelize the
processing better.

So maybe having 128 MB gzip files with a corresponding 128 MB block size
would be better?
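
If so, the block size can be set per file when the gzips are written to
HDFS. A minimal sketch (the destination path, buffer size and replication
factor below are just placeholders), using the FileSystem.create overload
that takes an explicit blockSize argument:

    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 128 MB block size, matching the proposed 128 MB gzip files.
        long blockSize = 128L * 1024 * 1024;
        Path dst = new Path("/data/input/part-0001.gz");  // placeholder path

        // create(path, overwrite, bufferSize, replication, blockSize)
        OutputStream out = fs.create(dst, true, 4096, (short) 3, blockSize);
        try {
          // ... stream the gzip bytes into 'out' here ...
        } finally {
          out.close();
        }
      }
    }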


On Thu, Mar 17, 2011 at 4:05 PM, Harsh J <qwertymaniac@gmail.com> wrote:

> On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter <liors@infolinks.com>
> wrote:
> > Hi,
> > If I have big gzip files (>> block size), will M/R split a single
> > file into multiple blocks and send them to different mappers?
> > The behavior I currently see is that a map is still opened per file
> > (and not per block).
>
> Yes, this is true. This is the current behavior with GZip files (since
> an arbitrary chunk of a gzip stream can't be decompressed on its own).
> I had somehow managed to overlook the GZIP part of your question in the
> previous thread!
>
> But still, ~60 files worth 15 GB total would mean at least 3 GB per
> file. And seeing how they can't really be split right now, it would be
> good to have each of them occupy only a single block. Perhaps for these
> files alone you could use a block size of 3-4 GB, thereby making these
> file reads more local for your record readers?
>
> In the future, HADOOP-7076 plans to add a pseudo-splitting mechanism for
> plain GZIP files, though. 'Concatenated' GZIP files could be split
> (HADOOP-6835) across mappers as well (as demonstrated in PIG-42).
>
> --
> Harsh J
> http://harshj.com
>
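
For reference, the split decision comes down to whether the input's codec
is splittable. A rough sketch of that check (assuming the stock
CompressionCodecFactory lookup that newer TextInputFormat versions apply;
the path argument is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplitCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        Path p = new Path(args[0]);  // e.g. /data/input/part-0001.gz
        CompressionCodec codec = factory.getCodec(p);

        if (codec == null) {
          System.out.println(p + ": not compressed, split on block boundaries");
        } else if (codec instanceof SplittableCompressionCodec) {
          System.out.println(p + ": splittable codec ("
              + codec.getClass().getSimpleName() + ")");
        } else {
          System.out.println(p + ": non-splittable codec (e.g. gzip), one mapper per file");
        }
      }
    }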
