hadoop-common-dev mailing list archives

From "Bryan A. Pendleton" ...@geekdom.net>
Subject Re: Why part files are not merged ?
Date Wed, 19 Jul 2006 17:57:36 GMT
The implied reason is that it puts a bottleneck on I/O - to write one file
(with current HDFS semantics), the bytes for that file all have to pass
through a single host. So, you can have N reduces writing to HDFS in
parallel, or you can have one output file written from one machine. It also
means, in the current implementation, that you must have enough room (x2 or
x3 at this point) for that whole output file on a single drive of a single

Unless your output is being read by something other than Java, it's pretty easy to make
your next process read all of the output files in parallel. I've even done
this when generating MapFiles from jobs... there is code in place to make
this work already. Alternately, you can force there to be a single reducer
in the job settings.
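
As a rough illustration of reading all of the part files as one logical output, here is a minimal sketch in Python. It is not Hadoop code: it assumes the job's output directory has already been copied to a local filesystem, that the files follow the part-00000, part-00001, ... naming, and that each file holds newline-terminated text records. The function names (list_part_files, read_all_parts, merge_parts) are made up for this example.

```python
import glob
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def list_part_files(output_dir):
    """Return the part files in reducer order (part-00000, part-00001, ...)."""
    return sorted(glob.glob(os.path.join(output_dir, "part-*")))


def read_part(path):
    """Read one part file and return its lines without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()


def read_all_parts(output_dir, workers=4):
    """Read every part file in parallel; return all lines in part order."""
    paths = list_part_files(output_dir)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map() yields results in input order, so the chunks
        # line up with the sorted list of paths.
        chunks = list(pool.map(read_part, paths))
    return [line for chunk in chunks for line in chunk]


def merge_parts(output_dir, merged_path):
    """Concatenate the part files byte-for-byte into one local file.

    This is sequential on purpose: a single output file means a single
    writer. Assumes merged_path is not itself named part-* inside output_dir.
    """
    with open(merged_path, "wb") as out:
        for path in list_part_files(output_dir):
            with open(path, "rb") as f:
                shutil.copyfileobj(f, out)
```

The parallel reader keeps the per-reducer files as they are and only presents them as one sequence to the consumer. merge_parts is the fallback when a downstream tool really insists on one file, and it shows the same single-writer bottleneck on a small scale.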

On 7/19/06, Thomas FRIOL <thomas@anyware-tech.com> wrote:
> Hi all,
> Each reduce task produces one part file in the DFS. Why does the job tracker
> not merge them at the end of the job to produce a single file?
> It seems to me that a single file would be easier to process.
> I think there is certainly a reason for the current behavior but I really
> need to get results of my map reduce job in a single file. Maybe someone
> can give me a clue to solve my problem.
> Thanks for any help.
> Thomas.
> --
> Thomas FRIOL
> Développeur Eclipse / Eclipse Developer
> Solutions & Technologies
> Tél      : +33 (0)561 000 653
> Portable : +33 (0)609 704 810
> Fax      : +33 (0)561 005 146
> www.anyware-tech.com

Bryan A. Pendleton
Ph: (877) geek-1-bp