hadoop-common-user mailing list archives

From "Samuel Guo" <guosi...@gmail.com>
Subject evaluate the size of the input & split them in parallel
Date Sun, 16 Nov 2008 09:27:00 GMT
Hi all,

When I use Hadoop to run Map/Reduce jobs over a large dataset (many
thousands of large input files), the client seems to take quite a long time
to initialize the job before it actually starts running. I suspect it gets
stuck fetching the metadata of thousands of files from the NameNode and
computing their splits.
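
If I understand the client code correctly (I am on the old
org.apache.hadoop.mapred API, so please correct me if this is wrong), job
submission boils down to something like the sketch below, all on a single
client thread. MyJob here is just a placeholder for my driver class:

    // Sketch of what I believe the client does at submission time,
    // not the real JobClient code.
    JobConf job = new JobConf(MyJob.class);
    InputFormat inputFormat = job.getInputFormat();
    // For FileInputFormat this lists every input path against the
    // NameNode and then computes block-aligned splits, file by file.
    InputSplit[] splits = inputFormat.getSplits(job, job.getNumMapTasks());

With many thousands of input files, that serial loop of NameNode round
trips looks like where the time goes.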

Is there any way to evaluate the size of the input and construct the split
information in parallel? Could we run a lightweight map/reduce job to
construct the split information before initializing the main job? I think
it is worth constructing a job's split information in parallel when the job
has many thousands of input files.
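
To make the idea concrete, the sketch below is the kind of pre-computation
I have in mind for the metadata part: fetch the FileStatus objects with a
small thread pool before the job is submitted, then build the splits from
those. ParallelStatusFetcher and fetchStatuses are names I made up for
illustration, not an existing API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class ParallelStatusFetcher {
      // Lists every input path with a fixed-size thread pool so the
      // NameNode RPCs overlap instead of running one after another.
      public static List<FileStatus> fetchStatuses(final JobConf job,
          Path[] inputs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<FileStatus[]>> futures =
            new ArrayList<Future<FileStatus[]>>();
        for (final Path p : inputs) {
          futures.add(pool.submit(new Callable<FileStatus[]>() {
            public FileStatus[] call() throws Exception {
              FileSystem fs = p.getFileSystem(job);
              return fs.listStatus(p); // one NameNode RPC per path
            }
          }));
        }
        List<FileStatus> all = new ArrayList<FileStatus>();
        for (Future<FileStatus[]> f : futures) {
          for (FileStatus status : f.get()) {
            all.add(status);
          }
        }
        pool.shutdown();
        return all;
      }
    }

Pushing the split computation itself into a small map/reduce job would be
the next step, but even overlapping the listStatus() calls on the client
like this might already help a lot.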

Hoping for a reply.


