hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Limit the number of open files in MultipleTextOutputFormat
Date Fri, 10 Jul 2009 17:06:17 GMT
On Fri, Jul 10, 2009 at 1:16 AM, Marcus Herou <marcus.herou@tailsweep.com> wrote:

> However I am sure that we have more keys than that in our production data
> so I guess hadoop will throw the "Too many open files" exception then.

Generally, having lots of small files is very bad for performance.  It sounds
like you are headed in that direction.

Consider spilling your data into a MapFile, HBase or Voldemort.  That would
allow you to access your data by key, much as you would use a file name with
multiple output files.  Make sure you try HBase 0.20 for performance.

> I guess it is due to open/close stream efficiency that all streams are held
> open but I think that one can be tweaked to be more flexible.

This is also done because of the limitations on semantics that HDFS
imposes.  Files can only be written once.  Append is still in the future.

But aren't you grouping by your key in your reduce?  If so, you can close
each file as you finish processing the reduce group.
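
A minimal sketch of that idea, not Hadoop code: plain java.nio files stand in
for HDFS, and the class and method names (GroupedFileWriter, writeGrouped) are
hypothetical.  It assumes records arrive already grouped by key, as reduce
input does, and that each key is usable as a file name.  Only one file is open
at any time, and each file is created once and never re-opened.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; local files stand in for HDFS.
class GroupedFileWriter {

    /**
     * Writes one file per key into outDir, assuming the records are already
     * grouped by key. Returns the number of files created.
     */
    static int writeGrouped(List<Map.Entry<String, String>> groupedRecords, Path outDir)
            throws IOException {
        Files.createDirectories(outDir);
        String currentKey = null;
        BufferedWriter writer = null;
        int filesCreated = 0;
        try {
            for (Map.Entry<String, String> record : groupedRecords) {
                String key = record.getKey();
                if (!key.equals(currentKey)) {
                    // The key changed, so the previous group is complete:
                    // close its file before opening the next one.
                    if (writer != null) {
                        writer.close();
                    }
                    // CREATE_NEW fails if the file already exists, which
                    // imitates write-once semantics: a key that reappears
                    // later (ungrouped input) raises an exception.
                    writer = Files.newBufferedWriter(
                            outDir.resolve(key),
                            StandardCharsets.UTF_8,
                            StandardOpenOption.CREATE_NEW,
                            StandardOpenOption.WRITE);
                    currentKey = key;
                    filesCreated++;
                }
                writer.write(record.getValue());
                writer.newLine();
            }
        } finally {
            // Close the file belonging to the last group.
            if (writer != null) {
                writer.close();
            }
        }
        return filesCreated;
    }
}
```

In a real reducer the same pattern would apply per reduce call: open the
output for the group's key, write all of its values, and close it before
returning.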

If you aren't grouping by your key, why not?  Run another step of MR and the
problem of too many open files will disappear completely.  That won't fix
the architectural problem of storing your data in lots of little files.

> Input ? Perhaps point me in the right direction and I can submit a "patch"
> writing this myself.

I think that this is the wrong approach because it will give you a
non-scalable system and is going to be difficult to do well because you
can't re-open files.  HDFS file names are not a good substitute for a
database because file lookup cannot be parallelized.

BUT ... if you think you can make the change in a way useful to others, the
process is very simple.  File an issue on JIRA, then attach a patch.  People
will comment on the patch and the automated test system will help you think
about how to make it better.  If you can convince the committers of the
utility of the patch, you are in.  Convincing them that contributions are
useful and safe is easier if you put your changes into contrib rather
than trying to make the changes in core.

See here for more info:  http://wiki.apache.org/hadoop/HowToContribute

Be aware that Hadoop just splintered into several sub-projects due to the
rate of contributions and discussion.
