hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: Best practices for handling many small files
Date Wed, 23 Apr 2008 16:16:31 GMT
A million map processes are horrible. Aside from the overhead, don't do it if you share the cluster
with other jobs (all other jobs will get killed whenever the million-map job finishes;
see https://issues.apache.org/jira/browse/HADOOP-2393).

Well, even for #2, it raises the question of how the packing itself will be parallelized.

There's a MultiFileInputFormat that can be extended; it allows processing of multiple files
in a single map task. It needs improvement. For one, it's an abstract class, and a concrete
implementation for (at least) text files would help. Also, the splitting logic is not very
smart (from what I last saw). Ideally, it should take the million files and form them into
N groups (say N is the size of your cluster), where each group holds files local to one machine,
and then process each group on that machine. Currently it doesn't do this (the groups are arbitrary).
But it's still the way to go.
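
To make the grouping idea concrete, here is a minimal sketch in Python of one way the files could be formed into one group per machine. It is not the MultiFileInputFormat code and not a Hadoop API. The function name group_files_by_locality and its inputs are hypothetical: file_hosts is assumed to map each file name to the hosts holding a replica of it, and hosts is the list of machines in the cluster. Groups are balanced by file count, which is only reasonable when the files are of similar size.

```python
def group_files_by_locality(file_hosts, hosts):
    """Build one group of files per host, preferring hosts that hold a replica.

    file_hosts: dict of file name -> list of hosts holding a replica
                (hypothetical input; a real job would get this from the file system)
    hosts:      list of cluster host names; one group is created per host
    """
    groups = {h: [] for h in hosts}
    unplaced = []

    for name in sorted(file_hosts):
        # Only consider replica hosts that are part of this cluster.
        local = [h for h in file_hosts[name] if h in groups]
        if local:
            # Among the hosts with a local copy, pick the least-loaded one
            # (ties broken by host name so the result is deterministic).
            target = min(local, key=lambda h: (len(groups[h]), h))
            groups[target].append(name)
        else:
            unplaced.append(name)

    # Files with no local replica go to whichever group is smallest.
    for name in unplaced:
        target = min(hosts, key=lambda h: (len(groups[h]), h))
        groups[target].append(name)

    return groups
```

Each resulting group would then become the input of one map task scheduled on that host. A greedy pass like this does not guarantee evenly sized groups when replicas are concentrated on a few machines; it only shows the locality preference described above.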

-----Original Message-----
From: the.stuart.sierra@gmail.com on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that

-Stuart, altlaw.org
