hadoop-common-user mailing list archives

From Brian Vargas <br...@ardvaark.net>
Subject Re: Handling Large Number Of Files, Fastest Way
Date Mon, 19 May 2008 02:54:49 GMT

You can realize a huge improvement by packing them into a SequenceFile.
With lots of small files, every file needs its own name lookup against
the NameNode, and those lookups become a big bottleneck.

One easy approach is to make the key a Text holding the name of the file
that was loaded, and the value a BytesWritable holding the contents of
the file.  Since they're relatively small files (or you wouldn't be
having this problem), each record fits comfortably in memory, so you
won't have to worry about OOMing yourself.  It worked really well for
me, dealing with a few hundred thousand ~4 MB files.
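
As a rough sketch of the idea only, here is the same (filename, contents)
record layout in plain Java with no Hadoop dependency.  This is not the
SequenceFile on-disk format, and SmallFilePacker, pack and unpack are
made-up names for illustration.  In a real job the filename would go
into a Text key and the bytes into a BytesWritable value, appended
through a SequenceFile.Writer.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: packs many small files into one
// container of (filename, contents) records, mirroring the Text key /
// BytesWritable value layout described above.
public class SmallFilePacker {

    // Appends every regular file in inputDir to a single packed file and
    // returns the number of records written. The packed file is assumed to
    // live outside inputDir so it is not packed into itself.
    public static int pack(Path inputDir, Path packed) throws IOException {
        int count = 0;
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(packed));
             DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                // Whole file is read into memory; fine for small files.
                byte[] contents = Files.readAllBytes(file);
                out.writeUTF(file.getFileName().toString()); // key: filename
                out.writeInt(contents.length);               // value length
                out.write(contents);                         // value: file bytes
                count++;
            }
        }
        return count;
    }

    // Reads the packed file back into a filename -> contents map.
    public static Map<String, byte[]> unpack(Path packed) throws IOException {
        Map<String, byte[]> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(Files.newInputStream(packed))) {
            while (true) {
                String name;
                try {
                    name = in.readUTF();
                } catch (EOFException endOfFile) {
                    break; // no more records
                }
                byte[] contents = new byte[in.readInt()];
                in.readFully(contents);
                records.put(name, contents);
            }
        }
        return records;
    }
}
```

The point is that a job then opens one large file and streams records
out of it, instead of opening one tiny file per record.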


jkupferman wrote:
> Hi Everyone,
> I am working on a project which takes in data from a lot of text files, and
> although there are a lot of ways to do it, it is not clear to me which is
> the best/fastest. I am working on an EC2 cluster with approximately 20
> machines.
> The data is currently spread across 20k text files (a total of ~3 GB), each
> of which needs to be treated as a whole (no splits within those files), but
> I am willing to change around the format if I can get increased speed. Using
> the regular TextInputFormat adjusted to take in entire files is pretty slow
> since each file takes a minimum of about ~3 seconds no matter how small it
> is. 
> From what I have read, the possible options to proceed with are as follows:
> 1. Use MultiFileInputSplit, it seems to be designed for this sort of
> situation, but I have yet to see an implementation of this, or a commentary
> on its performance increase over the regular input.
> 2. Read the data in, and output it as a Sequence File and use the sequence
> file as input from there on out.
> 3. Condense the files down to a small number of files (say ~100) and then
> delimit the files so each part gets a separate record reader. 
> If anyone could give me guidance as to what will provide the best
> performance for this setup, I would greatly appreciate it.
> Thanks for your help
