hadoop-common-user mailing list archives

From Saptarshi Guha <saptarshi.g...@gmail.com>
Subject Re: Large(Thousands) of files -fast
Date Mon, 19 May 2008 03:52:07 GMT
Aah, use org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat as  
the inputformat.
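If the binary input format proves awkward to consume from a line-oriented Python streaming script, one common workaround (a sketch of my own, not something confirmed in this thread) is to pack each file as a single text-safe line - filename, a tab, then the base64-encoded contents - so the mapper can recover the raw bytes without worrying about newlines or encodings:

```python
import base64

def encode_record(name, data):
    # Pack one file as a single text-safe line: "name<TAB>base64(contents)".
    return "%s\t%s" % (name, base64.b64encode(data).decode("ascii"))

def decode_record(line):
    # Inverse of encode_record, for use inside the streaming mapper.
    name, encoded = line.rstrip("\n").split("\t", 1)
    return name, base64.b64decode(encoded)

# Round-trip demo: arbitrary binary bytes survive the text encoding.
line = encode_record("part-0001.dat", b"\x00\x01binary bytes\xff")
name, data = decode_record(line)

# In the real mapper you would loop over sys.stdin, calling
# decode_record(line) on each record.
```

The cost is roughly a 33% size increase from base64, which is usually acceptable for files this small.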

On May 18, 2008, at 11:17 PM, Saptarshi Guha wrote:

> Hello,
> 	I have a similar scenario to jkupferman's situation - thousands of  
> files, most in the KB range, some in the MBs, and a few in the GBs. I  
> am not too familiar with Java and am using Hadoop Streaming with  
> Python. The mapper must work on individual files.
> 	I've placed the thousands of files into the DFS.
> 	I've given the map job a manifest listing the locations of the  
> files; Hadoop streams this to my Python script, which then copies the  
> specified file and processes it.
> 	I also tried tar-ring the files, converting them into a sequence  
> file, and then using SequenceFileAsTextInputFormat.
> 	The problem with this is that it sends the file contents as a  
> string representation of the bytes, which I would have to convert.
> 	Q: Is there any way I can make it send me the data as a  
> BytesWritable (mentioned below), using the command line and Python?
> 	Thanks for your time.
> 	Regards	
> 	Saptarshi
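The manifest approach described above can be sketched roughly as follows (the per-file logic and helper names here are hypothetical, and it assumes the `hadoop fs -cat` CLI is available on each task node):

```python
import subprocess
import sys

def process_file(path, data):
    # Placeholder per-file logic; here it just emits the path and byte count.
    return "%s\t%d" % (path, len(data))

def run(lines, fetch):
    # Each manifest line fed to the mapper is one DFS path; fetch its
    # bytes and process the file as a whole.
    out = []
    for line in lines:
        path = line.strip()
        if not path:
            continue
        out.append(process_file(path, fetch(path)))
    return out

def hdfs_cat(path):
    # Copy the file's contents out of the DFS (assumes the hadoop CLI).
    return subprocess.check_output(["hadoop", "fs", "-cat", path])

# In the real streaming mapper:
#   for record in run(sys.stdin, hdfs_cat):
#       print(record)
```

Note that this keeps the per-file name-node lookups, so it doesn't remove the bottleneck Brian describes below; it only makes the mapper logic explicit.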
> On May 18, 2008, at 10:54 PM, Brian Vargas wrote:
>> Hi,
>> You can realize a huge improvement by sticking them into a sequence  
>> file.  With lots of small files, name lookups against the name node  
>> will be a big bottleneck.
>> One easy approach is making the key be a Text of the filename that  
>> was loaded in, and the value be a BytesWritable, which is the  
>> contents of the file.  Since they're relatively small files (or you  
>> wouldn't be having this problem), you won't have to worry about  
>> OOMing yourself.  It worked really well for me, dealing with a few  
>> hundred thousand ~4MB files.
>> Brian
>> jkupferman wrote:
>>> Hi Everyone,
>>> I am working on a project which takes in data from a lot of text  
>>> files, and
>>> although there are a lot of ways to do it, it is not clear to me  
>>> which is
>>> the best/fastest. I am working on an EC2 cluster with  
>>> approximately 20
>>> machines.
>>> The data is currently spread across 20k text files (~3 GB total), each
>>> of which needs to be treated as a whole (no splits within those  
>>> files), but
>>> I am willing to change around the format if I can get increased  
>>> speed. Using
>>> the regular TextInputFormat adjusted to take in entire files is  
>>> pretty slow
>>> since each file takes a minimum of about 3 seconds no matter how  
>>> small it
>>> is.
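For scale, a back-of-envelope calculation using the figures above (20k files, ~3 s minimum per file, 20 machines) shows why that per-file overhead dominates:

```python
files = 20000        # ~20k input files
overhead_s = 3       # ~3 s minimum per file, regardless of size
machines = 20        # EC2 cluster size

total_task_seconds = files * overhead_s            # 60,000 s of pure overhead
wall_clock_minutes = total_task_seconds / machines / 60.0
print(wall_clock_minutes)  # -> 50.0 minutes before any real work is done
```

At ~3 GB of actual data, nearly the entire job is task-startup cost, which is what packing the files together eliminates.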
>>> From what I have read, the possible options to proceed are as follows:
>>> 1. Use MultiFileInputSplit, it seems to be designed for this sort of
>>> situation, but I have yet to see an implementation of this, or a  
>>> commentary
>>> on its performance increase over the regular input.
>>> 2. Read the data in, and output it as a Sequence File and use the  
>>> sequence
>>> file as input from there on out.
>>> 3. Condense the files down to a small number of files (say ~100)  
>>> and then
>>> delimit the files so each part gets a separate record reader.
>>> If anyone could give me guidance as to what will provide the best
>>> performance for this setup, I would greatly appreciate it.
>>> Thanks for your help
> Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha
> The typewriting machine, when played with expression, is no more
> annoying than the piano when played by a sister or near relation.
> 		-- Oscar Wilde

Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha
Back when I was a boy, it was 40 miles to everywhere, uphill both ways
and it was always snowing.
