hadoop-common-user mailing list archives

From: Ted Dunning <tdunn...@veoh.com>
Subject: Re: Very weak mapred performance on small clusters with a massive amount of small files
Date: Fri, 30 Nov 2007 17:34:11 GMT

I would suggest that you move to something like the Nutch architecture.

The general idea is that you have a list of new things to crawl that is
distributed to "map" functions that implement your crawler.  These crawl
everything in their list, producing your 64MB files.  If one fails, it
will be restarted and you won't lose the data.  If your crawl list is split
into small enough chunks, then each worker has to munch on several pieces,
so if somebody gets a really slow part, the others will take up the slack.
You should also set the number of maps per machine high enough to keep the
network busy.

The reduce phase can be trivial or can actually do something interesting in
terms of coalescing different inputs.
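
Something like the following is all the map side needs (an untested sketch
against the org.apache.hadoop.mapred API; Fetcher.fetch() is a stand-in for
whatever fetching code your crawlers already have):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch: one input line = one URL to crawl.  Fetcher.fetch() is a
// placeholder for the HTTP code your crawlers already contain.
public class CrawlMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text url,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String page = Fetcher.fetch(url.toString());   // placeholder helper
    output.collect(url, new Text(page));
    reporter.progress();          // keeps long fetches from being timed out
  }
}

Run it with enough map tasks per node that the network stays busy while
individual fetches block.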

On 11/30/07 3:04 AM, "André Martin" <mail@andremartin.de> wrote:

> Hi Joydeep, Ted, Bob, Doug and all Hadoopers,
> thanks for the suggestions & ideas. Here is a little update on what we
> have done in the meantime/over the previous weeks:
> We have basically followed your suggestions/ideas and redesigned our
> system/directory structure the following way: we now use a (dynamic)
> hash to partition the data into 50 chunks/directories (and got rid of
> the millions of directories we had before...)
> Unfortunately, Hadoop does not yet support atomic appends :-( As a
> workaround, we are using the copy&Merge function to consolidate all the
> small chunks/files that arrive from our crawlers in a random and
> concurrent fashion.
> In our first approach/version, each crawler was responsible for
> copy&Merging preexisting data with its own chunk (provided none of the
> other crawlers was doing the same job (hash) simultaneously). The major
> drawback of this solution was excessive traffic between the DFS cluster
> and the crawlers, since copy&Merge is implemented on the client side
> rather than on the server/cluster side :-(
> To circumvent this problem, we have set up a daemon close to the
> cluster which performs the consolidation task periodically. By
> consolidating the small files into 64 MB chunks, we have reached a
> pretty good avg. MapRed throughput of 10GB/20min (~8MB/s).
> We were also thinking of other approaches, such as the following: each
> crawler opens its own stream and does not close it until it exceeds 64
> MB or so... Unfortunately, if a crawler crashes without having closed
> its stream properly, all the data written to the stream before the
> crash is lost. We have also noticed that the flush method does not
> really flush the data onto the DFS...
> New ideas and suggestions are always welcome. Thanks in advance and
> have a great weekend!
> Cu on the 'net,
>                         Bye - bye,
>                                    <<<<< André <<<< >>>> érdnA >>>>>
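
For what it's worth, the consolidation daemon you describe can be little more
than a periodic FileUtil.copyMerge pass. A rough, untested sketch (the
50-bucket layout and the paths are assumptions about your setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Untested sketch: merge each bucket of small incoming files into one
// large file and delete the sources.  All paths here are assumptions.
public class Consolidator {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (int bucket = 0; bucket < 50; bucket++) {
      Path src = new Path("/crawl/incoming/" + bucket);
      Path dst = new Path("/crawl/merged/" + bucket + "-"
                          + System.currentTimeMillis());
      FileUtil.copyMerge(fs, src, fs, dst, true, conf, null);
    }
  }
}

Since the daemon runs next to the cluster, the merged bytes never have to
cross the crawler links, which is where your earlier version lost time.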
> Joydeep Sen Sarma wrote:
>> Would it help if the MultiFileInputFormat bundled files into splits based on
>> their location? (wondering if remote copy speed is a bottleneck in the map)
>> If you are going to access the files many times after they are generated,
>> writing a job to bundle the data once upfront may be worthwhile.
> Ted Dunning wrote:
>> I think that would help some, but the real problem for high performance is
>> disorganized behavior of the disk head.  If the MFIFormat could organize
>> files according to disk location as well and avoid successive file opens,
>> you might be OK, but that is asking for the moon.
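
That said, a crude version of the location idea is not hard: group the small
files by the host that holds their first block and hand each group to one
map. An untested sketch (it relies on FileSystem.getFileBlockLocations, which
only more recent Hadoop releases expose):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Untested sketch: bucket the files in a directory by the host holding
// their first block, so work can be scheduled (or merged) node-locally.
public class LocationGrouper {
  public static Map<String, List<Path>> groupByFirstHost(FileSystem fs,
      Path dir) throws IOException {
    Map<String, List<Path>> byHost = new HashMap<String, List<Path>>();
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) {
        continue;
      }
      BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
      String host = (blocks != null && blocks.length > 0
                     && blocks[0].getHosts().length > 0)
                    ? blocks[0].getHosts()[0] : "unknown";
      List<Path> group = byHost.get(host);
      if (group == null) {
        group = new ArrayList<Path>();
        byHost.put(host, group);
      }
      group.add(stat.getPath());
    }
    return byHost;
  }
}

Whether that beats simply merging the files first is an open question; the
disk-seek problem above does not go away.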
> Bob Futrelle wrote:
>> In some cases, it should be possible to concatenate many files into one.
>> If necessary, a unique byte sequence could be inserted at each "seam".
>>  - Bob Futrelle
> Doug Cutting wrote:
>> Instead of organizing output into many directories you might consider
>> using keys which encode that directory structure.  Then mapreduce can
>> use these to partition output.  If you wish to mine only a subset of
>> your data, you can process just those partitions which contain the
>> portions of the keyspace you're interested in.
>> Doug
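
Doug's keyed layout is easy to wire up: if the keys look like
"bucket/site/path" (an assumed scheme, not anything the framework mandates),
a partitioner that routes on the leading component keeps each subset of the
keyspace in its own output partition. Untested sketch:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Untested sketch: partition on the leading "directory" component of the
// key, so one partition holds one subset of the keyspace.
public class KeyPrefixPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {}

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    int slash = k.indexOf('/');
    String prefix = (slash < 0) ? k : k.substring(0, slash);
    return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

Mining a subset then means reading only the output files for the partitions
you care about.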
> Ted Dunning wrote:
>> A fairly standard approach to a problem like this is for each of the
>> crawlers to write multiple items to separate files (one per crawler).  This
>> can be done using a map-reduce structure by having the map input be items to
>> crawl and then using the reduce mechanism to gather like items together.
>> This can give you the desired output structure, especially if the
>> reducers are careful about naming their outputs well.
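
And to make that last paragraph concrete: the gathering reduce itself is
trivial; all the leverage is in how you key and partition the data. Untested
sketch:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch: the map side emits (site-or-bucket key, page); the
// reducer writes all pages for one key as one contiguous run of records.
public class GatherReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> pages,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (pages.hasNext()) {
      output.collect(key, pages.next());
    }
  }
}

Each reduce task then leaves behind one large, sorted output file per
partition instead of thousands of crawler-sized fragments.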
