hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: write and sort performance
Date Fri, 08 Jun 2007 22:47:52 GMT

First of all, Hadoop is not optimized for small clusters or small bursts 
of writes/reads. There are some costs (like storing a copy locally and 
copying it locally) that don't have benefits for small clusters with .

You could try using different disks (not just partitions) for tmp 
directory for Maps and for Datanode.
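
One quick way to check where those directories actually live is to ask df which device backs each path. This is only a sketch: the paths in the usage comment are hypothetical, and it assumes mapred.local.dir and dfs.data.dir are the properties that hold your Map tmp and Datanode directories. Note that df reports the mounted device, so two partitions of one disk still show up under two names and you have to compare the underlying disk by eye.

```shell
# Print the device that backs a directory, using POSIX-format df output.
backing_device() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Report the backing device for each directory given as an argument.
report_devices() {
    for dir in "$@"; do
        printf '%s -> %s\n' "$dir" "$(backing_device "$dir")"
    done
}

# Usage (hypothetical paths; substitute your own settings):
#   report_devices /disk1/hadoop/mapred/local /disk2/hadoop/dfs/data
```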

To compare a single-node write with Hadoop, you should run 'bin/hadoop 
dfs -copyFromLocal - test' and pipe your dd command output there. Maybe you 
will see 25% of the 75MB you saw with the native write. That is not unexpected. 
I'm not sure whether you want to know all the details of why that is so. In your 
test you also have many other one-time costs, such as starting and stopping jobs.
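
A rough sketch of that comparison follows. It assumes the file system shell is invoked as 'bin/hadoop dfs', that a 1024 MB test size is representative, and that 'test' is a free target name; timed_write and mb_per_sec are hypothetical helper names. Timing is in whole seconds, so it is only meaningful for writes that take a while.

```shell
# Generate size_mb megabytes of zeros, pipe them into the command given as
# the remaining arguments, and print the elapsed wall-clock seconds.
timed_write() {
    size_mb=$1; shift
    start=$(date +%s)
    # The sink's own stdout is sent to stderr so only the timing is printed.
    dd if=/dev/zero bs=1048576 count="$size_mb" 2>/dev/null | "$@" >&2
    status=$?
    end=$(date +%s)
    echo "$((end - start))"
    return "$status"
}

# Integer MB/s from a size in MB and elapsed seconds (treats 0 s as 1 s).
mb_per_sec() {
    secs=$2
    [ "$secs" -gt 0 ] || secs=1
    echo "$(( $1 / secs ))"
}

# Usage (assumed size and target name):
#   native=$(timed_write 1024 dd of=/tmp/native.out bs=1048576)
#   hdfs=$(timed_write 1024 bin/hadoop dfs -copyFromLocal - test)
#   echo "native: $(mb_per_sec 1024 "$native") MB/s"
#   echo "hdfs:   $(mb_per_sec 1024 "$hdfs") MB/s"
```

The native number here includes no sync to disk, so it may flatter the local write; treat both figures as rough.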

I don't mean to say Hadoop can't do better; its performance is steadily 
improving. But your expectations for a toy application might be off.

If you want to figure out what the problem could be, you could start 
with the 'copyFromLocal' example above. Here you need to figure out what the 
Datanode process and the Hadoop shell are doing at various times (maybe with 
stack traces).
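
One way to collect those stack traces is to send SIGQUIT to the JVM a few times while the copy is running; the JVM keeps running and prints a dump of all its threads to its own standard output (for a daemon, usually its .out log file). The sketch below is an assumption about how you might script that: dump_stacks is a hypothetical helper, and the jps lookup in the usage comment assumes the JDK tools are on your PATH and that the daemon is listed as "DataNode".

```shell
# Send SIGQUIT to a JVM `count` times, `interval` seconds apart.
# Each signal asks the JVM to print a thread dump to its own stdout.
dump_stacks() {
    pid=$1
    count=${2:-5}
    interval=${3:-2}
    i=0
    while [ "$i" -lt "$count" ]; do
        kill -QUIT "$pid" || return 1
        i=$((i + 1))
        # Pause between dumps, but not after the last one.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Usage (assumed process name):
#   pid=$(jps | awk '$2 == "DataNode" { print $1 }')
#   dump_stacks "$pid" 5 2
```

Comparing several dumps taken a couple of seconds apart shows which threads stay in the same place, which is usually where the time is going.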


Bwolen Yang wrote:
>> Please try Hadoop 0.13.0.  I don't know whether it will address your
>> concerns, but it should be faster and is much closer to what developers
>> are currently working on.
> ok. It would also be good to see how the DFS upgrade goes between versions.
> (looks like it got released today.  cool.)
>> For such a small cluster you'd probably be better running the jobtracker
>> and namenode on the same node and gain another slave.
> When namenode and jobtracker were running on the same machine, I
> notice failures due to losing contact with jobtracker.  This is why I
> split the machines.
> With regard to the performance details, it is really independent of
> how many slaves I have.   The test is mainly trying to see how closely
> Hadoop compares to a single node or scp, and what the tuning
> parameters are to make things run faster.
> Any suggestions on java profiling tools?
> bwolen
