hadoop-user mailing list archives

From nagarjuna kanamarlapudi <nagarjuna.kanamarlap...@gmail.com>
Subject Re: running a job on single-node setup takes less time than running on a cluster
Date Wed, 22 Aug 2012 03:46:16 GMT
Dear Mahsa,

Yes, what you have observed is expected.
On a single-node cluster everything is local: there is no network transfer,
and the rest of the distribution overhead vanishes. Try increasing the data
size and you will see the effect of parallel JVMs on the job time.

In your single-node cluster, you have one JVM and everything is local.
In a multi-node cluster, there are multiple JVMs, and the mapper output has
to be copied to the reducers (network transfer).

Comparing the two situations, your small data set probably did not reach
the threshold where you would observe the benefit of a multi-node cluster.

Try increasing the data size and you will see the difference. I have worked
on terabytes of data for table joins, and it worked very well.
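
To make the point about data size concrete, here is a rough sketch (not
Hadoop's actual split logic) that estimates how many map tasks an input would
get, assuming one map task per input split and the 64 MB default split size
mentioned in this thread. The function name and the simple ceiling rule are my
own simplifications; real input formats apply further details that are ignored
here.

```python
MB = 1024 * 1024
DEFAULT_SPLIT_SIZE = 64 * MB  # default split size mentioned in this thread


def estimate_map_tasks(file_sizes_bytes, split_size_bytes=DEFAULT_SPLIT_SIZE):
    """Rough estimate of map tasks: one task per split, splits never span files.

    Hypothetical helper for illustration only; it is a simplified model,
    not the rule Hadoop itself implements.
    """
    total = 0
    for size in file_sizes_bytes:
        if size > 0:
            # Ceiling division: a partially filled split still needs a task.
            total += -(-size // split_size_bytes)
    return total


# The two small tables from the question (1.7 MB and 4.2 MB).
small_tables = [int(1.7 * MB), int(4.2 * MB)]
print(estimate_map_tasks(small_tables))

# A hypothetical 1 TB input stored as a single file.
print(estimate_map_tasks([1024 * 1024 * MB]))
```

Under this simplified model the two small tables yield only 2 map tasks, so
there is almost nothing to run in parallel, while the hypothetical 1 TB input
yields 16384 tasks that can be spread across the nodes.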

On Tue, Aug 21, 2012 at 12:01 AM, Mahsa Mofidpoor <mofidpoor@gmail.com> wrote:

> Thanks, Saurabh.
> On Mon, Aug 20, 2012 at 12:15 PM, Saurabh bhutyani <s4saurabh@gmail.com> wrote:
>> Dear Mahsa,
>> You need to increase the data size to benefit from Hadoop. Basically,
>> Hadoop creates input splits based on the configured split size, the
>> default being 64 MB. So if your data size is less than 64 MB, it would
>> basically run only one map task per input file.
>> Thanks & Regards,
>> Saurabh Bhutyani
>> Call  : 9820083104
>> Gtalk: s4saurabh@gmail.com
>> On Mon, Aug 20, 2012 at 6:33 PM, Mahsa Mofidpoor <mofidpoor@gmail.com> wrote:
>>> Hello,
>>> I ran a simple join (select col_list from table1 join table2 on
>>> (join_condition)) on both single-node and multi-node setups. The table
>>> sizes are 1.7 MB and 4.2 MB respectively. It takes more time to execute
>>> the query on the cluster than to run it on a single-node Hadoop setup.
>>> I checked the map logs and saw that both map tasks ran on the master
>>> node.
>>> Do I need to increase the data in order to benefit from the multi-node
>>> capacity?
>>> How can I make sure that my data is distributed on all the nodes?
>>> Thank you in advance for your assistance.
>>> Regards,
>>> Mahsa
