hadoop-user mailing list archives

From Mahsa Mofidpoor <mofidp...@gmail.com>
Subject Re: running a job on single-node setup takes less time than running on a cluster
Date Thu, 23 Aug 2012 16:19:06 GMT
Thank you very much.

On Tue, Aug 21, 2012 at 11:46 PM, nagarjuna kanamarlapudi <
nagarjuna.kanamarlapudi@gmail.com> wrote:

> Dear Mahsa,
>
> Yes, what you have observed is expected.
> On a single-node cluster everything is local: there is no network transfer,
> and the other costs of distribution vanish. Try increasing the data size and
> you will see the effect of parallel JVMs on the job time.
>
> In your single-node cluster, you have one JVM and everything is local.
> In a multi-node cluster, there are multiple JVMs, and the mapper output has to
> be copied to the reducers (network transfer).
>
> Comparing the two situations, your small data set probably did not reach
> the threshold at which you would observe the benefit of a multi-node cluster.
>
> Try increasing the data size and you will see wonders. I have worked with
> terabytes of data for table joins, and it worked amazingly well.
>
>
>
> On Tue, Aug 21, 2012 at 12:01 AM, Mahsa Mofidpoor <mofidpoor@gmail.com> wrote:
>
>> Thanks, Saurabh
>>
>>
>> On Mon, Aug 20, 2012 at 12:15 PM, Saurabh bhutyani <s4saurabh@gmail.com> wrote:
>>
>>> Dear Mahsa,
>>>
>>> You need to increase the data size to benefit from Hadoop. Hadoop creates
>>> input splits based on a configured size, which defaults to 64 MB. So if
>>> each of your input files is smaller than 64 MB, it gets a single split and
>>> therefore a single map task.
>>>
>>> Thanks & Regards,
>>> Saurabh Bhutyani
>>>
>>> Call  : 9820083104
>>> Gtalk: s4saurabh@gmail.com
>>>
>>>
>>>
>>> On Mon, Aug 20, 2012 at 6:33 PM, Mahsa Mofidpoor <mofidpoor@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I ran a simple join (select col_list from table1 join table2 on
>>>> (join_condition)) on both a single-node and a multi-node setup. The table
>>>> sizes are 1.7 MB and 4.2 MB respectively. It takes more time to execute
>>>> the query on the cluster than to run it on a single-node Hadoop setup.
>>>> I checked the map logs and saw that both map tasks ran on the master
>>>> node.
>>>> Do I need to increase the data in order to benefit from the multi-node
>>>> capacity?
>>>> How can I make sure that my data is distributed on all the nodes?
>>>>
>>>> Thank you in advance for your assistance.
>>>>
>>>> Regards,
>>>> Mahsa
>>>>
>>>
>>>
>>
>
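
For readers of this thread, the split arithmetic behind the 64 MB figure
quoted in the thread can be sketched in a few lines of Python. This is only a
rough illustration, not Hadoop's actual split logic: it assumes one split per
64 MB of each input file (plain FileInputFormat-style behaviour), ignores the
small slack Hadoop allows on the last split, and ignores input formats that
combine small files. The function name estimate_map_tasks and the larger
example sizes are hypothetical.

```python
import math

# 64 MB is the default split size quoted in the thread (assumed here to
# equal the HDFS block size of the cluster in question).
SPLIT_SIZE_MB = 64


def estimate_map_tasks(file_sizes_mb, split_size_mb=SPLIT_SIZE_MB):
    """Rough estimate: each non-empty file yields ceil(size / split) map tasks."""
    return sum(
        math.ceil(size / split_size_mb)
        for size in file_sizes_mb
        if size > 0
    )


# The two tables from the original question: one map task each.
print(estimate_map_tasks([1.7, 4.2]))

# Hypothetical tables 1000x larger: enough splits to spread over many nodes.
print(estimate_map_tasks([1700.0, 4200.0]))
```

Under these assumptions the two tables in the question produce only two map
tasks in total, so there is very little work to spread across nodes, while
the job still pays the startup and shuffle costs of the cluster.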
