Re "Amazon predictability", did you guys see this recent paper:
http://people.csail.mit.edu/tromer/cloudsec/
Also, some additional background on "noisy neighbor effects":
http://bit.ly/4O7dHx
http://bit.ly/8zPvQd
Some interesting bits of information in there.
Patrick
Something Something wrote:
> Here are some of the answers:
>
>>> How many concurrent reducers run on each node? Default two?
> I was assuming 2 on each node would be the default. If not, this could be a
> problem. Please let me know.
>
>>> I'd suggest you spend a bit of time figuring where your MR jobs
> are spending their time?
> I agree. Will do some more research :)
>
>>> How much of this overall time is spent in reduce phase?
> Mostly time is spent in the Reduce phases, because that's where most of the
> critical code is.
>
>>> Are inserts to a new table?
> Yes, all inserts will always be in a new table. In fact, I disable/drop
> HTables during this process. Not using any special indexes, should I be?
>
>>> I'm a little surprised that all worked on the small instances, that your
> jobs completed.
> But, really, shouldn't Amazon guarantee predictability :) After all I am
> paying for these instances.. albeit a small amount!
>
>>> Are you opening a new table inside each task or once up in the config?
> I open HTable in the 'setup' method for each mapper/reducer, and close table
> in the 'cleanup' method.
>
>>> You have to temper the above general rule with the fact that...
> I will try a few combinations.
>
>>> How big is your dataset?
> This one in particular is not big, but the real production ones will be.
> Here's approximately how many rows get processed:
> Phase 1: 300 rows
> Phase 2 thru 8: 100 rows.
> (Note: Each phase does complex calculations on the row.)
>
> Thanks for the help.
>
>
> On Tue, Jan 26, 2010 at 10:36 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> How big is your dataset?
>>
>> J-D
>>
>> On Tue, Jan 26, 2010 at 8:47 AM, Something Something
>> <mailinglists19@gmail.com> wrote:
>>> I have noticed some strange performance numbers on EC2. If someone can give
>>> me some hints to improve performance that would be greatly appreciated.
>>> Here are the details:
>>>
>>> I have a process that runs a series of Jobs under Hadoop 0.20.1 & HBase
>>> 0.20.2. I ran the *exact* same process with the following configurations:
>>>
>>> 1) 1 Master & 4 Workers (*c1.xlarge* instances) & 1 Zookeeper (*c1.medium*)
>>> with *8 Reducers* for every Reduce task. The process completed in *849*
>>> seconds.
>>>
>>> 2) 1 Master, 4 Workers & 1 Zookeeper *ALL m1.small* instances with
>>> *8 Reducers* for every Reduce task. The process completed in *906* seconds.
>>>
>>> 3) 1 Master, *11* Workers & *3* Zookeepers *ALL m1.small* instances with
>>> *20 Reducers* for every Reduce task. The process completed in *984* seconds!
>>>
>>>
>>> Two main questions:
>>>
>>> 1) It's totally surprising that when I have 11 workers with 20 Reducers it
>>> runs slower than when I have fewer machines of exactly the same type with
>>> fewer reducers.
>>> 2) As expected it runs faster on c1.xlarge, but the performance improvement
>>> doesn't justify the high cost difference. I must not be utilizing the
>>> machine power, but I don't know how to do that.
>>>
>>> Here are some of the performance improvement tricks that I have learnt from
>>> this mailing list in the past that I am using:
>>>
>>> 1) conf.set("hbase.client.scanner.caching", "30"); I have this for all
>>> jobs.
>>>
>>> 2) Using the following code every time I open a HTable:
>>> this.table = new HTable(new HBaseConfiguration(), "tablenameXYZ");
>>> table.setAutoFlush(false);
>>> table.setWriteBufferSize(1024 * 1024 * 12);
>>>
>>> 3) For every Put I do this:
>>> Put put = new Put(Bytes.toBytes(out));
>>> put.setWriteToWAL(false);
>>>
>>> 4) Change the No. of Reducers as per the No. of Workers. I believe the
>>> formula is: # of workers * 1.75.
>>>
>>> Any other hints? As always, greatly appreciate the help. Thanks.
>>>
>
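
On the reducer-count rule in item 4 of the original message: as far as I
recall, the Hadoop documentation states the rule of thumb as 0.95 or 1.75
multiplied by (number of nodes * mapred.tasktracker.reduce.tasks.maximum), so
the reduce slots per node are part of the formula, not just the worker count.
A minimal sketch of that arithmetic follows. The class and method names are
made up for illustration, and the value of 2 reduce slots per node is an
assumption based on the default discussed in this thread, not something
verified on the cluster.

```java
public class ReducerCount {

    /**
     * Suggested number of reduce tasks under the rule of thumb
     * factor * workerNodes * reduceSlotsPerNode, rounded down.
     * Hypothetical helper for illustration; not part of Hadoop.
     */
    static int suggest(int workerNodes, int reduceSlotsPerNode, double factor) {
        if (workerNodes <= 0 || reduceSlotsPerNode <= 0 || factor <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        int n = (int) Math.floor(factor * workerNodes * reduceSlotsPerNode);
        return Math.max(1, n); // always run at least one reducer
    }

    public static void main(String[] args) {
        // Assumption: 2 reduce slots per node.
        System.out.println(suggest(4, 2, 1.75));   // 4 workers
        System.out.println(suggest(11, 2, 1.75));  // 11 workers
        System.out.println(suggest(11, 2, 0.95));  // 11 workers, single wave
    }
}
```

The result would then be passed to job.setNumReduceTasks(n). With only a few
hundred rows per phase, though, the reducer count may matter much less than
per-task startup overhead, so treat this as a starting point for the
"few combinations" mentioned above rather than a fix.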