From: Scott Carey
To: general@hadoop.apache.org
Date: Tue, 15 Sep 2009 16:59:45 -0700
Subject: Re: hadoop scales but is not performant?
Some more comments:

How are you controlling your graph of different map-reduce dependencies? Doing this linearly will leave the cluster rather underutilized. Using Pig, Cascading, or the job control interface to schedule this all (any of which can submit independent jobs in parallel), combined with the Fair Scheduler, will usually have very good results on smaller clusters in reducing overall batch latency and increasing cluster utilization. This often allows setting the number of reduces for each job more optimally as well.

Although Pig will be a bit less performant than an optimized M/R job, for tasks like yours it can possibly go much faster with a lot less code, by optimizing your M/R passes and scheduling, and by providing some alternate join strategies out of the box if used right.

On 9/15/09 4:02 PM, "Scott Carey" wrote:

It's not entirely invalid at this scale, though to be a better test the dataset should generally be closer in size to (or larger than) the total RAM available in the cluster.

Hadoop should still do well at this size relative to a single process. If it is taking 45 minutes on a 10-machine cluster, it should be faster than any single process. If it were limited primarily by startup costs, then increasing the data size wouldn't increase the time almost linearly.

It can probably be significantly tuned. Some tips for jobs this size:

Watch the CPU / RAM / IO on a node while the job is running. Is it being well utilized? Watch the jobs in the HTML UI on the jobtracker.
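To illustrate the dependency-graph point above: independent jobs at the same level of the DAG can be submitted concurrently (as JobControl, Pig, or Cascading would do) instead of strictly one after another. A minimal in-process sketch in plain Ruby, with hypothetical job names standing in for real M/R submissions:

```ruby
# Sketch: run a small job DAG level by level, launching all jobs whose
# prerequisites are done at the same time, rather than strictly serially.
# Job names are hypothetical; Thread.new stands in for submitting a job
# and waiting for it to complete.
deps = {
  'phrase_count' => [],
  'doc_freq'     => [],
  'join_stats'   => ['phrase_count', 'doc_freq'],
}

done = []
until done.size == deps.size
  # jobs not yet run whose prerequisites have all finished
  runnable = deps.keys.select { |j| !done.include?(j) && (deps[j] - done).empty? }
  threads = runnable.map do |job|
    Thread.new { sleep 0.01 } # stand-in for a running M/R job
  end
  threads.each(&:join)
  done.concat(runnable)
end

puts done.inspect  # join_stats runs only after both of its inputs
```

Here 'phrase_count' and 'doc_freq' run in the same wave, halving the wall-clock depth of this toy graph; a serial driver script would run all three back to back.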
Are the total number of configured slots being used most of the time? Or is the cluster spending a lot of clock time with a lot of empty slots? If so, which slots? Are maps or reduces taking the most time?

Hadoop is currently a slow scheduler for tasks (it can schedule at most one task per second or two, per node). Using the Fair Scheduler and enabling some options can turn this up to one map and one reduce scheduled per 'tick'. There are some big changes in the schedulers due out in 0.21 that will significantly help here. For smaller, low-latency jobs this can make a big difference.

Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but they can especially hurt small clusters with many small map tasks. Look at the log output of your reduce tasks (in the jobtracker UI) and see how long the shuffle phase is taking.

There is a big change due in 0.21 that makes this a LOT faster for some cases, and low-latency smaller jobs will benefit a lot too: https://issues.apache.org/jira/browse/MAPREDUCE-318

See my comment in that ticket from the 10th of June in relation to shuffle times. A one-line change in 0.19.x and 0.20.x cut times in half on smaller low-latency jobs on smaller clusters when limited by shuffle inefficiencies (specifically, fetching no more than one map output from a given host every few seconds, for no good reason).

Tuning several Hadoop parameters might help a great deal as well. Cloudera's configuration wizard is a good starting point to help you find which knobs are more important.

You probably also want to do some work to figure out what is taking time in your local tests. There might be something wrong there. It seems suspicious to me, even at the one-document size, that it would take 2+ minutes versus less than one second. Startup costs aren't that big for 2 maps and 2 reduces: on the order of a few seconds, not minutes. A few seconds times a few jobs is still not that much.
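For reference, the Fair Scheduler options mentioned above are enabled via mapred-site.xml in 0.19/0.20-era Hadoop. A sketch, assuming the fair scheduler contrib jar is on the jobtracker classpath and treating the allocation-file path as a placeholder (mapred.fairscheduler.assignmultiple is the option that lets a map and a reduce be assigned on the same heartbeat):

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- assign both a map and a reduce slot per scheduling 'tick' -->
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
<property>
  <!-- path is a placeholder; point at your pools/allocations file -->
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
```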
On 9/14/09 6:14 PM, "Mat Kelcey" wrote:

hi all,

recently i've started playing with hadoop and my first learning experiment has surprised me

i'm implementing a text problem; trying to extract infrequent phrases using a mixture of probabilistic models. it's a bit of a toy problem so the algorithm details aren't super important, though the implementation might be....

this problem ended up being represented by a dozen or so map reduce jobs of various types, including some aggregate steps and some manual joins. (see the bottom of this page for details http://matpalm.com/sip/take3_markov_chains.html)

i've implemented each step using ruby / streaming, the code is at http://github.com/matpalm/sip if anyone cares.

i ran some tests using 10 ec2 medium-cpu instances across a small 100mb of gzipped text. for validation of the results i also reimplemented the entire algorithm as a single-threaded ruby app.

my surprise comes from finding that the ruby implementation outperforms the 10 ec2 instances at this data size... i ran a few samples of different sizes, with the graph at the bottom of http://matpalm.com/sip/part4_but_does_it_scale.html

so why is this? here are my explanations, in order of how confident i am...

a) 100mb is peanuts and hadoop was made for 1000x this size, so the test is invalid.
b) there is a better representation of this problem that uses fewer map/reduce passes.
c) streaming is too slow and rewriting in java (and making use of techniques like chaining mappers) would speed things up.
d) doing these steps, particularly the joins, in pig would be faster.

my next steps are to rewrite some of the steps in pig to sample the difference

does anyone have any high level comments on this?

cheers,
mat
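On option (c): the overhead of streaming comes from serializing every key/value pair as text over stdin/stdout between Hadoop and the Ruby process, plus Hadoop's sort between the phases. The shape of a streaming step can be exercised locally without Hadoop at all; a minimal word-count sketch (not Mat's actual code) that wires up the same map / sort-group / reduce pipeline in-process:

```ruby
# Minimal streaming-style word count, runnable without Hadoop.
# Hadoop Streaming would run map and reduce as separate processes,
# with its own sort/shuffle in between; here the same three stages
# are chained in one process for illustration.

def map_phase(lines)
  # emit one (word, 1) pair per token, as a streaming mapper would
  lines.flat_map { |l| l.downcase.scan(/[a-z']+/).map { |w| [w, 1] } }
end

def reduce_phase(pairs)
  # pairs arrive grouped by key, as they would after Hadoop's sort
  pairs.sort_by(&:first)
       .chunk_while { |a, b| a.first == b.first }
       .map { |grp| [grp.first.first, grp.sum(&:last)] }
end

counts = reduce_phase(map_phase(["to be or not to be"]))
puts counts.inspect  # => [["be", 2], ["not", 1], ["or", 1], ["to", 2]]
```

Timing map_phase and reduce_phase alone against the full streaming job is one way to separate per-record serialization cost from Hadoop's per-job startup and shuffle cost.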