Subject: Re: master performance
From: stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Mon, 1 Jun 2009 20:53:17 -0700

And check that you have block caching enabled on your .META. table. Do
"describe '.META.'" in the shell. It's on by default, but maybe you migrated
from an older version or something else got in the way of its working.

St.Ack
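A minimal sketch of running that check from the shell is below; the exact
describe output and attribute spelling vary a little between releases, so
treat the BLOCKCACHE line as illustrative rather than exact:

    $ bin/hbase shell
    hbase> describe '.META.'
    # In the output, the .META. column families should show something like
    # BLOCKCACHE => 'true'. A 'false' here (for example, left over from an
    # old migration) means lookups of region locations keep going to the
    # store files instead of being served from cache.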
On Mon, Jun 1, 2009 at 8:36 PM, stack wrote:

> What Ryan said, and then can you try the same test after a major compaction?
> Does it make a difference? You can force it in the shell by doing
> "hbase> major_compact '.META.'" IIRC (type 'tools' in the shell to get the
> help syntax). What size are your jobs? Short-lived? Seconds or minutes? Each
> job needs to build up a cache of region locations. To do this, it's a trip
> to .META. Longer-lived jobs will save on trips to .META. Also, take a thread
> dump when it's slow ("kill -QUIT PID_OF_MASTER") and send it to us. Do it a
> few times. We'll take a look.
>
> Should be better in 0.20.0, but maybe there are a few things we can do in
> the meantime.
>
> St.Ack
>
> On Mon, Jun 1, 2009 at 5:31 PM, Jeremy Pinkham wrote:
>
>> sorry for the novel...
>>
>> I've been experiencing some problems with my hbase cluster and am hoping
>> someone can point me in the right direction. I have a 40 node cluster
>> running 0.19.0. Each node has 4 cores, 8GB (2GB dedicated to the
>> regionserver), and a 1TB data disk. The master is on a dedicated machine
>> separate from the namenode and the jobtracker. There is a single table
>> with 4 column families and 3700 regions evenly spread across the 40 nodes.
>> The TTLs match our loading pace well enough that we don't typically see
>> too many splits anymore.
>>
>> In trying to troubleshoot some larger issues with bulk loads on this
>> cluster, I have created a test scenario to try to narrow down the problem
>> based on various symptoms. This test is a map/reduce job that uses the
>> HRegionPartitioner (as an easy way to generate some traffic to the master
>> for metadata). I've been running this job with various size inputs to
>> gauge the effect of different numbers of mappers, and have found that as
>> the number of concurrent mappers creeps up to what I think are still small
>> numbers (<50 mappers), the performance of the master is dramatically
>> impacted. I'm judging the performance here simply by checking the response
>> time of the UI on the master, since that has historically been a good
>> indication of when the cluster is getting into trouble during our loads
>> (which I'm sure could mean a lot of things), although I suppose it's
>> possible the two are unrelated.
>>
>> The UI normally takes about 5-7 seconds to refresh master.jsp. Running a
>> job with 5 mappers doesn't seem to impact it too much, but a job with 38
>> mappers makes the UI completely unresponsive for anywhere from 30 seconds
>> to several minutes during the run. During this time, there is nothing
>> happening in the logs, scans/gets from within the shell continue to work
>> fine, and ganglia/top show the box to be virtually idle. All links off of
>> master.jsp work fine, so I presume it's something about the master pulling
>> info from the individual nodes, but those UIs are also perfectly
>> responsive.
>>
>> This same cluster used to run on just 20 nodes without issue, so I'm
>> curious whether I've crossed some threshold of horizontal scalability,
>> whether there is just a tuning parameter that I'm missing that might take
>> care of this, or whether there is something known between 0.19.0 and
>> 0.19.3 that might be a factor.
>>
>> Thanks
>>
>> jeremy
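For anyone reading this in the archives, a minimal sketch of grabbing the
thread dumps mentioned above; finding the PID via jps and the location of the
.out file are assumptions about a stock install, so adjust to your layout:

    # Find the HMaster process id (jps ships with the JDK).
    jps | grep HMaster

    # SIGQUIT asks the JVM for a full thread dump without stopping the
    # process. The dump goes to the master's stdout, which the hbase start
    # scripts normally redirect to something like
    # logs/hbase-<user>-master-<hostname>.out.
    kill -QUIT PID_OF_MASTER

Repeating this a few times while the UI is unresponsive makes it easier to
see whether the same threads stay blocked between dumps.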