From: Jean-Daniel Cryans
To: user@hbase.apache.org
Date: Fri, 10 Aug 2012 15:46:31 -0700
Subject: Re: Extremely long flush times

Can you jstack while this happens? Anything weird in the datanode log?
Is the load high during that time?

Thx,

J-D

On Fri, Aug 10, 2012 at 3:12 PM, Espinoza,Carlos wrote:
> We were able to replicate this behavior in a pseudo-distributed HBase
> (hbase-0.94.1) environment. We wrote a test program that creates a test
> table "MyTestTable" and populates it with random rows, then it creates a
> row with 60,000 columns and repeatedly updates it. Each column has an
> 18-byte qualifier and a 50-byte value. In our tests, when we ran the
> program, we usually never got beyond 15 updates before it would flush
> for a really long time. The rows that are being updated are about 4MB
> each (minus any HBase metadata).
>
> It doesn't seem like it's caused by GC. I turned on gc logging, and
> didn't see any long pauses. This is the gc log during the flush.
> http://pastebin.com/vJKKXDx5
>
> This is the regionserver log with debug on during the same flush.
> http://pastebin.com/Fh5213mg
>
> This is the test program we wrote.
> http://pastebin.com/aZ0k5tx2
>
> You should be able to just compile it and run it against a running
> HBase cluster.
> $ java TestTable
>
> Carlos
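The pastebin link above holds the actual test program. As a rough
illustration of the pattern Carlos describes, a minimal sketch might look
like the following; only the table name, column count, and qualifier/value
sizes come from the thread, while the column family "f", the row key, and
the number of updates are assumptions made for the sketch:

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowUpdateSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "MyTestTable"); // table name from the thread
        byte[] family = Bytes.toBytes("f");             // assumed column family
        byte[] row = Bytes.toBytes("wide-row");         // assumed row key
        Random rnd = new Random();

        // Rewrite the same ~60,000-column row over and over: 18-byte
        // qualifiers and 50-byte values, so each Put carries roughly
        // 60,000 * 68 bytes (~4 MB), matching the row size reported above.
        for (int update = 0; update < 20; update++) {
            Put put = new Put(row);
            for (int col = 0; col < 60000; col++) {
                byte[] qualifier = Bytes.toBytes(String.format("qualifier-%08d", col)); // 18 bytes
                byte[] value = new byte[50];
                rnd.nextBytes(value);
                put.add(family, qualifier, value);
            }
            table.put(put);
            System.out.println("update " + update + " done");
        }
        table.close();
    }
}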
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Monday, August 06, 2012 8:04 PM
> To: user@hbase.apache.org
> Subject: Re: Extremely long flush times
>
> Hi Ron,
>
> I think this is caused by GC, for example there's 7 minutes between the
> last blocking message and the end of the flush:
>
> 2012-08-06 11:44:36,348 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC
> Server handler 42 on 9009' on region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6.:
> memstore size 271.9m is >= than blocking 256.0m size
> 2012-08-06 11:51:28,139 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
> ~155.4m/162905016, currentsize=116.5m/122172880 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 501337ms, sequenceid=1792264092, compaction requested=true
>
> During that time I would expect that clients got socket timeouts, tried
> to write again, and data would just pile up in the RPC queue.
>
> Do you have the gc log handy? Also did you set the zookeeper session
> timeout higher than the default of 3 minutes?
>
> Thx,
>
> J-D
>
> On Mon, Aug 6, 2012 at 2:37 PM, Buckley,Ron wrote:
>> We ran into this situation today on our production instance. I'm
>> wondering if anyone has seen or knows a way around it.
>>
>> We were repeatedly updating the same 60,000-column-wide row. With
>> each update, we were updating nearly every column of the row. Each
>> update generated nearly 9 MB in the WAL.
>>
>> This worked OK most of the time. We saw frequent messages like this in
>> the region server log, indicating a 2.6 second flush time:
>>
>> 2012-08-06 11:42:52,316 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~155.4m/162931400, currentsize=2.7k/2760 for region
>> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
>> 2610ms, sequenceid=1792263953, compaction requested=false
>>
>> However, sometimes we would see flush times of around 8 - 10 minutes.
>>
>> 2012-08-06 11:51:28,139 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~155.4m/162905016, currentsize=116.5m/122172880 for region
>> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
>> 501337ms, sequenceid=1792264092, compaction requested=true
>>
>> During these long flushes, we would end up taking all the ipc handler
>> slots into that region server that we had configured (50). It looked
>> to our application like the region server was unavailable.
>>
>> After the long delay, the region server would continue running fine,
>> but then do it again a few minutes later.
>>
>> I thought it was most likely something with Garbage Collection, so I
>> got a stack trace during one of the events. The trace doesn't look
>> like GC was running.
>>
>> http://pastebin.com/ANBiUzX4 - Stack Trace when the region server was
>> unreachable
>>
>> http://pastebin.com/naTYQxZR - Region Server Log during the same
>> interval
>>
>> Versions:
>>
>> HBase 0.92.1, r1298924
>>
>> Hadoop 0.20.2-cdh3u3
>>
>> ----------------------------------------------------------------------
>>
>> Ron Buckley
>> x6365
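A footnote on the two settings discussed above: the memstore blocking
threshold behind the "is >= than blocking 256.0m size" message, and the
ZooKeeper session timeout J-D asks about. A minimal sketch of where they
live follows; the property names are standard HBase configuration keys,
the defaults shown are what I believe shipped with the 0.92/0.94-era
releases in this thread, and in practice both are normally set in
hbase-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushSettingsSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Updates to a region are blocked once its memstore reaches
        // hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier
        // (128 MB * 2 = 256 MB with the assumed defaults), which is the
        // "blocking 256.0m size" seen in the log lines quoted above.
        long flushSize = conf.getLong("hbase.hregion.memstore.flush.size", 134217728L);
        int multiplier = conf.getInt("hbase.hregion.memstore.block.multiplier", 2);
        System.out.println("updates block at ~" + (flushSize * multiplier) + " bytes per region");

        // The ZooKeeper session timeout J-D mentions (default 3 minutes here);
        // a region server that stalls longer than this loses its session.
        int zkTimeout = conf.getInt("zookeeper.session.timeout", 180000);
        System.out.println("zookeeper.session.timeout = " + zkTimeout + " ms");
    }
}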