From: "Billy Pearson"
To: hbase-user@hadoop.apache.org
Subject: Re: HBase Failing on Large Loads
Date: Wed, 10 Jun 2009 01:15:18 -0500
In-Reply-To: <860544ed0906091715j71a644ccu17eeee29f60fd69e@mail.gmail.com>

I think most of your problems come from running too many map/reduce tasks
at the same time with so little memory. Once the machines start swapping,
the regionservers/datanodes/tasktrackers do not get time to check in and
tell their masters they are still alive, and things start failing. I would
try 2 maps and 2 reduces per machine, maybe 4, with that little memory. I
run 3 mappers and 2 reducers per server on machines with 4 GB of memory,
with a 1 GB heap each for hbase/datanode/tasktracker and 400 MB per task.

Billy
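[For reference, a rough sketch of the settings Billy is describing. The
property and variable names below are the standard Hadoop/HBase 0.19 knobs;
the values (2 map and 2 reduce slots, 400 MB per task, 1 GB daemon heaps)
are taken from his numbers and are assumptions to adapt, not a tested
recommendation.]

<!-- conf/hadoop-site.xml (0.19 keeps everything in one site file);
     tasktrackers must be restarted for the slot limits to take effect -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>        <!-- concurrent map tasks per machine -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>        <!-- concurrent reduce tasks per machine -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx400m</value> <!-- heap for each map/reduce child JVM; can also be set per job -->
</property>

# conf/hadoop-env.sh and conf/hbase-env.sh: heap per daemon, in MB
export HADOOP_HEAPSIZE=1000
export HBASE_HEAPSIZE=1000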
"Bradford Stephens" wrote in message
news:860544ed0906091715j71a644ccu17eeee29f60fd69e@mail.gmail.com...

I ran some more tests to clarify my questions from above. After the same
MR job, 5 out of 8 of my Regionservers died before I terminated the job.

Here's what I saw in one of the HBase Regionserver logs...

Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink 192.168.18.48:50010

(with many different IPs...)

Then I get errors like this:

Error Recovery for block blk_-4108085472136309132_97478 in pipeline
192.168.18.49:50010, 192.168.18.48:50010, 192.168.18.16:50010: bad
datanode 192.168.18.48:50010

Then things continue for a while and I get this:

Exception while reading from blk_1698571189906026963_93533 of
/hbase-0.19/joinedcontent/2018887968/content/mapfiles/3048972636250467459/data
from 192.168.18.49:50010: java.io.IOException: Premeture EOF from inputStream

Then I start seeing stuff like this:

Error Recovery for block blk_3202913437369696154_99607 bad datanode[0]
nodes == null
2009-06-09 16:31:15,330 WARN org.apache.hadoop.hdfs.DFSClient: Could not
get block locations. Source file
"/hbase-0.19/joinedcontent/compaction.dir/2018887968/content/mapfiles/2166568776864749492/data"
- Aborting...
Exception in createBlockOutputStream java.io.IOException: Could not read
from stream
Abandoning block blk_-4592653855912358506_99607

And this...

DataStreamer Exception: java.io.IOException: Unable to create new block.

Then it eventually dies.

On Tue, Jun 9, 2009 at 11:51 AM, Bradford Stephens wrote:
> I sort of need the reduce since I'm combining primary keys from a CSV
> file. Although I guess I could just use the combiner class... hrm.
>
> How do I decrease the batch size?
>
> Also, I tried to make a map-only task that used ImmutableBytesWritable
> and BatchUpdate as the output K and V, and TableOutputFormat as the
> OutputFormat -- the job fails, saying that "HbaseMapWritable cannot be
> cast to org.apache.hadoop.hbase.io.BatchUpdate". I've checked my
> Mapper multiple times; it's definitely outputting a BatchUpdate.
>
> On Tue, Jun 9, 2009 at 10:43 AM, stack wrote:
>> On Tue, Jun 9, 2009 at 10:13 AM, Bradford Stephens <
>> bradfordstephens@gmail.com> wrote:
>>
>>> Hey rock stars,
>>
>> Flattery makes us perk up for sure.
>>
>>> I'm having problems loading large amounts of data into a table (about
>>> 120 GB, 250 million rows). My Map task runs fine, but when it comes to
>>> reducing, things start burning. 'top' indicates that I only have ~100M
>>> of RAM free on my datanodes, and every process starts thrashing...
>>> even ssh and ping. Then I start to get errors like:
>>>
>>> "org.apache.hadoop.hbase.client.RegionOfflineException: region
>>> offline: joinedcontent,,1244513452487"
>>
>> See if said region is actually offline? Try getting a row from it in
>> the shell.
>>
>>> and:
>>>
>>> "Task attempt_200906082135_0001_r_000002_0 failed to report status for
>>> 603 seconds. Killing!"
>>
>> Sounds like the nodes are heavily loaded... so loaded that either the
>> task can't report in, or it's stuck on an hbase update for so long that
>> it takes ten minutes or more to return.
>>
>> One thing to look at is disabling batching or making the batches
>> smaller. When a batch is big, it can take a while under heavy load for
>> all the row edits to go in, and the HBase client will not return until
>> all the row commits have succeeded. With smaller batches, the task is
>> more likely to return before the report period expires instead of being
>> killed for failing to check in.
>>
>> What's your MR job like? You're updating hbase in the reduce phase, I
>> presume (TableOutputFormat?). Do you need the reduce? Can you update
>> hbase in the map step? Saves on the sort the MR framework is doing -- a
>> sort that is unnecessary given that hbase orders on insertion.
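[A minimal sketch of the map-only load stack is suggesting, written
against the 0.19-era mapred API the thread is already using
(ImmutableBytesWritable, BatchUpdate, TableOutputFormat). The table name
"joinedcontent", the "content:raw" column, and the CSV parsing are
placeholders, not Bradford's actual job. With setNumReduceTasks(0) the map
output goes straight to TableOutputFormat, so there is no sort and no
reduce phase to time out.]

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class CsvLoadMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, ImmutableBytesWritable, BatchUpdate> {

  public void map(LongWritable offset, Text line,
      OutputCollector<ImmutableBytesWritable, BatchUpdate> out,
      Reporter reporter) throws IOException {
    // Placeholder parse: first CSV field is the row key, the rest is the value.
    String[] fields = line.toString().split(",", 2);
    if (fields.length < 2) {
      return;                                   // skip malformed lines
    }
    byte[] row = fields[0].getBytes();
    BatchUpdate update = new BatchUpdate(row);  // one batched edit per row
    update.put("content:raw", fields[1].getBytes());  // "family:qualifier" column
    out.collect(new ImmutableBytesWritable(row), update);
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(CsvLoadMapper.class);
    job.setJobName("csv to hbase, map-only");
    job.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    job.setMapperClass(CsvLoadMapper.class);
    job.setNumReduceTasks(0);                   // no reduce: skip the MR sort entirely
    job.setOutputFormat(TableOutputFormat.class);
    job.set(TableOutputFormat.OUTPUT_TABLE, "joinedcontent");
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(BatchUpdate.class);
    JobClient.runJob(job);
  }
}

The generic parameters and setOutputValueClass(BatchUpdate.class) both
have to agree with what the Mapper (and any configured combiner or
reducer) actually emits; a mismatch anywhere in that chain tends to show
up as a cast error like the one quoted above.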
>>
>> Can you try with a lighter load? Maybe a couple of smaller MR jobs
>> rather than one big one?
>>
>> St.Ack
>>
>>> I'm running Hadoop 0.19.1 and HBase 0.19.3, with 1 master/namenode and
>>> 8 regionservers: 2 x dual-core Intel 3.2 GHz procs and 4 GB of RAM per
>>> node, 16 map tasks, 8 reducers. I've set the MAX_HEAP in hadoop-env to
>>> 768, and the one in hbase-env is at its default of 1000. I've also
>>> done all the performance enhancements in the Wiki: the file handles,
>>> the garbage collection, and the epoll limits.
>>>
>>> What am I missing? :)
>>>
>>> Cheers,
>>> Bradford
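[For reference, the tuning Bradford lists usually amounts to something
like the sketch below. The values are illustrative assumptions, not taken
from this thread; "hadoop" stands in for whichever user runs the
HDFS/HBase daemons, and the exact epoll sysctl depends on the kernel, so
follow the wiki page for your kernel version.]

# /etc/security/limits.conf -- raise the open-file limit ("file handles")
hadoop  soft  nofile  32768
hadoop  hard  nofile  32768

# conf/hbase-env.sh -- the garbage-collection tuning from the wiki
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

# conf/hadoop-env.sh -- daemon heap in MB (768 is what Bradford reports)
export HADOOP_HEAPSIZE=768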