From: Slava Gorelik <slava.gorelik@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Thu, 30 Oct 2008 20:39:07 +0200
Subject: Re: Regionserver fails to serve region
Hi.
Very strange; I see in limits.conf that it's upped.
I attached the limits.conf, please have a look, maybe I did it wrong.

Best Regards.


On Thu, Oct 30, 2008 at 7:52 PM, stack <stack@duboce.net> wrote:
Thanks for the logs, Slava.  I notice that you have not upped the ulimit on your cluster.  See the head of your logs, where we print out the ulimit: it's 1024.  This could be one cause of your grief, especially when you seemingly have many regions (>1000).  Please try upping it.
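
Concretely, that usually means raising the open-files limit in /etc/security/limits.conf on every node. A minimal sketch, assuming the daemons run as a user named "hadoop" (substitute your own account; 32768 is just a commonly used ceiling, not a tuned value):

    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768

After logging in again as that user, "ulimit -n" should report the new value.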

St.Ack




Slava Gorelik wrote:
Hi.

I enabled the DEBUG log level and now I'm sending all logs (archived), including the fsck run result.
Today my program started to fail a couple of minutes in; it's very easy to reproduce the problem, and the cluster has become very unstable.

Best Regards.


On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:

   See http://wiki.apache.org/hadoop/Hbase/FAQ#5
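
    In short, the knob is conf/log4j.properties on each node. A minimal
    sketch, assuming the stock log4j setup shipped with this era of
    Hadoop/HBase (set either or both, then restart the daemons):

        log4j.logger.org.apache.hadoop=DEBUG
        log4j.logger.org.apache.hadoop.hbase=DEBUG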

   St.Ack


   Slava Gorelik wrote:

        Hi. First of all I want to say thank you for your assistance!


        DEBUG on Hadoop or HBase? And how can I enable it?
       fsck said that HDFS is healthy.

       Best Regards and Thank You


        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:

       
           Slava Gorelik wrote:

             
                Hi. HDFS capacity is about 800GB (8 datanodes) and
                the current usage is about 30GB. This is after a
                total re-format of the HDFS, done an hour before.

                BTW, the logs I sent are from the first exception
                that I found in them.
               Best Regards.


                   
            Please enable DEBUG and retry.  Send me all logs.  What
            does the fsck on HDFS say?  There is something seriously
            wrong with your cluster if you are having so much trouble
            getting it running.  Let's try and figure it out.

           St.Ack





             
                On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:



                   
                    I took a quick look, Slava (thanks for sending
                    the files).  Here are a few notes:

                    + The logs are from after the damage is done; the
                    transition from good to bad is missing.  If I
                    could see that, it would help.
                    + What seems plain is that your HDFS is very
                    sick.  See this from the head of one of the
                    regionserver logs:

                    2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
                        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
                        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
                        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

                    2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
                    2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
                    java.io.IOException: Could not get block locations. Aborting...


                    If HDFS is ailing, HBase is too.  In fact, the
                    regionservers will shut themselves down to
                    protect against damaging or losing data:

                    2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart

                    So, what's up with your HDFS?  Not enough space
                    allotted?  What happens if you run "./bin/hadoop
                    fsck /"?  Does that give you a clue as to what
                    happened?  Dig into the datanode and namenode
                    logs and look for where the exceptions start;
                    see the sketch after these notes.

                    + The SUSE regionserver log had garbage in it.
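
                    For reference, the fsck check is a one-liner; a
                    sketch of a healthy run (output abbreviated, and
                    the exact field names vary a bit across Hadoop
                    releases):

                        $ ./bin/hadoop fsck /
                        ...
                        Status: HEALTHY
                         Corrupt blocks:          0
                         Under-replicated blocks: 0

                    Anything other than HEALTHY, or non-zero corrupt
                    or under-replicated counts, points at an HDFS
                    problem rather than an HBase one.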

                   St.Ack


                   Slava Gorelik wrote:



                         
                        Hi.
                        My happiness was very short :-( After I
                        successfully added 1M rows (50k per row), I
                        tried to add 10M rows.  After 3-4 working
                        hours it started dying: first one region
                        server died, then another, and eventually
                        the whole cluster was dead.

                        I attached log files (the relevant parts,
                        archived) from the region servers and from
                        the master.

                       Best Regards.



                        On Mon, Oct 27, 2008 at 11:19 AM, Slava
                        Gorelik <slava.gorelik@gmail.com> wrote:

                         Hi.
                         So far so good: after changing the file
                         descriptors, dfs.datanode.socket.write.timeout,
                         and dfs.datanode.max.xcievers, my cluster
                         works stably.
                         Thank You and Best Regards.

                         P.S. Regarding the missing multiple-column
                         delete functionality, I filed a JIRA:
                         https://issues.apache.org/jira/browse/HBASE-961



                         On Sun, Oct 26, 2008 at 12:58 AM, Michael
                         Stack <stack@duboce.net> wrote:

                            Slava Gorelik wrote:

                                 Hi. Haven't tried them yet; I'll try
                                 tomorrow morning.  In general the
                                 cluster is working well; the problems
                                 begin when I try to add 10M rows.  It
                                 happened after 1.2M.

                             Anything else running besides the
                             regionserver or datanodes that would
                             suck resources?  When datanodes begin to
                             slow, we begin to see the issue
                             Jean-Adrien's configurations address.
                             Are you uploading using MapReduce?  Are
                             TTs running on the same nodes as the
                             datanode and regionserver?  How are you
                             doing the upload?  Describe what your
                             uploader looks like (sorry if you've
                             already done this).


                                 I already changed the limit on file
                                 descriptors,

                            Good.


                                 I'll try to change the properties:

                                 <property>
                                   <name>dfs.datanode.socket.write.timeout</name>
                                   <value>0</value>
                                 </property>

                                 <property>
                                   <name>dfs.datanode.max.xcievers</name>
                                   <value>1023</value>
                                 </property>


                            Yeah, try it.
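
                             (For context: a write timeout of 0
                             disables the datanode's socket write
                             timeout entirely, and max.xcievers
                             raises the cap on concurrent DataXceiver
                             threads per datanode, 256 by default;
                             hitting either limit is a common cause
                             of "Could not get block locations"
                             errors under load.)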


                                 And I'll let you know.  Are there
                                 any other prescriptions?  Did I miss
                                 something?

                                 BTW, off topic, but I sent an e-mail
                                 to the list recently and I can't see
                                 it: is it possible to delete multiple
                                 columns in any way by regex, for
                                 example column_name_*?

                             Not that I know of.  If it's not in the
                             API, it should be.  Mind filing a JIRA?
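
                             In the meantime, a client-side workaround
                             is to fetch the row, match the column
                             names yourself, and batch the deletes.  A
                             rough sketch against the 0.18-era client
                             API (the table name, row key, and prefix
                             below are made up for illustration; treat
                             the exact method signatures as assumptions
                             to check against your release):

                             import java.io.IOException;
                             import org.apache.hadoop.hbase.HBaseConfiguration;
                             import org.apache.hadoop.hbase.client.HTable;
                             import org.apache.hadoop.hbase.io.BatchUpdate;
                             import org.apache.hadoop.hbase.io.RowResult;

                             public class ColumnPrefixDelete {
                               // Delete every column of a row whose name starts
                               // with the given prefix (e.g. "data:column_name_").
                               static void deleteByPrefix(HTable table, String row,
                                   String prefix) throws IOException {
                                 RowResult result = table.getRow(row);  // all columns of the row
                                 BatchUpdate update = new BatchUpdate(row);
                                 for (byte[] column : result.keySet()) {
                                   // Column names are "family:qualifier" bytes;
                                   // match them client-side.
                                   if (new String(column).startsWith(prefix)) {
                                     update.delete(column);
                                   }
                                 }
                                 table.commit(update);  // one round trip for all deletes
                               }

                               public static void main(String[] args) throws IOException {
                                 HTable table = new HTable(new HBaseConfiguration(), "BizDB");
                                 deleteByPrefix(table, "some-row-key", "data:column_name_");
                               }
                             }

                             A true regex (column_name_.*) would swap
                             the startsWith for java.util.regex.Pattern,
                             but a prefix match covers the example given.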

                            Thanks Slava.
                            St.Ack






                               
                   
             

       



