Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <4909F47B.4060307@duboce.net>
Date: Thu, 30 Oct 2008 10:52:59 -0700
From: stack <stack@duboce.net>
User-Agent: Thunderbird 2.0.0.17 (Macintosh/20080914)
MIME-Version: 1.0
To: hbase-user@hadoop.apache.org
Subject: Re: Regionserver fails to serve region
References: <20028553.post@talk.nabble.com>
	 <fdc46e690810251521v700ff9f1v33f50a6a84af0bac@mail.gmail.com>
	 <4903A4AF.7080601@duboce.net>
	 <fdc46e690810270219i176a24sa30d7a412172eb1@mail.gmail.com>
	 <fdc46e690810280414p185a7dbn4594f42a02e2cf43@mail.gmail.com>
	 <490747FD.9090500@duboce.net>
	 <fdc46e690810281136l5a4b5588kb1fa0fdeb1e1524d@mail.gmail.com>
	 <49075DE5.9050803@duboce.net>
	 <fdc46e690810281231p6e5660d3v480a2e92f6841976@mail.gmail.com>
	 <49077EA3.9090000@duboce.net>
 <fdc46e690810290251y5d5e9a59ga57eb069e5558956@mail.gmail.com>
In-Reply-To: <fdc46e690810290251y5d5e9a59ga57eb069e5558956@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Thanks for the logs, Slava.  I notice that you have not upped the file 
descriptor ulimit on your cluster.  See the head of your logs, where we 
print out the ulimit: it's 1024.  This could be one cause of your grief, 
especially since you seem to have many regions (>1000).  Please try upping it.
St.Ack
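
For anyone who would rather check the limit directly than read it off the 
head of the log, here is a minimal sketch in Python.  The nofile_warning 
helper and the 32768 threshold are assumptions made up for illustration, not 
values taken from these logs or from anything HBase ships; 
resource.getrlimit is the standard-library call.  Run it as the user that 
starts the regionserver and datanode, since limits are per user/session.

```python
import resource

# Hypothetical threshold, for illustration only; pick one suited to your cluster.
MIN_RECOMMENDED_NOFILE = 32768


def nofile_warning(soft_limit, min_recommended=MIN_RECOMMENDED_NOFILE):
    """Return a warning string if the soft open-file limit looks low, else None."""
    # RLIM_INFINITY means no limit is enforced, so there is nothing to warn about.
    if soft_limit == resource.RLIM_INFINITY:
        return None
    if soft_limit < min_recommended:
        return "open-file limit %d is below %d; consider raising it" % (
            soft_limit,
            min_recommended,
        )
    return None


if __name__ == "__main__":
    # getrlimit returns a (soft, hard) tuple for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft=%s hard=%s" % (soft, hard))
    message = nofile_warning(soft)
    if message:
        print(message)
```

With the 1024 reported in these logs, the helper would return a warning; 
with the limit raised above the chosen threshold it returns None.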


Slava Gorelik wrote:
> Hi.
> I enabled the DEBUG log level and am now sending all the logs (archived), 
> including the fsck run result.
> Today my program started failing a couple of minutes after it began; 
> the problem is very easy to reproduce, and the cluster has become very 
> unstable.
>
> Best Regards.
>
>
> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:
>
>     See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>
>     St.Ack
>
>
>     Slava Gorelik wrote:
>
>         Hi. First of all, I want to say thank you for your assistance!
>
>         DEBUG on Hadoop or HBase? And how can I enable it?
>         fsck said that HDFS is healthy.
>
>         Best Regards and Thank You
>
>
>         On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:
>
>          
>
>             Slava Gorelik wrote:
>
>                
>
>                 Hi. HDFS capacity is about 800GB (8 datanodes) and the
>                 current usage is about 30GB. This is after a total
>                 re-format of HDFS that was done an hour before.
>
>                 BTW, the logs I sent start from the first exception
>                 that I found in them.
>                 Best Regards.
>
>
>                      
>
>             Please enable DEBUG and retry.  Send me all logs.  What
>             does the fsck on HDFS say?  There is something seriously
>             wrong with your cluster if you are having so much trouble
>             getting it running.  Let's try to figure it out.
>
>             St.Ack
>
>
>
>
>
>                
>
>                 On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:
>
>
>
>                      
>
>                     I took a quick look, Slava (thanks for sending the
>                     files).  Here are a few notes:
>
>                     + The logs are from after the damage is done; the
>                     transition from good to bad is missing.  If I could
>                     see that, it would help.
>                     + But what seems plain is that your HDFS is very
>                     sick.  See this from the head of one of the
>                     regionserver logs:
>
>                     2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>
>                     2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>                     2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>                     java.io.IOException: Could not get block locations. Aborting...
>
>
>                     If HDFS is ailing, hbase is too.  In fact, the
>                     regionservers will shut themselves down to protect
>                     against damaging or losing data:
>
>                     2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart
>
>                     So, what's up with your HDFS?  Not enough space
>                     allotted?  What happens if you run
>                     "./bin/hadoop fsck /"?  Does that give you a clue as
>                     to what happened?  Dig in the datanode and namenode
>                     logs and look for where the exceptions start.  That
>                     might give you a clue.
>
>                     + The suse regionserver log had garbage in it.
>
>                     St.Ack
>
>
>                     Slava Gorelik wrote:
>
>
>
>                            
>
>                         Hi.
>                         My happiness was very short :-( After I
>                         successfully added 1M rows (50k each), I tried
>                         to add 10M rows.  After 3-4 hours of work it
>                         started dying: first one region server died,
>                         then another, and eventually the whole cluster
>                         was dead.
>
>                         I attached log files (relevant part, archived)
>                         from region servers and
>                         from the master.
>
>                         Best Regards.
>
>
>
>                         On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik
>                         <slava.gorelik@gmail.com> wrote:
>
>                          Hi.
>                          So far so good: after changing the file
>                          descriptor limit,
>                          dfs.datanode.socket.write.timeout, and
>                          dfs.datanode.max.xcievers, my cluster is stable.
>                          Thank You and Best Regards.
>
>                          P.S. Regarding the missing functionality for
>                          deleting multiple columns, I filed a JIRA:
>                          https://issues.apache.org/jira/browse/HBASE-961
>
>
>
>                          On Sun, Oct 26, 2008 at 12:58 AM, Michael
>                          Stack <stack@duboce.net> wrote:
>
>                              Slava Gorelik wrote:
>
>                                  Hi. I haven't tried them yet; I'll try
>                                  tomorrow morning. In general the cluster
>                                  is working well; the problems begin when
>                                  I try to add 10M rows. It happened after
>                                  1.2M.
>
>                              Is anything else running besides the
>                              regionserver and datanodes that would suck
>                              resources?  When datanodes begin to slow, we
>                              begin to see the issue Jean-Adrien's
>                              configurations address.  Are you uploading
>                              using MapReduce?  Are TTs running on the same
>                              nodes as the datanode and regionserver?  How
>                              are you doing the upload?  Describe what your
>                              uploader looks like (sorry if you've already
>                              done this).
>
>
>                                   I already changed the file descriptor
>                                   limit,
>
>                              Good.
>
>
>                                   I'll try to change the properties:
>                                  <property>
>                                   <name>dfs.datanode.socket.write.timeout</name>
>                                   <value>0</value>
>                                  </property>
>
>                                  <property>
>                                   <name>dfs.datanode.max.xcievers</name>
>                                   <value>1023</value>
>                                  </property>
>
>
>                              Yeah, try it.
>
>
>                                  And I'll let you know. Are there any
>                                  other prescriptions? Did I miss
>                                  something?
>
>                                  BTW, off topic, but I recently sent an
>                                  e-mail to the list and I can't see it:
>                                  Is there any way to delete multiple
>                                  columns by regex, for example
>                                  colum_name_* ?
>
>                              Not that I know of.  If it's not in the
>                              API, it should be.  Mind filing a JIRA?
>
>                              Thanks Slava.
>                              St.Ack