hbase-user mailing list archives

From "Slava Gorelik" <slava.gore...@gmail.com>
Subject Re: Regionserver fails to serve region
Date Thu, 30 Oct 2008 20:56:00 GMT
Hi. I also noticed this exception.
Strange that this exception happens every time on the same regionserver.
I tried to find the directory hdfs://X:9000/hbase/BizDB/735893330 - it does not exist.
Very strange, but the history folder in hadoop is empty.

Would reformatting HDFS help?

One more thing at the last minute: I found that one node in the cluster has a
totally different time. Could this be the cause of such problems?

P.S. About the logs, is it possible to send them to some email address? Each log
file compressed is about 1 MB, and only in 3 files did I find exceptions.


On Thu, Oct 30, 2008 at 10:25 PM, stack <stack@duboce.net> wrote:

> Can you put them someplace that I can pull them?
>
> I took another look at your logs.  I see that a region is missing files.
>  That means it will never open and just keep trying.  Grep your logs for
> FileNotFound.  You'll see this:
>
> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
> File does not exist:
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
> File does not exist:
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
>
> Try shutting down, and removing these files.   Remove the following
> directories:
>
>
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
>
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
>
> Then retry restarting.
>
> You can try and figure how these files got lost by going back in your
> history.
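
A shell sketch of the cleanup stack describes above. It is hypothetical: it only echoes the `hadoop fs -rmr` commands so they can be reviewed before being run against a live cluster (and only after HBase is shut down). `X` is the anonymized namenode host from the logs; substitute your own.

```shell
# Build the four removal commands for the two lost mapfile ids from the
# FileNotFoundExceptions above. Echo instead of execute so the commands
# can be inspected first; pipe the output to sh only once HBase is down.
BASE="hdfs://X:9000/hbase/BizDB/735893330/BusinessObject"
for id in 647541142630058906 2243545870343537637; do
  for dir in mapfiles info; do
    echo "hadoop fs -rmr $BASE/$dir/$id"
  done
done
```
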
>
>
> St.Ack
>
>
>
> Slava Gorelik wrote:
>
>> Michael, I still have the problem, but the log files are very big (50MB
>> each); even compressed they are bigger than the limit for this mailing list.
>> Most of the problems happened during compaction (I see it in the log);
>> maybe I can send some parts of the logs?
>>
>> Best Regards.
>>
>> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>
>>
>>
>>> Sorry, my mistake, I did it for the wrong user name. Thanks, updating now;
>>> I will try again soon.
>>>
>>>
>>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>>
>>>
>>>
>>>> Hi. Very strange, I see in limits.conf that it's upped.
>>>> I attached the limits.conf; please have a look, maybe I did it wrong.
>>>>
>>>> Best Regards.
>>>>
>>>>
>>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <stack@duboce.net> wrote:
>>>>
>>>>
>>>>
>>>>> Thanks for the logs Slava.  I notice that you have not upped the ulimit
>>>>> on your cluster.  See the head of your logs where we print out the
>>>>> ulimit.
>>>>>  It's 1024.  This could be one cause of your grief, especially when you
>>>>> seemingly have many regions (>1000).  Please try upping it.
>>>>> St.Ack
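
For reference, a common way to check and raise the limit on Linux (a sketch: the 32768 value and the `hbase` user name are assumptions, and PAM must be configured to apply limits.conf at login):

```shell
# Show the current per-process open-file limit for this shell
# (the head of the regionserver logs showed 1024).
ulimit -n

# Typical /etc/security/limits.conf lines to raise it for the user
# running the daemons (hypothetical user name; takes effect on re-login):
#   hbase  soft  nofile  32768
#   hbase  hard  nofile  32768
```
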
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi.
>>>>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>>>>> including fsck run result.
>>>>>> Today my program started to fail a couple of minutes in; it's
>>>>>> very easy to reproduce the problem, and the cluster became very unstable.
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>   See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>>
>>>>>>   St.Ack
>>>>>>
>>>>>>
>>>>>>   Slava Gorelik wrote:
>>>>>>
>>>>>>       Hi. First of all, I want to say thank you for your assistance!!!
>>>>>>
>>>>>>
>>>>>>       DEBUG on hadoop or hbase? And how can I enable it?
>>>>>>       fsck said that HDFS is healthy.
>>>>>>
>>>>>>       Best Regards and Thank You
>>>>>>
>>>>>>
>>>>>>       On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>
>>>>>>           Slava Gorelik wrote:
>>>>>>
>>>>>>
>>>>>>               Hi. HDFS capacity is about 800GB (8 datanodes) and the
>>>>>>               current usage is about 30GB. This is after a total
>>>>>>               re-format of the HDFS that was made an hour before.
>>>>>>
>>>>>>               BTW, the logs I sent are from the first exception
>>>>>>               that I found in them.
>>>>>>               Best Regards.
>>>>>>
>>>>>>
>>>>>>
>>>>>>           Please enable DEBUG and retry.  Send me all logs.  What
>>>>>>           does the fsck on
>>>>>>           HDFS say?  There is something seriously wrong with your
>>>>>>           cluster that you are
>>>>>>           having so much trouble getting it running.  Let's try to
>>>>>>           figure it out.
>>>>>>
>>>>>>           St.Ack
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>               On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                   I took a quick look Slava (Thanks for sending the
>>>>>>                   files).   Here's a few
>>>>>>                   notes:
>>>>>>
>>>>>>                   + The logs are from after the damage is done; the
>>>>>>                   transition from good to
>>>>>>                   bad is missing.  If I could see that, that would
>>>>>> help
>>>>>>                   + But what seems to be plain is that your
>>>>>>                   HDFS is very sick.  See
>>>>>>                   this
>>>>>>                   from head of one of the regionserver logs:
>>>>>>
>>>>>>                   2008-10-27 23:41:12,682 WARN
>>>>>>                   org.apache.hadoop.dfs.DFSClient:
>>>>>>                   DataStreamer
>>>>>>                   Exception: java.io.IOException: Unable to create
>>>>>>                   new block.
>>>>>>                    at
>>>>>>
>>>>>>
>>>>>>
>>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>>                    at
>>>>>>
>>>>>>
>>>>>>
>>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>>                    at
>>>>>>
>>>>>>
>>>>>>
>>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>>
>>>>>>                   2008-10-27 23:41:12,682 WARN
>>>>>>                   org.apache.hadoop.dfs.DFSClient: Error
>>>>>>                   Recovery for block blk_-5188192041705782716_60000
>>>>>>                   bad datanode[0]
>>>>>>                   2008-10-27 23:41:12,685 ERROR
>>>>>>
>>>>>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>>                   Compaction/Split
>>>>>>                   failed for region
>>>>>>
>>>>>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>>                   java.io.IOException: Could not get block
>>>>>>                   locations. Aborting...
>>>>>>
>>>>>>
>>>>>>                   If HDFS is ailing, hbase is too.  In fact, the
>>>>>>                   regionservers will shut themselves down to protect
>>>>>>                   against damaging or losing data:
>>>>>>
>>>>>>                   2008-10-27 23:41:12,688 FATAL
>>>>>>                   org.apache.hadoop.hbase.regionserver.Flusher:
>>>>>>                   Replay of hlog required. Forcing server restart
>>>>>>
>>>>>>                   So, whats up with your HDFS?  Not enough space
>>>>>>                   alloted?  What happens if
>>>>>>                   you run "./bin/hadoop fsck /"?  Does that give you
>>>>>>                   a clue as to what
>>>>>>                   happened?  Dig in the datanode and namenode logs.
>>>>>>                    Look for where the
>>>>>>                   exceptions start.  It might give you a clue.
>>>>>>
>>>>>>                   + The suse regionserver log had garbage in it.
>>>>>>
>>>>>>                   St.Ack
>>>>>>
>>>>>>
>>>>>>                   Slava Gorelik wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                       Hi.
>>>>>>                       My happiness was very short :-( After I
>>>>>>                       successfully added 1M rows (50k each row) I
>>>>>>                       tried to add 10M rows, and after 3-4 working
>>>>>>                       hours it started dying. First one region server
>>>>>>                       died, then another one, and eventually the whole
>>>>>>                       cluster was dead.
>>>>>>
>>>>>>                       I attached log files (relevant part, archived)
>>>>>>                       from region servers and
>>>>>>                       from the master.
>>>>>>
>>>>>>                       Best Regards.
>>>>>>
>>>>>>
>>>>>>
>>>>>>                       On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik
>>>>>>                       <slava.gorelik@gmail.com> wrote:
>>>>>>
>>>>>>                        Hi.
>>>>>>                        So far so good, after changing the file
>>>>>>                       descriptors
>>>>>>                        and dfs.datanode.socket.write.timeout,
>>>>>>                       dfs.datanode.max.xcievers
>>>>>>                        my cluster works stable.
>>>>>>                        Thank You and Best Regards.
>>>>>>
>>>>>>                        P.S. Regarding the missing delete-multiple-columns
>>>>>>                        functionality, I filed a JIRA:
>>>>>>                        https://issues.apache.org/jira/browse/HBASE-961
>>>>>>
>>>>>>
>>>>>>
>>>>>>                        On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>>>>>                        Stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>                            Slava Gorelik wrote:
>>>>>>
>>>>>>                                Hi. Haven't tried them yet; I'll try
>>>>>>                                tomorrow morning. In general the cluster
>>>>>>                                is working well; the problems begin if
>>>>>>                                I'm trying to add 10M rows - after 1.2M
>>>>>>                                it happened.
>>>>>>
>>>>>>                            Anything else running beside the
>>>>>>                       regionserver or datanodes
>>>>>>                            that would suck resources?  When
>>>>>>                       datanodes begin to slow, we
>>>>>>                            begin to see the issue Jean-Adrien's
>>>>>>                       configurations address.
>>>>>>                             Are you uploading using MapReduce?  Are
>>>>>>                       TTs running on same
>>>>>>                            nodes as the datanode and regionserver?
>>>>>>                        How are you doing the
>>>>>>                            upload?  Describe what your uploader
>>>>>>                       looks like (Sorry if
>>>>>>                            you've already done this).
>>>>>>
>>>>>>
>>>>>>                                 I already changed the limit of files
>>>>>>                       descriptors,
>>>>>>
>>>>>>                            Good.
>>>>>>
>>>>>>
>>>>>>                                 I'll try
>>>>>>                                to change the properties:
>>>>>>                                 <property>
>>>>>>                       <name>dfs.datanode.socket.write.timeout</name>
>>>>>>                                 <value>0</value>
>>>>>>                                </property>
>>>>>>
>>>>>>                                <property>
>>>>>>                                 <name>dfs.datanode.max.xcievers</name>
>>>>>>                                 <value>1023</value>
>>>>>>                                </property>
>>>>>>
>>>>>>
>>>>>>                            Yeah, try it.
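
Cleaned up, the two hadoop-site.xml entries quoted above read as follows (values exactly as in the thread; a `dfs.datanode.socket.write.timeout` of 0 disables the datanode write timeout):

```xml
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1023</value>
</property>
```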
>>>>>>
>>>>>>
>>>>>>                                And I'll let you know. Are there any
>>>>>>                                other prescriptions? Did I miss
>>>>>>                                something?
>>>>>>
>>>>>>                                BTW, off topic, but I sent an e-mail
>>>>>>                                recently to the list and
>>>>>>                                I can't see it:
>>>>>>                                Is it possible to delete multiple
>>>>>>                       columns in any way by
>>>>>>                                regex : for example
>>>>>>                                colum_name_* ?
>>>>>>
>>>>>>                            Not that I know of.  If it's not in the
>>>>>>                       API, it should be.
>>>>>>                             Mind filing a JIRA?
>>>>>>
>>>>>>                            Thanks Slava.
>>>>>>                            St.Ack
>>>>>>
