hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Regionserver fails to serve region
Date Thu, 30 Oct 2008 20:25:45 GMT
Can you put them someplace that I can pull them?

I took another look at your logs.  I see that a region is missing 
files.  That means it will never open and just keep trying.  Grep your 
logs for FileNotFound.  You'll see this:

hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: 
File does not exist: 
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: 
File does not exist: 
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data

Try shutting down and removing those files.  Remove the following 
directories:

hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637

Then retry restarting.
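
A minimal sketch of the above from the shell, assuming the stock bin/ 
scripts and that the fs commands are run from your Hadoop install 
(adjust paths to your layout):

  % grep FileNotFound logs/hbase-*-regionserver-*.log
  % ./bin/stop-hbase.sh
  % ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
  % ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
  % ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
  % ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
  % ./bin/start-hbase.sh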

You can try to figure out how these files got lost by going back 
through your history.

St.Ack



Slava Gorelik wrote:
> Michael, I still have the problem, but the log files are very big (50MB each);
> even compressed they are bigger than the limit for this mailing list.
> Most of the problems happened during compaction (I see it in the log); maybe
> I can send some parts of the logs?
>
> Best Regards.
>
> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>
>   
>> Sorry, my mistake, I did it for the wrong user name. Thanks, updating now;
>> I'll try again soon.
>>
>>
>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>
>>     
>>> Hi. Very strange, I see in limits.conf that it's upped.
>>> I attached the limits.conf; please have a look, maybe I did it wrong.
>>>
>>> Best Regards.
>>>
>>>
>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <stack@duboce.net> wrote:
>>>
>>>       
>>>> Thanks for the logs Slava.  I notice that you have not upped the ulimit
>>>> on your cluster.  See the head of your logs where we print out the ulimit.
>>>>  It's 1024.  This could be one cause of your grief, especially when you
>>>> seemingly have many regions (>1000).  Please try upping it.
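>>>>
>>>> For example, a sketch of /etc/security/limits.conf, assuming the
>>>> daemons run as user 'hadoop' (substitute your actual user):
>>>>
>>>>   hadoop  soft  nofile  32768
>>>>   hadoop  hard  nofile  32768
>>>>
>>>> Log in afresh afterwards and verify with 'ulimit -n'.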
>>>> St.Ack
>>>>
>>>>
>>>>
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>         
>>>>> Hi.
>>>>> I enabled the DEBUG log level and now I'm sending all logs (archived),
>>>>> including the fsck run result.
>>>>> Today my program started to fail a couple of minutes in; it's
>>>>> very easy to reproduce the problem, and the cluster became very unstable.
>>>>>
>>>>> Best Regards.
>>>>>
>>>>>
>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:
>>>>>
>>>>>    See http://wiki.apache.org/hadoop/Hbase/FAQ#5
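>>>>>
>>>>>    For example, a sketch assuming the stock conf/log4j.properties
>>>>>    layout (the FAQ has the authoritative instructions):
>>>>>
>>>>>      log4j.logger.org.apache.hadoop.hbase=DEBUG
>>>>>
>>>>>    Restart the daemons afterwards so the change is picked up.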
>>>>>
>>>>>    St.Ack
>>>>>
>>>>>
>>>>>    Slava Gorelik wrote:
>>>>>
>>>>>        Hi. First of all I want to say thank you for your assistance!!!
>>>>>
>>>>>
>>>>>        DEBUG on Hadoop or HBase? And how can I enable it?
>>>>>        fsck said that HDFS is healthy.
>>>>>
>>>>>        Best Regards and Thank You
>>>>>
>>>>>
>>>>>        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:
>>>>>
>>>>>
>>>>>            Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>                Hi. HDFS capacity is about 800GB (8 datanodes) and the
>>>>>                current usage is about 30GB. This is after a total
>>>>>                re-format of the HDFS that was done an hour before.
>>>>>
>>>>>                BTW, the logs I sent are from the first exception that
>>>>>                I found in them.
>>>>>                Best Regards.
>>>>>
>>>>>
>>>>>
>>>>>            Please enable DEBUG and retry.  Send me all logs.  What
>>>>>            does the fsck on HDFS say?  There is something seriously
>>>>>            wrong with your cluster if you are having so much trouble
>>>>>            getting it running.  Let's try to figure it out.
>>>>>
>>>>>            St.Ack
>>>>>
>>>>>                On Tue, Oct 28, 2008 at 7:12 PM, stack
>>>>>                <stack@duboce.net> wrote:
>>>>>
>>>>>                    I took a quick look Slava (Thanks for sending the
>>>>>                    files).  Here are a few notes:
>>>>>
>>>>>                    + The logs are from after the damage is done; the
>>>>>                    transition from good to bad is missing.  If I could
>>>>>                    see that, it would help.
>>>>>                    + But what seems plain is that your HDFS is very
>>>>>                    sick.  See this from the head of one of the
>>>>>                    regionserver logs:
>>>>>
>>>>>                    2008-10-27 23:41:12,682 WARN
>>>>>                    org.apache.hadoop.dfs.DFSClient:
>>>>>                    DataStreamer
>>>>>                    Exception: java.io.IOException: Unable to create
>>>>>                    new block.
>>>>>                     at
>>>>>
>>>>>
>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>                     at
>>>>>
>>>>>
>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>                     at
>>>>>
>>>>>
>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>
>>>>>                    2008-10-27 23:41:12,682 WARN
>>>>>                    org.apache.hadoop.dfs.DFSClient: Error
>>>>>                    Recovery for block blk_-5188192041705782716_60000
>>>>>                    bad datanode[0]
>>>>>                    2008-10-27 23:41:12,685 ERROR
>>>>>
>>>>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>                    Compaction/Split
>>>>>                    failed for region
>>>>>
>>>>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>                    java.io.IOException: Could not get block
>>>>>                    locations. Aborting...
>>>>>
>>>>>
>>>>>                    If HDFS is ailing, hbase is too.  In fact, the
>>>>>                    regionservers will shut themselves down to protect
>>>>>                    against damaging or losing data:
>>>>>
>>>>>                    2008-10-27 23:41:12,688 FATAL
>>>>>                    org.apache.hadoop.hbase.regionserver.Flusher:
>>>>>                    Replay of hlog required. Forcing server restart
>>>>>
>>>>>                    So, what's up with your HDFS?  Not enough space
>>>>>                    allotted?  What happens if you run
>>>>>                    "./bin/hadoop fsck /" (see the example below)?  Does
>>>>>                    that give you a clue as to what happened?  Dig into
>>>>>                    the datanode and namenode logs and look for where
>>>>>                    the exceptions start.
>>>>>
>>>>>                    + The suse regionserver log had garbage in it.
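>>>>>
>>>>>                    As for the fsck example promised above, a sketch
>>>>>                    (the extra flags should exist in this Hadoop
>>>>>                    version, but check 'hadoop fsck' usage):
>>>>>
>>>>>                      % ./bin/hadoop fsck / -files -blocks -locations
>>>>>
>>>>>                    Anything reported CORRUPT or MISSING points at the
>>>>>                    damaged files.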
>>>>>
>>>>>                    St.Ack
>>>>>
>>>>>
>>>>>                    Slava Gorelik wrote:
>>>>>
>>>>>                        Hi.
>>>>>                        My happiness was very short :-( After I
>>>>>                        successfully added 1M rows (50k
>>>>>                        each row) I tried to add 10M rows.
>>>>>                        And after 3-4 working hours it started
>>>>>                        dying. First one region server
>>>>>                        died, then another one, and eventually the
>>>>>                        whole cluster was dead.
>>>>>
>>>>>                        I attached log files (relevant part, archived)
>>>>>                        from region servers and
>>>>>                        from the master.
>>>>>
>>>>>                        Best Regards.
>>>>>
>>>>>
>>>>>
>>>>>                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik
>>>>>                        <slava.gorelik@gmail.com> wrote:
>>>>>
>>>>>                         Hi.
>>>>>                         So far so good; after changing the file
>>>>>                         descriptors, dfs.datanode.socket.write.timeout,
>>>>>                         and dfs.datanode.max.xcievers,
>>>>>                         my cluster works stably.
>>>>>                         Thank You and Best Regards.
>>>>>
>>>>>                         P.S. Regarding the missing functionality for
>>>>>                         deleting multiple columns, I filed a JIRA:
>>>>>                         https://issues.apache.org/jira/browse/HBASE-961
>>>>>
>>>>>
>>>>>
>>>>>                         On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>>>>                         Stack <stack@duboce.net> wrote:
>>>>>
>>>>>                             Slava Gorelik wrote:
>>>>>
>>>>>                                 Hi. Haven't tried them yet; I'll try
>>>>>                                 tomorrow morning. In general the
>>>>>                                 cluster is working well; the problems
>>>>>                                 begin when I'm trying to add 10M rows.
>>>>>                                 After 1.2M it happened.
>>>>>
>>>>>                             Anything else running besides the
>>>>>                        regionserver or datanodes
>>>>>                             that would suck resources?  When
>>>>>                        datanodes begin to slow, we
>>>>>                             begin to see the issue Jean-Adrien's
>>>>>                        configurations address.
>>>>>                              Are you uploading using MapReduce?  Are
>>>>>                             TTs running on the same nodes as the
>>>>>                             datanode and regionserver?
>>>>>                         How are you doing the
>>>>>                             upload?  Describe what your uploader
>>>>>                        looks like (Sorry if
>>>>>                             you've already done this).
>>>>>
>>>>>
>>>>>                                  I already changed the limit of file
>>>>>                                 descriptors,
>>>>>
>>>>>                             Good.
>>>>>
>>>>>
>>>>>                                  I'll try
>>>>>                                 to change the properties:
>>>>>                                 <property>
>>>>>                                   <name>dfs.datanode.socket.write.timeout</name>
>>>>>                                   <value>0</value>
>>>>>                                 </property>
>>>>>
>>>>>                                 <property>
>>>>>                                   <name>dfs.datanode.max.xcievers</name>
>>>>>                                   <value>1023</value>
>>>>>                                 </property>
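>>>>>
>>>>>                                 (As I understand it, these go in the
>>>>>                                 Hadoop conf/hadoop-site.xml, with an
>>>>>                                 HDFS restart to apply them; correct
>>>>>                                 me if that's wrong.)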
>>>>>
>>>>>
>>>>>                             Yeah, try it.
>>>>>
>>>>>
>>>>>                                 And I'll let you know. Are there any
>>>>>                                 other prescriptions? Did I miss
>>>>>                                 something?
>>>>>
>>>>>                                 BTW, off topic, but I sent an e-mail
>>>>>                                 recently to the list and I can't see it:
>>>>>                                 Is it possible to delete multiple
>>>>>                                 columns in any way by regex, for
>>>>>                                 example column_name_*?
>>>>>
>>>>>                             Not that I know of.  If it's not in the
>>>>>                        API, it should be.
>>>>>                              Mind filing a JIRA?
>>>>>
>>>>>                             Thanks Slava.
>>>>>                             St.Ack
>>>>>

