hbase-user mailing list archives

From "Slava Gorelik" <slava.gore...@gmail.com>
Subject Re: Regionserver fails to serve region
Date Tue, 28 Oct 2008 19:31:51 GMT
Hi. First of all, I want to say thank you for your assistance!

DEBUG on Hadoop or HBase? And how do I enable it?
fsck said that HDFS is healthy.
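
For reference, DEBUG logging is normally turned on per daemon in
conf/log4j.properties; a minimal sketch, assuming the stock Hadoop/HBase
conf layout of that era (the daemons need a restart to pick it up):

  # conf/log4j.properties under $HADOOP_HOME/conf and $HBASE_HOME/conf
  log4j.logger.org.apache.hadoop=DEBUG
  log4j.logger.org.apache.hadoop.hbase=DEBUG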

Best Regards and Thank You


On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi. HDFS capacity is about 800GB (8 datanodes) and the current usage is
>> about 30GB. This is after a total re-format of the HDFS that was done an
>> hour before.
>>
>> BTW, the logs I sent start from the first exception that I found in them.
>> Best Regards.
>>
>>
> Please enable DEBUG and retry.  Send me all logs.  What does the fsck on
> HDFS say?  There is something seriously wrong with your cluster given that
> you are having so much trouble getting it running.  Let's try and figure
> it out.
>
> St.Ack
>
>
>
>
>
>> On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:
>>
>>
>>
>>> I took a quick look, Slava (thanks for sending the files).  Here are a
>>> few notes:
>>>
>>> + The logs are from after the damage is done; the transition from good
>>> to bad is missing.  If I could see that, it would help.
>>> + But what seems plain is that your HDFS is very sick.  See this
>>> from the head of one of the regionserver logs:
>>>
>>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer
>>> Exception: java.io.IOException: Unable to create new block.
>>>   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>
>>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error
>>> Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>>> 2008-10-27 23:41:12,685 ERROR
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
>>> failed for region
>>> BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>> java.io.IOException: Could not get block locations. Aborting...
>>>
>>>
>>> If HDFS is ailing, hbase is too.  In fact, the regionservers will shut
>>> themselves down to protect against damaging or losing data:
>>>
>>> 2008-10-27 23:41:12,688 FATAL
>>> org.apache.hadoop.hbase.regionserver.Flusher:
>>> Replay of hlog required. Forcing server restart
>>>
>>> So, what's up with your HDFS?  Not enough space allotted?  What happens
>>> if you run "./bin/hadoop fsck /"?  Does that give you a clue as to what
>>> happened?  Dig in the datanode and namenode logs and look for where the
>>> exceptions start.
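
A more verbose fsck run can help localize the damage; a minimal sketch,
using flags from the 0.18-era hadoop CLI:

  ./bin/hadoop fsck / -files -blocks -locations

This lists every file with its blocks and their datanode locations, so
missing or under-replicated blocks stand out.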
>>>
>>> + The suse regionserver log had garbage in it.
>>>
>>> St.Ack
>>>
>>>
>>> Slava Gorelik wrote:
>>>
>>>
>>>
>>>> Hi.
>>>> My happiness was very short-lived :-( After I successfully added 1M rows
>>>> (50k each row) I tried to add 10M rows.
>>>> And after 3-4 hours of work it started dying. First one region server
>>>> died, then another one, and eventually the whole cluster was dead.
>>>>
>>>> I attached log files (relevant part, archived) from region servers and
>>>> from the master.
>>>>
>>>> Best Regards.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik
>>>> <slava.gorelik@gmail.com> wrote:
>>>>
>>>>   Hi.
>>>>   So far so good: after changing the file descriptors
>>>>   and dfs.datanode.socket.write.timeout, dfs.datanode.max.xcievers,
>>>>   my cluster is working stably.
>>>>   Thank You and Best Regards.
>>>>
>>>>   P.S. Regarding the missing delete-multiple-columns functionality, I
>>>>   filed a JIRA: https://issues.apache.org/jira/browse/HBASE-961
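
Until that lands, one client-side workaround is to fetch the row, match
column names against the pattern locally, and delete each match; a rough
sketch, assuming the 0.18-era client API (HTable.getRow / HTable.deleteAll;
exact signatures may differ between releases):

  import java.io.IOException;
  import java.util.regex.Pattern;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RegexColumnDelete {
    // Delete every column of the given row whose name matches the regex,
    // e.g. "colum_name_.*": fetch the row, filter the names client-side,
    // then issue one deleteAll per matching column.
    public static void deleteMatching(HTable table, byte[] row, String regex)
        throws IOException {
      Pattern pattern = Pattern.compile(regex);
      RowResult result = table.getRow(row);   // all columns of the row
      if (result == null) {
        return;
      }
      for (byte[] column : result.keySet()) {
        if (pattern.matcher(Bytes.toString(column)).matches()) {
          table.deleteAll(row, column);       // removes all cell versions
        }
      }
    }
  }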
>>>>
>>>>
>>>>
>>>>   On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack
>>>>   <stack@duboce.net> wrote:
>>>>
>>>>       Slava Gorelik wrote:
>>>>
>>>>           Hi. I haven't tried them yet; I'll try tomorrow morning. In
>>>>           general the cluster is working well; the problems begin when
>>>>           I try to add 10M rows, it happened after about 1.2M.
>>>>
>>>>       Anything else running besides the regionserver or datanodes
>>>>       that would suck resources?  When datanodes begin to slow, we
>>>>       begin to see the issue Jean-Adrien's configurations address.
>>>>       Are you uploading using MapReduce?  Are TTs running on the same
>>>>       nodes as the datanode and regionserver?  How are you doing the
>>>>       upload?  Describe what your uploader looks like (sorry if
>>>>       you've already done this).
>>>>
>>>>
>>>>           I already changed the file descriptor limit,
>>>>
>>>>       Good.
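
On Linux the per-user limit is typically raised in
/etc/security/limits.conf for the account running the daemons; a minimal
sketch (the 'hadoop' user name and the 32768 value are assumptions and
site-specific):

  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768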
>>>>
>>>>
>>>>           I'll try to change the properties:
>>>>
>>>>           <property>
>>>>             <name>dfs.datanode.socket.write.timeout</name>
>>>>             <value>0</value>
>>>>           </property>
>>>>
>>>>           <property>
>>>>             <name>dfs.datanode.max.xcievers</name>
>>>>             <value>1023</value>
>>>>           </property>
>>>>
>>>>
>>>>       Yeah, try it.
>>>>
>>>>
>>>>           And I'll let you know. Are there any other prescriptions?
>>>>           Did I miss something?
>>>>
>>>>           BTW, off topic, but I sent an e-mail to the list recently and
>>>>           I can't see it: is it possible to delete multiple columns by
>>>>           regex in any way, for example colum_name_* ?
>>>>
>>>>       Not that I know of.  If it's not in the API, it should be.
>>>>       Mind filing a JIRA?
>>>>
>>>>       Thanks Slava.
>>>>       St.Ack
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
>
