hbase-user mailing list archives

From "Slava Gorelik" <slava.gore...@gmail.com>
Subject Re: Regionserver fails to serve region
Date Thu, 30 Oct 2008 18:49:16 GMT
Sorry, my mistake, I did it for the wrong user name. Thanks, updating now; I'll
try again soon.


On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:

> Hi. Very strange, I see in limits.conf that it's upped.
> I attached the limits.conf; please have a look, maybe I did it wrong.
>
> Best Regards.
>
>
> On Thu, Oct 30, 2008 at 7:52 PM, stack <stack@duboce.net> wrote:
>
>> Thanks for the logs Slava.  I notice that you have not upped the ulimit on
>> your cluster.  See the head of your logs where we print out the ulimit.  It's
>> 1024.  This could be one cause of your grief, especially when you seemingly
>> have many regions (>1000).  Please try upping it.
>> St.Ack
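For anyone hitting the same thing, a minimal sketch of checking and raising the limit (the `hadoop` user name and the 32768 value are assumptions; adjust for your install):

```shell
# Show the open-file limit the current shell (and daemons started from it)
# will see; HBase also prints this at the head of the regionserver log.
ulimit -n

# To raise it, add lines like these to /etc/security/limits.conf for the
# user that runs the Hadoop/HBase daemons (user name and value here are
# assumptions), then log that user out and back in before restarting:
#
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  32768
```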
>>
>>
>>
>>
>> Slava Gorelik wrote:
>>
>>> Hi.
>>> I enabled the DEBUG log level and now I'm sending all logs (archived),
>>> including the fsck run result.
>>> Today my program started failing a couple of minutes in; the problem is
>>> very easy to reproduce, and the cluster has become very unstable.
>>>
>>> Best Regards.
>>>
>>>
>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:
>>>
>>>    See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>
>>>    St.Ack
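The FAQ entry amounts to setting the relevant loggers to DEBUG in conf/log4j.properties and restarting the daemons; a sketch of the lines involved (logger names per 0.18-era Hadoop/HBase, so treat the exact names as assumptions):

```
# conf/log4j.properties on each node (both the Hadoop and HBase installs):
log4j.logger.org.apache.hadoop=DEBUG
log4j.logger.org.apache.hadoop.hbase=DEBUG
# Restart the daemons for the change to take effect.
```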
>>>
>>>
>>>    Slava Gorelik wrote:
>>>
>>>        Hi. First of all, I want to say thank you for your assistance!
>>>
>>>
>>>        DEBUG on Hadoop or HBase? And how can I enable it?
>>>        fsck said that HDFS is healthy.
>>>
>>>        Best Regards and Thank You
>>>
>>>
>>>        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:
>>>
>>>
>>>            Slava Gorelik wrote:
>>>
>>>
>>>                Hi. HDFS capacity is about 800GB (8 datanodes) and the
>>>                current usage is about 30GB. This is after a total
>>>                re-format of HDFS made an hour before.
>>>
>>>                BTW, the logs I sent start from the first exception I
>>>                found in them.
>>>                Best Regards.
>>>
>>>
>>>
>>>            Please enable DEBUG and retry.  Send me all logs.  What does
>>>            the fsck on HDFS say?  There is something seriously wrong
>>>            with your cluster if you are having so much trouble getting
>>>            it running.  Let's try to figure it out.
>>>
>>>            St.Ack
>>>
>>>
>>>
>>>
>>>
>>>
>>>                On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:
>>>
>>>
>>>
>>>
>>>                    I took a quick look Slava (Thanks for sending the
>>>                    files).   Here's a few
>>>                    notes:
>>>
>>>                    + The logs are from after the damage is done; the
>>>                    transition from good to bad is missing.  If I could
>>>                    see that, it would help.
>>>                    + But what seems plain is that your HDFS is very
>>>                    sick.  See this from the head of one of the
>>>                    regionserver logs:
>>>
>>>                    2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer
>>>                    Exception: java.io.IOException: Unable to create new block.
>>>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>
>>>                    2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error
>>>                    Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>>>                    2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>                    Compaction/Split failed for region
>>>                    BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>                    java.io.IOException: Could not get block locations. Aborting...
>>>
>>>
>>>                    If HDFS is ailing, hbase is too.  In fact, the
>>>                    regionservers will shut themselves down to protect
>>>                    against damaging or losing data:
>>>
>>>                    2008-10-27 23:41:12,688 FATAL
>>>                    org.apache.hadoop.hbase.regionserver.Flusher:
>>>                    Replay of hlog required. Forcing server restart
>>>
>>>                    So, what's up with your HDFS?  Not enough space
>>>                    allotted?  What happens if you run "./bin/hadoop fsck
>>>                    /"?  Does that give you a clue as to what happened?
>>>                    Dig in the datanode and namenode logs.  Look for
>>>                    where the exceptions start.  It might give you a
>>>                    clue.
>>>
>>>                    + The SUSE regionserver log had garbage in it.
>>>
>>>                    St.Ack
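Sketching the checks above (log paths are assumptions; Hadoop's default log directory is `$HADOOP_HOME/logs`):

```shell
# Overall filesystem health, from the Hadoop install directory:
#   ./bin/hadoop fsck /
# A healthy run ends its report with: The filesystem under path '/' is HEALTHY

# To find where trouble started, grep each datanode/namenode log for the
# first exceptions. Demonstrated here on a sample line in the 0.18-era log
# format (the /tmp file stands in for logs/hadoop-*-datanode-*.log):
printf '2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception\n' \
  > /tmp/sample-datanode.log
grep -n 'Exception' /tmp/sample-datanode.log
```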
>>>
>>>
>>>                    Slava Gorelik wrote:
>>>
>>>
>>>
>>>
>>>                        Hi.
>>>                        My happiness was very short :-( After I
>>>                        successfully added 1M rows (50k each) I tried to
>>>                        add 10M rows. After 3-4 hours of work it started
>>>                        dying: first one region server died, then
>>>                        another, and eventually the whole cluster was
>>>                        dead.
>>>
>>>                        I attached the log files (relevant parts,
>>>                        archived) from the region servers and from the
>>>                        master.
>>>
>>>                        Best Regards.
>>>
>>>
>>>
>>>                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik
>>>                        <slava.gorelik@gmail.com> wrote:
>>>
>>>                         Hi.
>>>                         So far so good: after changing the file
>>>                         descriptor limit and the
>>>                         dfs.datanode.socket.write.timeout and
>>>                         dfs.datanode.max.xcievers settings, my cluster
>>>                         works stably.
>>>                         Thank You and Best Regards.
>>>
>>>                         P.S. Regarding the missing functionality for
>>>                         deleting multiple columns, I filed a JIRA:
>>>                         https://issues.apache.org/jira/browse/HBASE-961
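For reference, a sketch of where those two settings live: hadoop-site.xml on each datanode (file name per 0.18-era Hadoop, so an assumption), with the values discussed in this thread; HDFS needs a restart to pick them up.

```xml
<!-- hadoop-site.xml on each datanode; restart HDFS after editing -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- 0 disables the write-side socket timeout -->
  <value>0</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- cap on concurrent block-transfer (xceiver) threads -->
  <value>1023</value>
</property>
```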
>>>
>>>
>>>
>>>                         On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack
>>>                        <stack@duboce.net> wrote:
>>>
>>>                             Slava Gorelik wrote:
>>>
>>>                                 Hi. Haven't tried them yet; I'll try
>>>                                 tomorrow morning. In general the cluster
>>>                                 is working well; the problems begin when
>>>                                 I try to add 10M rows, and after 1.2M it
>>>                                 happened.
>>>
>>>                             Anything else running beside the
>>>                        regionserver or datanodes
>>>                             that would suck resources?  When
>>>                        datanodes begin to slow, we
>>>                             begin to see the issue Jean-Adrien's
>>>                        configurations address.
>>>                              Are you uploading using MapReduce?  Are
>>>                        TTs running on same
>>>                             nodes as the datanode and regionserver?
>>>                         How are you doing the
>>>                             upload?  Describe what your uploader
>>>                        looks like (Sorry if
>>>                             you've already done this).
>>>
>>>
>>>                                  I already changed the limit of file
>>>                                 descriptors,
>>>
>>>                             Good.
>>>
>>>
>>>                                  I'll try
>>>                                 to change the properties:
>>>                                  <property>
>>>                        <name>dfs.datanode.socket.write.timeout</name>
>>>                                  <value>0</value>
>>>                                 </property>
>>>
>>>                                 <property>
>>>                                  <name>dfs.datanode.max.xcievers</name>
>>>                                  <value>1023</value>
>>>                                 </property>
>>>
>>>
>>>                             Yeah, try it.
>>>
>>>
>>>                                 And I'll let you know. Are there any
>>>                                 other prescriptions? Did I miss
>>>                                 something?
>>>
>>>                                 BTW, off topic, but I sent an e-mail to
>>>                                 the list recently and I can't see it: is
>>>                                 it possible to delete multiple columns
>>>                                 in any way by regex, for example
>>>                                 column_name_* ?
>>>
>>>                             Not that I know of.  If it's not in the
>>>                             API, it should be.  Mind filing a JIRA?
>>>
>>>                             Thanks Slava.
>>>                             St.Ack
>>>
>>>
>>
>
