hbase-user mailing list archives

From Luke Forehand <luke.foreh...@networkedinsights.com>
Subject Re: Hanging regionservers
Date Sun, 18 Jul 2010 20:36:52 GMT
I experienced the hang on my second job attempt.  I will be pastebinning stacktraces and logs
of all three servers tonight.  The datanode log of one of the servers is way bigger than the
rest and that's all the analysis I've done so far.  Meeting with Cloudera on Monday and they'll
probably want me to migrate to CDH3.  Need to mow the lawn...  I'll report back soon.


On 7/16/10 6:34 PM, "Ryan Rawson" <ryanobjc@gmail.com> wrote:

According to Todd, there is some kind of weird thread coordination
issue which is worked around by setting the timeout to 0, even though
we aren't actually hitting any timeouts in the failure case.

And it might have been fixed in CDH3.  I haven't had a chance to run it
yet so I can't say.


On Fri, Jul 16, 2010 at 3:32 PM, Stack <stack@duboce.net> wrote:
> So, it seems like you are bypassing the issue by having no timeout on
> the socket.  Would for sure be interested, though, if you still have
> the issue on cdh3b2.  Most folks will not be running with no socket
> timeout.
> Thanks Luke.
> St.Ack
> On Fri, Jul 16, 2010 at 3:01 PM, Luke Forehand
> <luke.forehand@networkedinsights.com> wrote:
>> Using Ryan Rawson's suggested config tweaks, we have just completed a successful
>> job run with a 15GB sequence file, no hang.  I'm setting up to have multiple files process
>> this weekend with the new settings.  :-)  I believe the dfs socket write timeout being indefinite
>> was the trick.
>> I'll post my results on Monday.  Thanks for the support thus far!
>> -Luke
>> On 7/15/10 10:17 PM, "Ryan Rawson" <ryanobjc@gmail.com> wrote:
>> I'm not seeing anything in that logfile. You are seeing compactions
>> for various regions, but I'm not seeing flushes (typical during insert
>> loads) and nothing else. One thing we look to see is a log message
>> "Blocking updates" which indicates that a particular region has
>> decided it's holding up to prevent taking too many inserts.
>> Like I said, you could be seeing this on a different regionserver, if
>> all the clients are blocked on 1 regionserver and can't get to the
>> others then most will look idle and only one will actually show
>> anything interesting in the log.
>> Can you check for this behaviour?
>> Also if you want to tweak the config with the values I pasted that should help.
>> On Thu, Jul 15, 2010 at 7:25 PM, Luke Forehand
>> <luke.forehand@networkedinsights.com> wrote:
>>> It looks like we are going straight from the default config, no explicit setting
>>> of anything.
>>> On 7/15/10 9:03 PM, "Ryan Rawson" <ryanobjc@gmail.com> wrote:
>>> In this case the regionserver isn't actually doing anything - all the
>>> IPC thread handlers are waiting in their queue handoff thingy (how
>>> they get socket/work to do).
>>> Something elsewhere perhaps?  Check the logs of your jobs, there might
>>> be something interesting there.
>>> One thing that frequently happens is you overrun 1 regionserver with
>>> edits and it isn't flushing fast enough, so it pauses updates and all
>>> clients end up stuck on it.
>>> What was that config again?  I use these settings:
>>> <property>
>>>  <name>hbase.hstore.blockingStoreFiles</name>
>>>  <value>15</value>
>>> </property>
>>> <property>
>>>  <name>dfs.datanode.socket.write.timeout</name>
>>>  <value>0</value>
>>> </property>
>>> <property>
>>>  <name>hbase.hregion.memstore.block.multiplier</name>
>>>  <value>8</value>
>>> </property>
>>> perhaps try these ones?
>>> -ryan
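
The two checks discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not an HBase tool. It assumes that updates to a region are blocked once its memstore reaches hbase.hregion.memstore.block.multiplier times hbase.hregion.memstore.flush.size, and it assumes a flush size of 64 MB, which should be verified against your own hbase-default.xml. The function names, the server names, and the sample log lines are hypothetical; only the "Blocking updates" substring comes from the thread.

```python
from collections import Counter

# ASSUMPTION: 64 MB flush size (hbase.hregion.memstore.flush.size).
# Verify the real value in your own hbase-default.xml / hbase-site.xml.
ASSUMED_FLUSH_SIZE_BYTES = 64 * 1024 * 1024


def blocking_threshold_bytes(multiplier, flush_size_bytes=ASSUMED_FLUSH_SIZE_BYTES):
    """Memstore size at which a region stops accepting updates.

    multiplier corresponds to hbase.hregion.memstore.block.multiplier.
    """
    return multiplier * flush_size_bytes


def count_blocking_updates(log_lines_by_server):
    """Count log lines containing "Blocking updates" for each regionserver.

    log_lines_by_server maps a server name to an iterable of log lines
    (for example, an open file object for that server's regionserver log).
    """
    counts = Counter()
    for server, lines in log_lines_by_server.items():
        counts[server] = sum(1 for line in lines if "Blocking updates" in line)
    return counts


if __name__ == "__main__":
    # With the multiplier of 8 suggested above and the assumed flush size.
    print(blocking_threshold_bytes(8))

    # Hypothetical log excerpts; real regionserver log lines look different.
    sample_logs = {
        "rs1": ["Starting compaction on region a", "Blocking updates for region a"],
        "rs2": ["Starting compaction on region b"],
        "rs3": [],
    }
    for server, hits in sorted(count_blocking_updates(sample_logs).items()):
        print(server, hits)
```

If only one server shows a nonzero count while the others look idle, that would match the "all clients stuck on one regionserver" pattern described above.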
