hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Table goes offline - temporary outage + Retries Exhausted (related?)
Date Thu, 29 Jul 2010 23:36:16 GMT
There is some root cause behind the 'failed to flush' message... I'd like to get to the root of that. Unfortunately it means lots of log groveling. If you want to post logs, try pastebin.com instead of trying to attach files.

Dig some dirt up and let's check it out :-)
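As an aside, the "flush after every upload, sleep on IOException, bounded retries" approach described further down the thread boils down to a small retry loop. A rough sketch in plain Java - the names (withRetry, maxTries) are made up for illustration, and this is not HBase client API:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative only: a bounded retry loop with a sleep between attempts,
// i.e. the shape of "flush after each upload, back off on IOException".
public class RetrySketch {
    // Run op, retrying up to maxTries times, sleeping sleepMs after each
    // IOException. Rethrows the last IOException if every attempt fails.
    static <T> T withRetry(Callable<T> op, int maxTries, long sleepMs) throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;              // remember the failure
                Thread.sleep(sleepMs); // back off before the next try
            }
        }
        throw last; // all maxTries attempts failed
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky "put + flush": fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("flush failed");
            return "ok";
        }, 10, 1);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the real client the body of op would be the put plus the flush call, with the sleep tuned to taste (Stuart uses 3 seconds).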

-ryan

On Thu, Jul 29, 2010 at 4:25 PM, Stuart Smith <stu24mail@yahoo.com> wrote:
> Hello Ryan,
>
>  Thanks!
>
> Just to verify - my xceiver count is 4K, my ulimit reports 64000, my datanode handler count is 15, my socket write timeout is zero, my swappiness is 1 on the datanodes and 0 on the namenode, and my memory has been tweaked according to the machines - hadoop and hbase both get 3GB on the 8GB-RAM datanodes, leaving 2GB free. The namenode has 16GB and is split 6GB/6GB.
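(For anyone following along: in stock Hadoop 0.20 those datanode settings live in hdfs-site.xml - and the property name really is spelled "xcievers" - while ulimit is usually raised in /etc/security/limits.conf and swappiness via sysctl. The values below are just the ones reported above, mapped to their property names, not recommendations:)

```xml
<!-- hdfs-site.xml: values as reported in this thread -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>15</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
```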
>
> After my last round of issues I went through the FAQ & a bunch of blogs - of which some were yours, I think - so thanks again :)
>
> I get
>
> Warning: failed to flush data to sample store: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=9, listsize=13, region=filestore,be4c6d071635b80ac649b7900167f6ddd7cc2dca3578ce8bc24fca523930e81c,1279956247376 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be113824be800baddf62c27ac9cf12a57955a3582d7d8f53541017416cf18ed1', but failed after 10 attempts.
>
>  on about 20 files out of 6000 - so right now I just redo the batch and skip existing entries, which works for now.
>
> What I think I need to do is come up with a nice set of Java snippets that illustrate my code, and re-post. But that might not happen right away.
>
> My app is this multi-threaded thingy that has thread pools with threads that have thread pools; it FTPs in archives, extracts them, checks for dupes, does other stuff, and uploads files. Which is one reason I think it might be a client-side thing ~ but I did wrap my puts with synchronized( table ) {} ;)
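(Side note: HTable isn't safe for concurrent use, so serializing puts with synchronized(table) is the right instinct when many threads share one instance. Reduced to plain Java, with an ArrayList standing in for the shared non-thread-safe table - purely illustrative, not HBase API:)

```java
import java.util.ArrayList;
import java.util.List;

// The synchronized(table) pattern: many writer threads, one shared
// non-thread-safe object, every mutation guarded by the same lock.
public class SyncPutSketch {
    static List<String> run(int nThreads, int putsPerThread) throws InterruptedException {
        final List<String> table = new ArrayList<>(); // stand-in for a shared HTable
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < putsPerThread; i++) {
                    synchronized (table) {   // serialize each "put"
                        table.add(id + ":" + i);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return table;
    }

    public static void main(String[] args) throws InterruptedException {
        // With the lock held per put, no updates are lost:
        // 4 threads x 1000 puts = 4000 rows.
        System.out.println(run(4, 1000).size());
    }
}
```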
>
> And, yes, for all the tweaking I've had to do on HBase ~ it sure beats the time I needed to alter an innodb table with about 800 million rows of blobs & stuff... took about a week.
>
> Take care,
>  -stu
>
>
>
> --- On Thu, 7/29/10, Ryan Rawson <ryanobjc@gmail.com> wrote:
>
>> From: Ryan Rawson <ryanobjc@gmail.com>
>> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
>> To: user@hbase.apache.org
>> Date: Thursday, July 29, 2010, 6:40 PM
>> Hi,
>>
>> There is a lot going on in this email. The logs might look promising, but they are standard split messages, not really indicative of anything going wrong.
>>
>> It sounds like you might be coming across some of the standard foils that are well documented here:
>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#overview_description
>>
>> Perhaps you could confirm you have things like the xceiver count and ulimits set? I personally use this on all my clusters; maybe you can try it again:
>> <property>
>> <name>dfs.datanode.socket.write.timeout</name>
>> <value>0</value>
>> </property>
>>
>> Lastly, I don't think that Put should be unreliable. I have reliably imported 10s of billions of rows, so there is something else going on.
>>
>> -ryan
>> PS: mysql DBAs spend tons of time setting up ulimits and other esoteric kernel tuning parameters; our requirement is actually surprisingly low in that regard.
>>
>> On Thu, Jul 29, 2010 at 3:02 PM, Stuart Smith <stu24mail@yahoo.com> wrote:
>> > Hello all,
>> >
>> > It looks like I had an ensemble of unrelated errors.
>> >
>> > To follow up with the table going offline error:
>> >
>> > I noticed today that the gui will say "Enabled: False", and the shell will say:
>> >
>> > hbase(main):004:0> describe 'filestore'
>> > DESCRIPTION                                                    ENABLED
>> >  {NAME => 'filestore', FAMILIES => [{NAME => 'content',        false
>> >  COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
>> >  BLOCKSIZE => '65536', IN_
>> >
>> > Soo... I'm not sure which is which - maybe it was never disabled, depending on whether the gui or the shell is correct. It appears to be the shell, since I've been uploading more data, and it's going through fine now.
>> >
>> > I'm guessing yesterday's uploads were failing due to the batch issues, the gui reported the table as disabled, and I connected the two issues incorrectly.
>> >
>> > Take care,
>> >  -stu
>> >
>> > --- On Thu, 7/29/10, Stuart Smith <stu24mail@yahoo.com> wrote:
>> >
>> >> From: Stuart Smith <stu24mail@yahoo.com>
>> >> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
>> >> To: user@hbase.apache.org
>> >> Date: Thursday, July 29, 2010, 3:19 PM
>> >> To follow up on the retry error (still have no idea about the table going offline):
>> >>
>> >> It was a coding error, sorta kinda.
>> >>
>> >> I was doing large batches with AutoFlush disabled and flushing at the end, figuring I could gain performance and just reprocess bad batches.
>> >>
>> >> Bad call.
>> >>
>> >> It appears I was consistently getting errors on flush, so the batch just kept failing. Now I flush after every successful file upload, and only one or two out of a couple thousand fail - and not consistently on one file - so retries are possible.
>> >>
>> >> I also added a 3 second sleep when I get some kind of IOException executing a Put on this particular table, to prevent some sort of cascade effect.
>> >>
>> >> That part is going pretty smoothly now.
>> >>
>> >> Still don't know about the offline table thing - crossing my fingers and watching closely for now (and adding nodes).
>> >>
>> >> I guess the moral of the first lesson is to really treat the Puts() as somewhat unreliable?
>> >>
>> >> Take care,
>> >>   -stu
>> >>
>> >>
>> >>
>> >> --- On Thu, 7/29/10, Stuart Smith <stu24mail@yahoo.com> wrote:
>> >>
>> >> > From: Stuart Smith <stu24mail@yahoo.com>
>> >> > Subject: Table goes offline - temporary outage + Retries Exhausted (related?)
>> >> > To: user@hbase.apache.org
>> >> > Date: Thursday, July 29, 2010, 2:09 PM
>> >> > Hello,
>> >> >    I have two problems that may or may not be related.
>> >> >
>> >> > One is trying to figure out a self-correcting outage I had last evening.
>> >> >
>> >> > I noticed issues starting with clients reporting:
>> >> >
>> >> > RetriesExhaustedException: Trying to contact region server Some server...
>> >> >
>> >> > I didn't see much going on in the regionserver logs, except for some major compactions. Eventually I decided to check the status of the table being written to, and it was disabled - and not by me (AFAIK).
>> >> >
>> >> > I tried enabling the table via the hbase shell... and it was taking a long time, so I left for the evening. I came back this morning, and the shell had reported:
>> >> >
>> >> > hbase(main):002:0> enable 'filestore'
>> >> > NativeException: java.io.IOException: Unable to enable table filestore
>> >> >
>> >> > Except by now, the table was back up!
>> >> >
>> >> > After going through the logs a little more closely, the only thing I can find that seems correlated (at least by the timing):
>> >> >
>> >> > (in the namenode logs)
>> >> >
>> >> > 2010-07-28 18:39:17,213 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873: Daughters; filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232, filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232 from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
>> >> >
>> >> > ...
>> >> >
>> >> > 2010-07-28 18:42:45,835 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191 no longer has references to filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
>> >> > 2010-07-28 18:42:45,842 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 (encoded=1245524105) because daughter splits no longer hold references
>> >> > ...
>> >> > 2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing unserved regions
>> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Skipping region REGION => {NAME => 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169', STARTKEY => '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY => '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED => 1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because it is offline and split
>> >> > ...
>> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing regions currently being served
>> >> > 2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Already online
>> >> >
>> >> > ...
>> >> > 2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0 dead, average load 1060.0
>> >> > 2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>}
>> >> > 2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
>> >> > 2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173 no longer has references to filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
>> >> > ...
>> >> > I'm not really sure, but I saw these messages toward the end:
>> >> > ...
>> >> > 2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061 no longer has references to filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
>> >> > 2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326 (encoded=597566178) because daughter splits no longer hold references
>> >> > 2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
>> >> > ...
>> >> > Which may correspond to the time when it was recovering (if so, I just missed it coming back online).
>> >> > ...
>> >> >
>> >> > As a final note, I re-ran some of the clients today, and it appears some are OK, and some consistently give:
>> >> >
>> >> > Error: io exception when loading file: /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
>> >> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=3, listsize=7, region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but failed after 10 attempts
>> >> >
>> >> > So while the above is the error that brought the offline table to my attention - it may just be a separate bug?
>> >> >
>> >> > Not sure what causes it, but since it happens consistently in a program being run with one set of arguments, but not another, I'm thinking it's an error on my part.
>> >> >
>> >> > Any ideas on what could cause the table to go offline? Any common mistakes that lead to RetriesExhausted errors?
>> >> >
>> >> > The Retry errors occurred in a shared method that uploads a file to the filestore, so I'm not sure what causes it to fail in one case, but not another. Maybe just the size of the file? (@300K).
>> >> >
>> >> > Thanks!
>> >> >
>> >> > Take care,
>> >> >   -stu
