hbase-user mailing list archives

From Travis Hegner <theg...@trilliumit.com>
Subject Re: TSocket: timed out reading 4 bytes from
Date Fri, 10 Jul 2009 15:57:34 GMT
After figuring out how to enable debug logging, it seems that my problem
is with the HBase Java client, or at least Thrift's use of it.
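(In case it helps anyone else: in my build, enabling the debug output
meant something like log4j.logger.org.apache.hadoop.hbase=DEBUG in
conf/log4j.properties.)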

Please review the attached logs. The timestamps should be fairly close, so
you can see how the two sides correlate. On the client log, the first line
shows up for each batch put; I only grabbed the last one as an example. Each
batch put for this test was dumping 20 rows (~100-200k per row). The
exceptions that you see are only the first two of the 10 or so retries that
it does... each retry is exactly the same. After my script times out, I can
start it again and get this same sequence of exceptions on the initial
attempt to put data.
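
For context, each batch put is issued roughly like this on my side (a
trimmed sketch of my PHP; the column name is hypothetical, but the calls
match the stack traces quoted further down):

  // One "batch put": 20 rows, one mutateRow() call per row.
  foreach ($batch as $rowKey => $blob) {           // $batch holds the 20 rows
      $mutations = array(new Mutation(array(
          'column' => 'data:resume',               // hypothetical column name
          'value'  => $blob,                       // ~100-200k per row
      )));
      $client->mutateRow('Resumes', (string)$rowKey, $mutations);
  }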

I made it through about 167 of the 20-row puts before a region split, at
which point the import crashed with the attached exceptions.

I am happy to provide anything else I can to assist in troubleshooting.

Thanks,

Travis Hegner
http://www.travishegner.com/


-----Original Message-----
From: Travis Hegner <thegner@trilliumit.com>
Reply-to: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>,
"Hegner, Travis" <THegner@trilliumit.com>
To: hbase-user@hadoop.apache.org <hbase-user@hadoop.apache.org>
Subject: Re: TSocket: timed out reading 4 bytes from
Date: Fri, 10 Jul 2009 10:30:30 -0400


The overhead and speed aren't a problem. I can deal with the wait as long
as the import works.

I have tried throttling it down as slow as 2 rows per second with the
same result (~100-200k per row). I have decreased the size of the rows
(2-3k). I have even moved over to the "BatchMutation" object, with the
mutateRows function in PHP (sketched below), to do varying numbers of
rows per connection (tried 100 and 1000), and I still end up with the
same results. At some random point in time, the Thrift server completely
stops responding, and my client times out. I have moved the Thrift
server off of the cluster and onto the same, much more powerful machine
that is running the PHP import script. The problem still occurs. About
90% of the time, a simple Thrift server restart fixes it, but the other
10% of the time I can only get Thrift client connections again after
dropping and re-creating the table. A bit more rarely, I'll even have to
restart the entire HBase cluster in order to drop the table. I get zero
messages in the Thrift logs, and only an indication from the master's
logs that the problem occurs during a region split, even though the
region splits successfully. The problem may or may not be with the
actual Thrift service; it could be deeper than that.
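
The BatchMutation variant looks roughly like this (again a trimmed
sketch; the column name is hypothetical, and the sleep is the crude
throttle I varied between tests):

  // One connection's worth of rows (tried 100 and 1000) in a single
  // mutateRows() call, with a pause between batches.
  $rowBatches = array();
  foreach ($rows as $rowKey => $blob) {
      $rowBatches[] = new BatchMutation(array(
          'row'       => (string)$rowKey,
          'mutations' => array(new Mutation(array(
              'column' => 'data:resume',           // hypothetical column name
              'value'  => $blob,
          ))),
      ));
  }
  $client->mutateRows('Resumes', $rowBatches);
  sleep(1);                                        // break between batches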

I should also mention that I used the exact same script to connect to a
single-node HBase 0.19.3 machine (a 1 GB RAM virtual machine) running
Thrift, and the entire import ran without stopping once. In that test I
imported 131,815 2-3k rows into one table, and several hundred thousand
6-byte rows into a second table. That might be apples to oranges, but
the 0.19 Thrift server had no problem responding to every request
through the entire life of the import (~30 hours).

I realize that my conditions may not be ideal for performance, but at
this point I simply need it to work, and I can tweak performance later.

Has anyone else had the same/similar problem? Can anyone recommend
another troubleshooting step?

Thanks,

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: Jonathan Gray <jlist@streamy.com>
Reply-to: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
To: hbase-user@hadoop.apache.org <hbase-user@hadoop.apache.org>
Subject: Re: TSocket: timed out reading 4 bytes from
Date: Thu, 9 Jul 2009 17:29:37 -0400


It's not that it must be done from Java; it's just that the other
interfaces add a great deal of overhead and also do not let you do the
same kind of batching that helps significantly with performance.

If you don't care about the time it takes, then you could stick with
Thrift. Try to throttle down the speed, or do it in separate batches
with a break in between.

Travis Hegner wrote:
> I am not extremely Java-savvy quite yet... is there an alternative way
> to access HBase from PHP? I have read about the REST libraries, but
> haven't tried them yet. Are they sufficient for bulk import? Or is a
> bulk import something that simply must be done from Java, without
> exception?
> 
> Thanks for the help,
> 
> Travis Hegner
> http://www.travishegner.com/
> 
> -----Original Message-----
> From: Jonathan Gray <jlist@streamy.com>
> Reply-to: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
> To: hbase-user@hadoop.apache.org <hbase-user@hadoop.apache.org>
> Subject: Re: TSocket: timed out reading 4 bytes from
> Date: Thu, 9 Jul 2009 16:54:22 -0400
> 
> 
> My recommendation would be to not use Thrift for bulk imports.
> 
> Travis Hegner wrote:
>> Of course, as luck would have it... I spoke too soon. I am still
>> suffering from that region split problem, but it doesn't seem to happen
>> on every region split.
>>
>> I do know for sure that with the final split, the new daughter regions
>> were re-assigned to the original parent's server. It made it through
>> about 18% (24756 rows) of my roughly 5 GB import before getting that
>> vague timeout message again. None of my region servers have crashed or
>> stopped at all, and a simple table count operation still works as
>> expected. If I immediately restart the script, it times out on the first
>> row, which already exists. The regionserver logs of the final split
>> location show no errors or warnings, only successful split and
>> compaction notifications. I have also moved all of the memstore flush
>> and similar settings back to default with the trunk install.
>>
>> My PHP script throws the following exception when it times out:
>>
>> Fatal error: Uncaught exception 'TException' with message 'TSocket:
>> timed out reading 4 bytes from hadoop1:9090'
>> in /home/thegner/Desktop/thrift/lib/php/src/transport/TSocket.php:228
>> Stack trace:
>> #0 /home/thegner/Desktop/thrift/lib/php/src/transport/TBufferedTransport.php(109): TSocket->readAll(4)
>> #1 /home/thegner/Desktop/thrift/lib/php/src/protocol/TBinaryProtocol.php(300): TBufferedTransport->readAll(4)
>> #2 /home/thegner/Desktop/thrift/lib/php/src/protocol/TBinaryProtocol.php(192): TBinaryProtocol->readI32(NULL)
>> #3 /home/thegner/Desktop/thrift/lib/php/src/packages/Hbase/Hbase.php(1017): TBinaryProtocol->readMessageBegin(NULL, 0, 0)
>> #4 /home/thegner/Desktop/thrift/lib/php/src/packages/Hbase/Hbase.php(984): HbaseClient->recv_mutateRow()
>> #5 /home/thegner/Desktop/hbase_php/rtools-hbase.php(64): HbaseClient->mutateRow('Resumes', '21683', Array)
>> #6 {main}
>>   thrown in /home/thegner/Desktop/thrift/lib/php/src/transport/TSocket.php on line 228
>>
>> After stopping and restarting only the Thrift server, it seems to be
>> working again, so I suppose that is where we start looking.
>> I should mention that my Thrift client has both timeouts set to 20000
>> ms, but I have had them set as high as 300000 and still hit the same
>> problem.
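>>
>> For reference, the client side is wired up roughly like this (a sketch
>> of my setup code; the paths match the stack trace above, and only the
>> timeout values changed between tests):
>>
>>   $GLOBALS['THRIFT_ROOT'] = '/home/thegner/Desktop/thrift/lib/php/src';
>>   require_once $GLOBALS['THRIFT_ROOT'].'/Thrift.php';
>>   require_once $GLOBALS['THRIFT_ROOT'].'/transport/TSocket.php';
>>   require_once $GLOBALS['THRIFT_ROOT'].'/transport/TBufferedTransport.php';
>>   require_once $GLOBALS['THRIFT_ROOT'].'/protocol/TBinaryProtocol.php';
>>   require_once $GLOBALS['THRIFT_ROOT'].'/packages/Hbase/Hbase.php';
>>
>>   $socket = new TSocket('hadoop1', 9090);
>>   $socket->setSendTimeout(20000);   // ms; tried as high as 300000
>>   $socket->setRecvTimeout(20000);   // ms; same result
>>   $transport = new TBufferedTransport($socket);
>>   $protocol  = new TBinaryProtocol($transport);
>>   $client    = new HbaseClient($protocol, $protocol);
>>   $transport->open();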
>>
>> The tutorial I followed to get the thrift client up and running was
>> perhaps a little dated, so I will make sure my thrift client code is up
>> to date.
>>
>> Any other suggestions?
>>
>> Travis Hegner
>> http://www.travishegner.com/ 
>>
>> -----Original Message-----
>> From: Travis Hegner <thegner@trilliumit.com>
>> Reply-to: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>,
>> "Hegner, Travis" <THegner@trilliumit.com>
>> To: hbase-user@hadoop.apache.org <hbase-user@hadoop.apache.org>
>> Subject: Re: TSocket: timed out reading 4 bytes from
>> Date: Thu, 9 Jul 2009 15:47:56 -0400
>>
>>
>> Hi Again,
>>
>> Since the tests mentioned below, I have finally figured out how to build
>> and run from the trunk. I have re-created my HBase install from SVN,
>> configured it, updated my Thrift client library, and my current import
>> has been through more than 5 region splits without failing.
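>>
>> For anyone following along, building from trunk boiled down to roughly
>> this (from memory, so the exact repo path and targets may differ):
>>
>>   svn checkout http://svn.apache.org/repos/asf/hadoop/hbase/trunk hbase-trunk
>>   cd hbase-trunk
>>   ant package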
>>
>> Next step, writing my first MapReduce jobs, then utilizing HBase as an
>> input and output for those...
>>
>> Any recommended tutorials for that?
>>
>> Thanks again,
>>
>> Travis Hegner
>> http://www.travishegner.com/
>>
>> -----Original Message-----
>> From: Hegner, Travis <THegner@trilliumit.com>
>> Reply-to: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
>> To: hbase-user@hadoop.apache.org <hbase-user@hadoop.apache.org>
>> Subject: TSocket: timed out reading 4 bytes from
>> Date: Thu, 9 Jul 2009 10:17:15 -0400
>>
>>
>> Hi All,
>>
>> I am testing 0.20.0-alpha, r785472 and am coming up with an issue I
>> can't seem to figure out. I am accessing HBase from PHP via Thrift. The
>> PHP script is pulling data from our pgsql server and dumping it into
>> HBase. HBase is running on a 6-node Hadoop cluster (0.20.0-plus4681,
>> r767961) with truly "commodity" nodes (3.0 GHz P4 HT desktops, 512 MB
>> RAM each). My symptoms have seemed mostly sporadic, but I have finally
>> got it to a point where it errors somewhat consistently. Due to my lack
>> of RAM per node, I have dropped the heap for both Hadoop and HBase to
>> about 200 MB each, and I have dropped the memcache.flush.size. I also
>> dropped some of the other settings regarding HStore file sizes, and the
>> compaction threshold, trying to troubleshoot this problem.
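>>
>> In case it helps, the tuning above amounts to roughly this (a sketch
>> from memory; double-check the property names against hbase-default.xml
>> for this build):
>>
>>   # conf/hbase-env.sh
>>   export HBASE_HEAPSIZE=200
>>
>>   <!-- conf/hbase-site.xml -->
>>   <property>
>>     <name>hbase.hregion.memstore.flush.size</name>
>>     <value>16777216</value> <!-- 16 MB; "memcache" in older builds -->
>>   </property>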
>>
>> It seems that after I begin my import, everything works pretty well
>> until a region splits, which happens at roughly 1% of about a 5 GB
>> import (I currently have my memstore flush at 16 MB for
>> troubleshooting). Once the region splits, my import times out with the
>> 'TSocket: timed out reading 4 bytes from' error. I've even set my
>> import script to catch the exception, sleep 60 seconds, disconnect and
>> reconnect, and try the import again, and it still times out. If I
>> immediately try running the script again, it will sometimes get through
>> the first few rows, but usually will hit the same timeout almost
>> immediately, even though the current row already exists and should
>> simply be overwritten in an existing region (only one version per
>> cell). I have tried restarting only the Thrift service, with the same
>> results. Typically, once I receive the error, I can't get a decent
>> import started without restarting all of HBase, truncating the table,
>> and starting over from scratch, only to have the same thing happen at
>> the next region split.
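>>
>> The catch-and-retry wrapper I describe is roughly this (a trimmed
>> sketch of the PHP; the retried call still times out):
>>
>>   try {
>>       $client->mutateRow('Resumes', $rowKey, $mutations);
>>   } catch (TException $e) {
>>       sleep(60);                // back off
>>       $transport->close();      // disconnect entirely...
>>       $transport->open();       // ...then reconnect to the thrift server
>>       $client->mutateRow('Resumes', $rowKey, $mutations);  // still times out
>>   }
>>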
>> Initially, before I changed a lot of the sizes, it seemed I could get
>> much further into the import (as much as 60%) before it would time out,
>> but that was only importing partial data (about 700 MB total), so I'm
>> not sure if the regions were ever splitting with those tests (I wasn't
>> watching for it yet).
>>
>> With all that being said, it definitely seems to be consistently
>> happening exactly when a region splits, and I've found no errors in the
>> logs indicating a problem with the region splitting; it typically seems
>> OK, and finishes compacting and everything before the script even times
>> out. Yet it still times out, even though I can scan and count the table
>> without issue.
>>
>> Any input or info is greatly appreciated.
>>
>> Thanks,
>>
>> Travis Hegner
>> http://www.travishegner.com/
>>
> 
