hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: TSocket: timed out reading 4 bytes from
Date Sun, 19 Jul 2009 06:35:07 GMT
On Sat, Jul 18, 2009 at 3:54 PM, Hegner, Travis <THegner@trilliumit.com> wrote:

> Thanks for the help stack...
>
> What should I use to study that file? a simple text editor? I seem to
> remember some binary info, but I could be mistaken.



Use HAT https://hat.dev.java.net/ or a commercial tool like jprofiler; the
latter allows you to do tricks you can't with the former.



>
>
> Monday I plan to swap some of the slower/older desktops with some new ones,
> and with any luck i'll be able to average about 1GB per node. I may also
> consider doing a virtual machine (on real hardware) as a dedicated master
> which could get me one more dedicated regionserver/datanode. I may also be
> able to add another couple of desktops.
>

1GB per node is still low.

With machines that small, divide the tasks.  Run MR on different nodes from the
RS and DN.


>
> Please also bear in mind that the OOME doesn't seem to happen every time.
> Do you think my over-running the current hardware configuration could be the
> cause of "getClosestRowBefore()" returning an empty result set?


Was this happening only when the cluster was under load, or throughout the
upload?


>
>
> The only reason I have a hard time believing that is because I have slowed
> my import down to only two rows per second and still seemed to run into the
> same exceptions. Could it be simply because I do have some decent sized rows
> going in?



In your heap dump, I saw some MB-sized cells, IIRC.  Could this be so?




>
>
> I will definitely do some hardware upgrades, as I have got some superiors
> convinced that further testing is a good idea.
>


>
> One last question: is there a way to configure the heap for zookeeper? I
> haven't found one (though I haven't looked very hard either).
>

There is (I'm not sure how, though -- if you can't figure it out, ask again and
one of the lads who knows will respond).

St.Ack



>
> Thanks again,
>
> Travis Hegner
> http://www.travishegner.com/
> ________________________________________
> From: saint.ack@gmail.com [saint.ack@gmail.com] On Behalf Of stack [stack@duboce.net]
> Sent: Thursday, July 16, 2009 1:10 PM
> To: hbase-user@hadoop.apache.org; Hegner, Travis
> Subject: Re: TSocket: timed out reading 4 bytes from
>
> On Tue, Jul 14, 2009 at 12:49 PM, Travis Hegner <thegner@trilliumit.com> wrote:
> Wouldn't you know it, as soon as I hit send on the last email... my
> import failed again with the same exceptions.
>
> After the OOME I mentioned, I started the import again and got the same
> results I've been getting since starting with 0.20. I will publish an hprof
> file from one of my crashed nodes (I'll send an off list link) for you to
> look at. Also remember that I posted a while ago in this thread that my nodes
> have only about 512 MB of RAM, and I have hbase-env.sh capped at 200MB for
> the heap. It's all I could come up with for testing, but if it's not enough,
> I'll see if I can scrape up some more.
>
>
> Looking at your heap dump, it seems that regionserver was carrying 313
> regions.  Each region had only one family, but I see that there were 3410
> store files open.  Your servers are probably being overrun by the upload,
> since this averages out to about 10 files per family when two to three would
> be normal (it's probably much lumpier than ten per Store, and the OOME
> probably happened when you tried to compact a Store that had way more than
> ten store files).
>
> If you only gave hbase 200MB of RAM, then I'd say you're getting pretty
> good usage out of your current hardware.
>
> It seems your dataset is larger than what your hbase cluster can carry.
>
> Can you add more machines?
>
> St.Ack
>
>
>
>
> Attached is the thrift debug log, which shows the exceptions I've been
> referring to. Studying the log, each "Cache hit" you see comes up with each
> chunk of rows (50 in this test) that gets successfully imported. During a
> region split you'll see 2 or 3 "No server address listed" exceptions, until
> the new region is found, and then the import carries on like nothing ever
> happened. When the import times out, you'll actually still see two or so "No
> server address listed" exceptions, and then you start getting a ridiculous
> number of "HRegion was null or empty in .META." exceptions. These new
> exceptions continue to pile in, sleeping between them for the listed amount
> of time per exception. This happens for nearly exactly 15 minutes before the
> exceptions finally quit. Glancing at the "locateRegionInMeta"
> function in the "HConnectionManager" class, it seems that the function
> keeps calling itself with the same parameters, even though there is a loop
> to implement the retries. By doing so it seemingly creates an infinitely
> deep set of nested loops, trying over and over again to find the region
> location it's searching for until some global timeout of 15 minutes
> happens. This is pure speculation, as I'm not really familiar with
> this code.
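
A minimal, self-contained sketch (not the actual HConnectionManager code; every
name here is made up for illustration) of how a retry loop that calls back into
itself on a miss multiplies attempts, matching the "8 of 10, then back to 0 of
10" pattern described in the logs:

    public class NestedRetrySketch {
      static final int RETRIES = 10;
      static int attemptsLogged = 0;

      // Stand-in for a locateRegionInMeta-style lookup: a retry loop whose body,
      // on a miss, re-enters the same method before its own loop has finished.
      static byte[] locateRegion(int depth) {
        for (int tries = 0; tries < RETRIES; tries++) {
          attemptsLogged++;                          // each pass would log "N of 10"
          byte[] hit = scanMeta();                   // stand-in for the .META. lookup
          if (hit != null) return hit;
          if (depth < 3) {                           // recursing restarts the counter,
            byte[] nested = locateRegion(depth + 1); // so the count keeps resetting
            if (nested != null) return nested;
          }
        }
        return null;
      }

      static byte[] scanMeta() { return null; }      // always a miss, to show the blow-up

      public static void main(String[] args) {
        locateRegion(0);
        System.out.println("total attempts: " + attemptsLogged);  // 11110 with these numbers
      }
    }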
>
> That brings me to my next observation which I mis-read in my last email.
> The call to "getClosestRowBefore" does not actually return null, it returns
> an empty data set of its type (if it were null, we'd get a
> "TableNotFoundException"). Since it's an empty data set, the next function
> call, which is to, "regionInfoRow.getValue(CATALOG_FAMILY,
> REGIONINFO_QUALIFIER)", does return null (or 0 length) and results in the
> second exception mentioned above.
>
> I will begin digging into the "getClosestRowBefore" function and try to
> determine why, in my data set, it returns empty results, triggering the fate
> described before.
>
> Thanks for the help, and please let me know if there is anything else I can
> provide to aid in troubleshooting.
>
>
> Travis Hegner
> http://www.travishegner.com/
>
>
> -----Original Message-----
> From: stack <stack@duboce.net>
> To: hbase-user@hadoop.apache.org, "Hegner, Travis" <THegner@trilliumit.com>
> Subject: Re: TSocket: timed out reading 4 bytes from
> Date: Tue, 14 Jul 2009 12:33:44 -0400
>
> On Tue, Jul 14, 2009 at 9:16 AM, Travis Hegner <thegner@trilliumit.com> wrote:
>
> I eventually got this problem narrowed down to the fact that I was using
> non-fixed-length row IDs and not inserting them in order.
>
>
> This should be fine.
>
>
> Under these
> conditions, the hbase java client would throw an exception because in
> HConnectionManager.java, in the function "locateRegionInMeta", the call
> to "getClosestRowBefore()" was returning null.
>
>
> This is not good.  Would seem to point to something else going on.  There
> was an issue in TRUNK a week or so ago that messed up the .META. table, the
> table that getClosestRowBefore runs against.
>
>
> I didn't dig any deeper
> than that, but my assumption was that it was having problems searching
> through the existing keys because of the differing row lengths and their
> being out of order. When ordering the keys by integer (all of my keys were
> integer IDs, BTW), and then adding a fixed-length '0' pad to the left, the
> client seemed to behave much better, at least as far as being able to find
> the proper regions and what-not. (I still received some timeouts, but with
> different exceptions, and I never traced those.)
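
For reference, a minimal sketch of the fixed-width key scheme described above
(the 10-digit pad width is an assumption for the example, not something from
this thread):

    import org.apache.hadoop.hbase.util.Bytes;

    public class PaddedKeySketch {
      // Left-pad the integer id with zeros so the byte-wise (lexical) order
      // HBase keeps rows in matches numeric order; 10 digits covers any
      // non-negative Java int.
      static byte[] rowKey(int id) {
        return Bytes.toBytes(String.format("%010d", id));
      }

      public static void main(String[] args) {
        System.out.println(Bytes.toString(rowKey(21683)));  // prints 0000021683
      }
    }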
>
> Since all this, I have updated to the latest trunk (Revision: 793937),
> and set my keys to be non-fixed-length and unordered (like my
> original tests), and I am getting similar behavior to when they were
> fixed-length and ordered. So, it looks as if something has been fixed
> in the client. This import has been running for a while, and has
> survived almost 10 region splits, so things are looking good. Now, if
> only I can find some more RAM to prevent those pesky OOME's, then I may
> actually be able to complete a full import.
>
>
> If you want to put the generated hprof file someplace that I can pull it,
> I'll take a look.  An OOME when the number of regions is small shouldn't be
> happening.   I'd like to check it out.
>
>
>
> I will keep on with my testing (err.. breaking things if you'd rather),
> and report back any findings.
>
>
> Thanks Travis.
>
>
>
> My only hope is that I am helping to make this thing a better product,
> because so far I think it's awesome, and I can't wait to try and
> convince my superiors that we need it for production, and put it on some
> real hardware.
>
>
>
> Good on you Travis,
> St.Ack
>
>
>
>
> Thanks,
>
> Travis Hegner
> http://www.travishegner.com/
>
>
>
> -----Original Message-----
> From: Travis Hegner <thegner@trilliumit.com>
> Reply-to: hbase-user@hadoop.apache.org, "Hegner, Travis" <THegner@trilliumit.com>
> To: hbase-user@hadoop.apache.org
> Subject: Re: TSocket: timed out reading 4 bytes from
>
>
> Date: Fri, 10 Jul 2009 12:59:15 -0400
>
>
> I've noticed while studying the full logs that the exceptions continue
> after the supposed 10 attempts. It gets to attempt 8 of 10, then it
> starts over again with attempt 0 of 10, and it continues this loop, up to 8
> and back to 0, for about 15 minutes.
>
> Travis Hegner
> http://www.travishegner.com/
>
> -----Original Message-----
> From: Travis Hegner <thegner@trilliumit.com>
> Reply-to: hbase-user@hadoop.apache.org, "Hegner, Travis" <THegner@trilliumit.com>
> To: hbase-user@hadoop.apache.org
> Subject: Re: TSocket: timed out reading 4 bytes from
> Date: Fri, 10 Jul 2009 11:57:34 -0400
>
> After figuring out how to enable debug logging, it seems that my problem
> is with the Hbase java client, or at least thrift's use of it.
>
> Please review the attached logs. The times should be fairly close, so
> you can see how they coordinate. On the client log, the first line shows
> up for each batch put. I only grabbed the last one as an example. Each
> batch put for this test was dumping 20 rows (~100-200k per row). The
> Exceptions that you see are only the first two of the 10 or so retries
> that it does... each retry is exactly the same. After my script times
> out, I can start it again, and get this same sequence of exceptions upon
> the initial attempt to put data.
>
> I made it through about 167 20-row puts before it did a region split and
> crashed with the attached exceptions.
>
> I am happy to provide anything else I can to assist in troubleshooting.
>
> Thanks,
>
> Travis Hegner
> http://www.travishegner.com/
>
>
> -----Original Message-----
> From: Travis Hegner <thegner@trilliumit.com>
> Reply-to: hbase-user@hadoop.apache.org, "Hegner, Travis" <THegner@trilliumit.com>
> To: hbase-user@hadoop.apache.org
> Subject: Re: TSocket: timed out reading 4 bytes from
> Date: Fri, 10 Jul 2009 10:30:30 -0400
>
>
> The overhead and speed isn't a problem. I can deal with the wait as long
> as the import works.
>
> I have tried throttling it down as slow as 2 rows per second with the
> same result (~100-200k per row). I have decreased the size of the rows
> (2-3k). I have even moved over to the "BatchMutate" object, with the
> mutateRows function in php, to do varying amounts of rows per connection
> (tried 100 and 1000), and I still end up with the same results. At some
> random point in time, the thrift server completely stops responding, and
> my client times out. I have moved the thrift server off of the cluster,
> and onto the same, much more powerful, machine that is running the php
> import script. The problem still occurs. About 90% of the time, a simple
> thrift server restart fixes it, but the other 10% has only allowed
> thrift client connections after dropping and re-creating the table. A
> bit more rarely, I'll even have to restart the entire Hbase cluster in
> order to drop the table. I get zero messages in the thrift logs, and only
> an indication from the master's logs that the problem occurs during a
> region split, even though the region splits successfully. The problem
> may or may not be with the actual thrift service, it could be deeper
> than that.
>
> I should also mention that I used the exact same script to connect to a
> single node hbase 0.19.3 machine (a 1GB RAM virtual machine) running
> thrift and the entire import ran without stopping once. In that test I
> imported 131,815 2-3k rows into one table, and several hundred
> thousand 6-byte rows into a second table. That might be apples to
> oranges, but the 0.19 thrift server had no problem responding to every
> request, even through the life of the import (~30 hours).
>
> I realize that my conditions may not be ideal for performance, but at
> this point I simply need it to work, and I can tweak performance later.
>
> Has anyone else had the same/similar problem? Can anyone recommend
> another troubleshooting step?
>
> Thanks,
>
> Travis Hegner
> http://www.travishegner.com/
>
> -----Original Message-----
> From: Jonathan Gray <jlist@streamy.com>
> Reply-to: hbase-user@hadoop.apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: TSocket: timed out reading 4 bytes from
> Date: Thu, 9 Jul 2009 17:29:37 -0400
>
>
> It's not that it must be done from java, it's just that the other
> interfaces add a great deal of overhead and also do not let you do the
> same kind of batching that helps significantly with performance.
>
> If you don't care about the time it takes, then you could stick with
> thrift.  Try to throttle down the speed, or do it in separate batches
> with a break in between.
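
For comparison, a rough sketch of that kind of batching with the native 0.20
Java client (the 'Resumes' table name is taken from later in this thread; the
"data"/"body" family and qualifier, the batch size, and the buffer size are
placeholders, not anything recommended on the list):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchPutSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "Resumes");
        table.setAutoFlush(false);                  // buffer puts on the client side
        table.setWriteBufferSize(2 * 1024 * 1024);  // flush roughly every 2MB

        List<Put> batch = new ArrayList<Put>();
        for (int id = 0; id < 1000; id++) {
          Put p = new Put(Bytes.toBytes(String.valueOf(id)));
          // "data"/"body" are placeholder family/qualifier names
          p.add(Bytes.toBytes("data"), Bytes.toBytes("body"), Bytes.toBytes("row " + id));
          batch.add(p);
          if (batch.size() == 100) {                // hand the client 100 rows at a time
            table.put(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) table.put(batch);
        table.flushCommits();                       // push anything still buffered
      }
    }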
>
> Travis Hegner wrote:
> > I am not extremely java savvy quite yet... is there an alternative way
> > to access Hbase from PHP? I have read about the REST libraries, but
> > haven't tried them yet. Are they sufficient for bulk import? Or, is a
> > bulk import something that simply must be done from java, without
> > exception?
> >
> > Thanks for the help,
> >
> > Travis Hegner
> > http://www.travishegner.com/
> >
> > -----Original Message-----
> > From: Jonathan Gray <jlist@streamy.com>
> > Reply-to: hbase-user@hadoop.apache.org
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: TSocket: timed out reading 4 bytes from
> > Date: Thu, 9 Jul 2009 16:54:22 -0400
> >
> >
> > My recommendation would be to not use thrift for bulk imports.
> >
> > Travis Hegner wrote:
> >> Of course, as luck would have it... I spoke too soon. I am still
> >> suffering from that region split problem, but it doesn't seem to happen
> >> on every region split.
> >>
> >> I do know for sure that with the final split, the new daughter regions
> >> were re-assigned to the original parent's server. It made it through
> >> about 18% (24756 rows) of my roughly 5 GB import before getting that
> >> vague timeout message again. None of my region servers have crashed or
> >> stopped at all, and a simple table count operation still works as
> >> expected. If I immediately restart the script, it times out on the first
> >> row, which already exists. The regionserver logs of the final split
> >> location show no errors or warnings, only successful split and
> >> compaction notifications. I have also moved all of the memstore flush
> >> and similar settings back to default with the trunk install.
> >>
> >> My php script issues the following exception when it times out:
> >>
> >> Fatal error: Uncaught exception 'TException' with message 'TSocket:
> >> timed out reading 4 bytes from hadoop1:9090'
> >> in /home/thegner/Desktop/thrift/lib/php/src/transport/TSocket.php:228
> >> Stack trace:
> >> #0 /home/thegner/Desktop/thrift/lib/php/src/transport/TBufferedTransport.php(109): TSocket->readAll(4)
> >> #1 /home/thegner/Desktop/thrift/lib/php/src/protocol/TBinaryProtocol.php(300): TBufferedTransport->readAll(4)
> >> #2 /home/thegner/Desktop/thrift/lib/php/src/protocol/TBinaryProtocol.php(192): TBinaryProtocol->readI32(NULL)
> >> #3 /home/thegner/Desktop/thrift/lib/php/src/packages/Hbase/Hbase.php(1017): TBinaryProtocol->readMessageBegin(NULL, 0, 0)
> >> #4 /home/thegner/Desktop/thrift/lib/php/src/packages/Hbase/Hbase.php(984): HbaseClient->recv_mutateRow()
> >> #5 /home/thegner/Desktop/hbase_php/rtools-hbase.php(64): HbaseClient->mutateRow('Resumes', '21683', Array)
> >> #6 {main}
> >>   thrown in /home/thegner/Desktop/thrift/lib/php/src/transport/TSocket.php on line 228
> >>
> >> After stopping and restarting only the thrift server, it seems to be
> >> working again, so I suppose that is where we start looking.
> >> I should mention that my thrift client has both timeouts set to 20000
> >> ms, but I have had them set as high as 300000 and still had the same
> >> problem.
> >>
> >> The tutorial I followed to get the thrift client up and running was
> >> perhaps a little dated, so I will make sure my thrift client code is up
> >> to date.
> >>
> >> Any other suggestions?
> >>
> >> Travis Hegner
> >> http://www.travishegner.com/
> >>
> >> -----Original Message-----
> >> From: Travis Hegner <thegner@trilliumit.com>
> >> Reply-to: hbase-user@hadoop.apache.org, "Hegner, Travis" <THegner@trilliumit.com>
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Re: TSocket: timed out reading 4 bytes from
> >> Date: Thu, 9 Jul 2009 15:47:56 -0400
> >>
> >>
> >> Hi Again,
> >>
> >> Since the tests mentioned below, I have finally figured out how to build
> >> and run from the trunk. I have re-created my hbase install from svn,
> >> configured it, updated my thrift client library, and my current import
> >> has been through more than 5 region splits without failing.
> >>
> >> Next step: writing my first map-reduce jobs, then utilizing hbase as an
> >> input and output for those...
> >>
> >> Any recommended tutorials for that?
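
One bare-bones sketch of the 0.20 mapreduce-over-HBase wiring (a row counter
over the 'Resumes' table from this thread); worth checking against the
RowCounter example that ships with HBase, since the exact signatures here are
from memory:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class RowCountSketch {
      // The framework hands the mapper one table row at a time; it just bumps a counter.
      static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
        protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
          ctx.getCounter("sketch", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "row-count-sketch");
        job.setJarByClass(RowCountSketch.class);
        TableMapReduceUtil.initTableMapperJob("Resumes", new Scan(),
            CountMapper.class, ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);  // nothing written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }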
> >>
> >> Thanks again,
> >>
> >> Travis Hegner
> >> http://www.travishegner.com/
> >>
> >> -----Original Message-----
> >> From: Hegner, Travis <THegner@trilliumit.com>
> >> Reply-to: hbase-user@hadoop.apache.org
> >> To: hbase-user@hadoop.apache.org
> >> Subject: TSocket: timed out reading 4 bytes from
> >> Date: Thu, 9 Jul 2009 10:17:15 -0400
> >>
> >>
> >> Hi All,
> >>
> >> I am testing 0.20.0-alpha, r785472 and am coming up with an issue I
> can't seem to figure out. I am accessing hbase from php via thrift. The php
> script is pulling data from our pgsql server and dumping it into hbase.
> Hbase is running on a 6 node hadoop cluster (0.20.0-plus4681, r767961) with
> truly "commodity" nodes (3.0Ghz P4 HT Desktops, 512MB RAM each). My symptoms
> have seemed mostly sporadic, but I have finally got it to a point where it
> errors somewhat consistently. Due to my lack of RAM per node, I have dropped
> the heap for both hadoop and hbase to about 200 MB each, and I have dropped
> the memcache.flush.size. I also dropped some of the other things regarding
> hstore file sizes, and the compaction threshold, trying to troubleshoot this
> problem.
> >>
> >> It seems that after I begin my import, everything works pretty well
> until a region splits, which happens at roughly 1% of about a 5 GB import
> (I currently have my memstore flush at 16MB for troubleshooting). Once the
> region splits, my import times out with the 'TSocket: timed out reading 4
> bytes from' error. I've even set my import script to catch the exception,
> sleep 60 seconds, disconnect and reconnect, and try the import again and it
> still times out. If I immediately try running the script again, it will
> sometimes get through the first few, but usually will hit the same time out
> almost immediately, even though the current row already exists, and should
> be overwriting it in an existing region (Only one version per cell). I have
> tried restarting only the thrift service with the same results. Typically
> once I receive the error, I can't get a decent import started without
> restarting all of hbase, truncating the table, and starting over from
> scratch, only to have the same thing happen at the next region split.
> >> Initially, before I changed a lot of the sizes, it seemed I could get
> much further into the import (as much as 60%) before it would time out, but
> that was only importing partial data (about 700 MB total), so I'm not sure
> if the regions were ever splitting with those tests (I wasn't watching for
> it yet).
> >>
> >> With all that being said, it definitely seems to be consistently
> happening exactly when a region splits, and I've found no errors in the logs
> indicating a problem with the region splitting; it typically seems OK and
> finishes compacting and everything before the script even times out. Yet it
> still times out even though I can scan and count the table without issue.
> >>
> >> Any input or info is greatly appreciated.
> >>
> >> Thanks,
> >>
> >> Travis Hegner
> >> http://www.travishegner.com/
> >>
> >
>
>
>
> ________________________________
> The information contained in this communication is confidential and is
> intended only for the use of the named recipient. Unauthorized use,
> disclosure, or copying is strictly prohibited and may be unlawful. If you
> have received this communication in error, you should know that you are
> bound to confidentiality, and should please immediately notify the sender or
> our IT Department at 866.459.4599.
>
>
