hbase-dev mailing list archives

From "Jonathan Gray (JIRA)" <j...@apache.org>
Subject [jira] Reopened: (HBASE-1177) Delay when client is located on the same node as the regionserver
Date Fri, 15 May 2009 22:35:45 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jonathan Gray reopened HBASE-1177:

Jim, I don't believe this issue should be closed.  It may not have anything to do with our
code, but it affects us in a very significant way so we need to get to the bottom of it. 
This is an exploratory issue, hopefully with a solution, but we're not done with it yet.

Your conclusion is not correct.  You are writing off the delay as context switching that occurs
when the client is on the same machine.  First of all, those context switches are orders of
magnitude below the timings of these queries.  The queries in question run 40 times slower (the
extra delay seems to be about 40ms, cause unknown) when the client runs local to the hosting
node.  This amount of time is clearly not explainable by the additional context switching of
having these two things running concurrently.

But more importantly, this explanation does not address what we're seeing.  Given what you
say above, we should be seeing ALL queries running slower by some fixed factor when running
on the same node.  But we don't.  There is a very specific and definable range of payload
sizes for which this extra delay of ~40ms exists.  The 7 column case and the 1000 column
case both perform nearly identically in both situations, so the effect of the context switching
is negligible.
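
One way to make the "specific range of payload sizes" argument concrete is to compare per-payload
latencies for a local and a remote client and list only the sizes where the local run is slower
by more than some threshold.  The following is a minimal sketch, not the attached
ReadDelayTest.java; the class name, the method name, the threshold, and all numbers in main()
are hypothetical placeholders, not measurements from this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DelayRangeCheck {
    /**
     * Returns the payload sizes (ascending) at which the local-client latency
     * exceeds the remote-client latency by more than thresholdMs.
     * Payload sizes missing from remoteMs are skipped.
     */
    static List<Integer> delayedPayloads(Map<Integer, Double> localMs,
                                         Map<Integer, Double> remoteMs,
                                         double thresholdMs) {
        List<Integer> delayed = new ArrayList<>();
        // TreeMap copy gives a deterministic, ascending iteration order.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(localMs).entrySet()) {
            Double remote = remoteMs.get(e.getKey());
            if (remote != null && e.getValue() - remote > thresholdMs) {
                delayed.add(e.getKey());
            }
        }
        return delayed;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, NOT data from HBASE-1177.
        // Keys are payload sizes in bytes, values are latencies in milliseconds.
        Map<Integer, Double> local = new TreeMap<>();
        Map<Integer, Double> remote = new TreeMap<>();
        local.put(100, 1.0);     remote.put(100, 1.2);
        local.put(5000, 41.0);   remote.put(5000, 1.5);
        local.put(100000, 9.0);  remote.put(100000, 9.5);
        System.out.println(delayedPayloads(local, remote, 20.0));
    }
}
```

If context switching were the cause, every payload size would be expected to show up in the
result at a low enough threshold; a result limited to a middle band of sizes points elsewhere.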

Have you done network-level debugging?  We need to figure out where in the chain the delay
is introduced and go from there.

There could be an issue in Linux, RPC, who knows... but we should keep digging whether or
not we figure this out for 0.20.

> Delay when client is located on the same node as the regionserver
> -----------------------------------------------------------------
>                 Key: HBASE-1177
>                 URL: https://issues.apache.org/jira/browse/HBASE-1177
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>         Environment: Linux 2.6.25 x86_64
>            Reporter: Jonathan Gray
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.20.0
>         Attachments: ReadDelayTest.java
>
> During testing of HBASE-80, we uncovered a strange 40ms delay for random reads.  We ran
> a series of tests and found that it only happens when the client is on the same node as the
> RS and for a certain range of payloads (not specifically related to number of columns or size
> of them, only total payload).  It appears to be precisely 40ms every time.
> Unsure if this is particular to our architecture, but it does happen on all nodes we've
> tried.  Issue completely goes away with very large payloads or moving the client.
> Will post a test program tomorrow if anyone can test on a different architecture.
> Making a blocker for 0.20.  Since this happens when you have an MR task running local
> to the RS, and this is what we try to do, might also consider making this a blocker for 0.19.1.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
