hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-2180) read performance from synchronizing hfile.fddatainputstream
Date Fri, 05 Feb 2010 07:44:27 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2180:

    Attachment: 2180-v2.patch

This patch updates the tests to use the new getScanner method and includes a small PE fix
for when --rows is small (we would NPE).  I might need a v3: one test, TestGetDeleteTracker,
is failing and I need to investigate.

In testing on a setup that tries to resemble the Yahoo paper's testing (~20M rows per server,
116 regions on a RS, and only one replica), this patch seems to double the throughput with ~20
concurrent clients on a RS.  I also tested scans: scan speeds with this patch in place are
what they were before.  They have not deteriorated.

One thing I noticed: when scanning data that is not local (i.e. the data is in a DN on
another machine), there is definitely added latency, and the test takes maybe 25% longer
to complete.  I need to see if the same is true of random reads.  Cosmin suggested that the
Yahoo test, with its single replica, might be doing lots of remote access and so could be
incurring the extra latency.

> read performance from synchronizing hfile.fddatainputstream
> -----------------------------------------------------------
>                 Key: HBASE-2180
>                 URL: https://issues.apache.org/jira/browse/HBASE-2180
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.21.0
>         Attachments: 2180-v2.patch, 2180.patch
> deep in the HFile read path, there is this code:
>     synchronized (in) {
>       in.seek(pos);
>       ret = in.read(b, off, n);
>     }
> This means only one read per file can be active at a time, no matter how many threads are
> reading. That prevents the OS and hardware from doing IO scheduling across many concurrent
> reads.
> We need to either use a reentrant API (pread may be partially reentrant, according to Todd)
> or use multiple stream objects, one per scanner/thread.
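
To illustrate the difference between the locked seek-then-read quoted above and a positional
read (pread), here is a minimal sketch.  It uses java.nio's FileChannel as a stand-in, not the
HFile or Hadoop stream classes, and the class and method names (PositionalReadSketch,
lockedRead, positionalRead) are hypothetical.  It is not taken from either attached patch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: FileChannel stands in for the real input stream.
public class PositionalReadSketch {

    // Same shape as the quoted snippet: seek and read share the stream's
    // position, so the pair must be guarded by a lock on the stream.
    static int lockedRead(FileChannel in, long pos, byte[] b, int off, int n) throws IOException {
        synchronized (in) {
            in.position(pos);
            return in.read(ByteBuffer.wrap(b, off, n));
        }
    }

    // Positional read: the offset travels with the call and the channel's own
    // position is left untouched, so callers do not need a shared lock.
    static int positionalRead(FileChannel in, long pos, byte[] b, int off, int n) throws IOException {
        return in.read(ByteBuffer.wrap(b, off, n), pos);
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("pread-sketch", ".bin");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                byte[] a = new byte[4];
                byte[] b = new byte[4];
                // A single read may return fewer bytes than requested;
                // real code would loop until the buffer is filled.
                int na = lockedRead(ch, 2, a, 0, 4);
                int nb = positionalRead(ch, 6, b, 0, 4);
                System.out.println(na + " bytes: " + new String(a, 0, na, StandardCharsets.US_ASCII));
                System.out.println(nb + " bytes: " + new String(b, 0, nb, StandardCharsets.US_ASCII));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Hadoop's FSDataInputStream exposes a comparable positional call, read(long position, byte[]
buffer, int offset, int length).  Whether it is fully reentrant for a given filesystem is the
open question raised in the issue description, so treat the lock-free version here as a sketch
of the idea rather than a statement about HDFS behavior.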

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
