hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9272) A simple parallel, unordered scanner
Date Thu, 22 Aug 2013 06:14:52 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747277#comment-13747277 ]

Lars Hofhansl commented on HBASE-9272:

I was thinking about this too (i.e. keeping all RSs busy). On the other hand, I was trying
to keep this simple, assuming that in most cases the region-to-server assignment would be
more or less random.
With some number of threads and a reasonably sized cluster (without which a parallel scanner
does not help much anyway), one would expect a fairly even load distribution.

So a test with more regions should see the same speedup; there is nothing inherently costly
per region (each ClientScanner will need to find its region again, but the location should be cached).

There are other considerations too. For example, instead of having a task per Region, one
could split the requested rowkey space into N slices (using the region boundaries as a poor man's
histogram by assuming that all regions will be of roughly the same size in bytes). In that
case one would keep the number of threads unlimited but instead limit the number of tasks
(i.e. slices).
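
To make the slicing idea concrete, here is a minimal sketch. It is not taken from the attached ParallelClientScanner.java; the class name RegionSlicer, the method slice, and the use of String keys instead of byte[] are all assumptions made for illustration. It treats every region as equally sized and groups consecutive regions into at most N slices, so it never cuts inside a region.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase or of the attached prototype.
public class RegionSlicer {

    /**
     * Groups consecutive regions into at most n slices of roughly equal
     * region count, assuming all regions hold about the same amount of data.
     *
     * @param boundaries sorted region boundaries: the start key of every
     *                   region followed by the end key of the last one, so
     *                   boundaries.size() == number of regions + 1
     * @param n          desired number of slices (tasks)
     * @return list of {startKey, endKey} pairs, one per slice
     */
    public static List<String[]> slice(List<String> boundaries, int n) {
        int regions = boundaries.size() - 1;
        if (regions < 1 || n < 1) {
            throw new IllegalArgumentException("need at least one region and one slice");
        }
        // Without a finer histogram we cannot split inside a region,
        // so the slice count is capped at the region count.
        int slices = Math.min(n, regions);
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < slices; i++) {
            // Integer arithmetic spreads any remainder across the slices.
            int from = (int) ((long) i * regions / slices);
            int to = (int) ((long) (i + 1) * regions / slices);
            result.add(new String[] { boundaries.get(from), boundaries.get(to) });
        }
        return result;
    }
}
```

Each returned pair would become one scan task; the empty string stands in for the open start and end of the table, as an empty key does in HBase.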

(Also, the penalty above was 2.2% rather than 1.5%, but that was just a single run anyway.)

Will do a test with more regions.

> A simple parallel, unordered scanner
> ------------------------------------
>                 Key: HBASE-9272
>                 URL: https://issues.apache.org/jira/browse/HBASE-9272
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>            Priority: Minor
>         Attachments: ParallelClientScanner.java, ParallelClientScanner.java
> The contract of ClientScanner is to return rows in sort order. That limits the order
> in which regions can be scanned.
> I propose a simple ParallelScanner that does not have this requirement and queries regions
> in parallel, returning whatever gets returned first.
> This is generally useful for scans that filter a lot of data on the server, or in cases
> where the client can very quickly react to the returned data.
> I have a simple prototype (it doesn't do error handling right, and might be a bit heavy
> on the synchronization side: it uses a BlockingQueue to hand data between the client using
> the scanner and the threads doing the scanning; it also could potentially starve some scanners
> long enough to time out at the server).
> On the plus side, it's only 130 lines of code. :)
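
The BlockingQueue hand-off described in the issue can be sketched roughly as below. This is not the attached prototype and uses no HBase classes: UnorderedParallelScan, scanAll, and the per-region Iterable inputs are hypothetical stand-ins for region scanners. Like the prototype, it makes no real attempt at error handling; a region iterator that throws would leave the consumer waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration; each Iterable stands in for one region's scanner.
public class UnorderedParallelScan {
    // End-of-region marker; one is queued per region after its last row.
    private static final Object DONE = new Object();

    /** Returns all rows from all regions in whatever order they arrive. Rows must be non-null. */
    public static <T> List<T> scanAll(List<? extends Iterable<T>> regions, int threads) {
        // A bounded queue applies back-pressure to the scanning threads.
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (Iterable<T> region : regions) {
                pool.execute(() -> {
                    try {
                        for (T row : region) {
                            queue.put(row);   // blocks while the queue is full
                        }
                        queue.put(DONE);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            List<T> out = new ArrayList<>();
            int remaining = regions.size();
            while (remaining > 0) {
                Object item = queue.take();   // first row available wins
                if (item == DONE) {
                    remaining--;
                } else {
                    @SuppressWarnings("unchecked")
                    T row = (T) item;
                    out.add(row);
                }
            }
            return out;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A slow consumer keeps the queue full and blocks the scanning threads, which is the starvation risk mentioned in the description: a blocked scanner could sit idle long enough to time out at the server.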

