hbase-issues mailing list archives

From "Anil Gupta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7474) Endpoint Implementation to support Scans with Sorting of Rows based on column values(similar to "order by" clause of RDBMS)
Date Thu, 03 Jan 2013 00:28:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542597#comment-13542597 ]

Anil Gupta commented on HBASE-7474:
-----------------------------------

[~tlipcon]
Hi Todd,
Let's walk through an example. I hope you have had a chance to go through the doc attached to the JIRA.

Example: The table has 20 million rows divided among 10 regions (2 million rows per region).
We want to sort on a column that stores a Long value and get the 20 largest values. 500K rows
satisfy the scan filters.

Case 1: The scan does not span multiple regions.
Case 1.1: No coprocessor
The RegionServer needs to transfer 500K rows across the network to the client.
Case 1.2: With coprocessor
The RegionServer will sort the 500K rows on the server side and return only the top 20 rows.
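
As a rough illustration of the server-side step in Case 1.2, here is a minimal sketch in plain Python. It is not HBase or coprocessor API code, and it is not necessarily how the attached patch works: the function name `region_top_k` and the randomly generated `matching_rows` are hypothetical stand-ins. It keeps a bounded min-heap of size k, so the server never holds more than k candidates while it walks the 500K matching values.

```python
import heapq
import random


def region_top_k(values, k):
    """Return the k largest values, in descending order, using a bounded min-heap."""
    heap = []  # min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest retained value, so swap it in
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


# Hypothetical stand-in for the 500K Long values that pass the scan filters.
random.seed(7)
matching_rows = [random.randrange(10**12) for _ in range(500_000)]

top20 = region_top_k(matching_rows, 20)
print(len(top20))  # only these 20 values would need to cross the network
```

Only the contents of `top20` would be returned to the client in this sketch; the other matching rows never leave the server.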

Case 2: The scan spans multiple regions. Let's assume that 250K rows in region1 and 250K
rows in region2 satisfy the scan filters.
Case 2.1: No coprocessor
Region1 will transfer 250K rows to the client.
Region2 will transfer 250K rows to the client.
The client will sort all 500K rows to find the top 20.
Case 2.2: With coprocessor
Region1 will sort its 250K rows and return only its top 20 rows to the client.
Region2 will sort its 250K rows and return only its top 20 rows to the client.
The client will perform a merge sort on the results from region1 and region2.
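
The client-side merge in the coprocessor variant of the multi-region case can be sketched as follows. Again this is plain Python rather than HBase client code, and the function name `merge_top_k` and the sample values are hypothetical. Each region is assumed to return its results already sorted in descending order, so the client only needs a k-way merge over a few short lists.

```python
import heapq
from itertools import islice


def merge_top_k(per_region_results, k):
    """Merge per-region lists (each sorted descending) and keep the first k values."""
    merged = heapq.merge(*per_region_results, reverse=True)
    return list(islice(merged, k))


# Hypothetical per-region results, shortened to 4 values each for readability.
region1_top = [990, 870, 640, 120]
region2_top = [950, 900, 300, 50]

top4 = merge_top_k([region1_top, region2_top], 4)
print(top4)  # [990, 950, 900, 870]
```

With 20 rows per region, the client handles at most 20 rows times the number of regions scanned, no matter how many rows matched the filters on the servers.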

The network I/O difference is huge. IMO, it is not possible to implement sorting in HBase
without a coprocessor. The client will keep dying under the network I/O and extreme memory load
if we don't do the processing on the server side.
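
To put a rough number on that difference, the arithmetic below uses only the row counts from the multi-region example. Row sizes are not given, so it counts rows rather than bytes; the variable names are illustrative.

```python
regions_scanned = 2
matching_rows_per_region = 250_000
top_n = 20

# Rows sent to the client in each variant of the multi-region case
rows_without_coprocessor = regions_scanned * matching_rows_per_region
rows_with_coprocessor = regions_scanned * top_n
reduction_factor = rows_without_coprocessor // rows_with_coprocessor

print(rows_without_coprocessor, rows_with_coprocessor, reduction_factor)
# 500000 40 12500
```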

I understand your concern that it's an extra load on the server side. But currently there is
no better way to achieve this.

If you have a better idea for implementing this in HBase, I would be glad to have a look
at it.

Lastly, it's a coprocessor, so it won't be enabled by default. Users who need it will enable
it, and they will do their due diligence in tuning the cluster for their use case.

Thanks,
Anil Gupta
Software Engineer II, Intuit, inc

                
> Endpoint Implementation to support Scans with Sorting of Rows based on column values (similar to "order by" clause of RDBMS)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7474
>                 URL: https://issues.apache.org/jira/browse/HBASE-7474
>             Project: HBase
>          Issue Type: New Feature
>          Components: Coprocessors, Scanners
>    Affects Versions: 0.94.3
>            Reporter: Anil Gupta
>            Priority: Minor
>              Labels: coprocessors, scan, sort
>             Fix For: 0.94.5
>
>         Attachments: SortingEndpoint_high_level_flowchart.pdf
>
>
> Recently, I have developed an Endpoint which can sort the Results (rows) on the basis of column values. This functionality is similar to the "order by" clause of an RDBMS. I will be submitting this patch for HBase 0.94.3.
> I am almost done with the initial development and testing of the feature, but I still need to write the JUnit tests for it. I will also try to write a design doc.
> Thanks,
> Anil Gupta
> Software Engineer II, Intuit, inc

