hadoop-common-dev mailing list archives

From "James Kennedy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner
Date Wed, 27 Jun 2007 21:22:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508658 ]

James Kennedy commented on HADOOP-1439:
---------------------------------------

Michael suggested that, once it is finished, HADOOP-1531 (RowFilters) may be used to achieve the
above functionality.

As the RowFilter impl is right now, using a regexp on each key encountered may be an expensive
way to do it.

In the above example, even if the endRow functionality works, how do you know where the end
row is? How do you know when you leave the google domain?

It seems to me that there may be several restrictions a user may want to apply to row-keys:
1) Specify a range using start/end keys, assuming you know what they are.
2) Specify a range using a start key and a "page size".  This is useful for retrieving data
in pages, e.g. displaying to UI as user clicks next/last page.
3) Specify a criterion, e.g. regular expressions or more basic string comparison.

Fortunately my RowFilterInterface design can be used to generalize the above.  In the Google
example, I could create a custom RowFilter implementation that can do domain name comparison
more efficiently than general regular expression matching.  Pass that via the client as you
would any other RowFilter impl.  Only thing to make sure of is that the custom impl is in
the classpath of the HRegionServer too.
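
A minimal sketch of such a custom filter, to make the idea concrete. It is not the HADOOP-1531 API: the `SimpleRowFilter` interface, the `DomainPrefixFilter` class, and the reversed-host row-key layout (e.g. `com.google.www/index.html`) are all assumptions made for illustration. It assumes the convention that `filter()` returning true means the row is filtered out.

```java
/** Hypothetical stand-in for a row filter; not the actual HADOOP-1531 interface. */
interface SimpleRowFilter {
    /** Returns true if the row should be filtered out (skipped). */
    boolean filter(String rowKey);
}

/**
 * Keeps only rows whose key belongs to a given domain, using plain string
 * comparison instead of a regular expression.
 * Assumption: row keys start with a reversed host name,
 * e.g. "com.google.www/index.html".
 */
class DomainPrefixFilter implements SimpleRowFilter {
    private final String reversedDomain;

    DomainPrefixFilter(String reversedDomain) {
        this.reversedDomain = reversedDomain;
    }

    @Override
    public boolean filter(String rowKey) {
        if (!rowKey.startsWith(reversedDomain)) {
            return true; // different domain: filter the row out
        }
        if (rowKey.length() == reversedDomain.length()) {
            return false; // exact match on the domain itself: keep
        }
        // Keep sub-domains ("com.google.www...") and paths ("com.google/..."),
        // but reject look-alikes such as "com.googleplex".
        char next = rowKey.charAt(reversedDomain.length());
        return next != '.' && next != '/';
    }
}
```

The boundary check after the prefix match is what a naive `startsWith` would miss; a general regexp could express it too, but at a higher per-row cost.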

For start/end range, you could have a custom RowFilter that checks for an exact match on the
end key. But this won't be as efficient as an explicit endRow parameter because:
A) when RowFilter is not null, HRegion#HScanner is always going to have a little more overhead
even if the filter() implementation itself always just returns false.
B) The filter isn't currently designed to stop the scanner when a certain criterion is met.
When it encounters the endRow, it will just loop through the rest of the rows, filtering them
all out, until it reaches the end of the HRegion.

I think the start/page range has the same issues.  The only difference is that it requires scan-lifetime
state to count the number of (unfiltered?) rows encountered.  It still requires a stop condition trigger.
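
A rough sketch of what that scan-lifetime state could look like. The `PageLimitFilter` class and its `isDone()` hook are hypothetical names, not part of the HADOOP-1531 design, and the sketch assumes the counter tracks rows that were let through rather than every row encountered.

```java
/** Hypothetical page-limiting filter; names are illustrative only. */
class PageLimitFilter {
    private final int pageSize;
    // Scan-lifetime state: number of rows let through so far.
    private int rowsAccepted = 0;

    PageLimitFilter(int pageSize) {
        this.pageSize = pageSize;
    }

    /** Returns true if the row should be filtered out. */
    boolean filter(String rowKey) {
        if (rowsAccepted >= pageSize) {
            return true; // page already full: filter out everything else
        }
        rowsAccepted++;
        return false;
    }

    /** Hypothetical stop-condition hook: true once the page is full. */
    boolean isDone() {
        return rowsAccepted >= pageSize;
    }
}
```

Without something like `isDone()` being consulted by the scanner, the filter would keep rejecting rows one by one until the end of the region.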

If I add that stop condition trigger functionality to the RowFilterInterface and update HScanner
to use it, we could have a number of built-in RowFilter implementations that deal with restrictions
like those above.
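
As a sketch only, the stop condition trigger might look something like the following. Everything here is an assumption for illustration: the `StoppableRowFilter` interface, the `filterAllRemaining()` method name, the `EndRowFilter` class, the exclusive treatment of the end row, and the `SketchScanner` loop over a `TreeMap` standing in for an HRegion's sorted rows. It is not the actual HScanner code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical filter with a stop-condition trigger. */
interface StoppableRowFilter {
    /** Returns true if the row should be filtered out. */
    boolean filter(String rowKey);

    /** Returns true once no further row can pass, so the scanner may stop. */
    boolean filterAllRemaining();
}

/** Stops the scan at the first row key at or past endRow (endRow is exclusive here). */
class EndRowFilter implements StoppableRowFilter {
    private final String endRow;
    private boolean pastEnd = false;

    EndRowFilter(String endRow) {
        this.endRow = endRow;
    }

    @Override
    public boolean filter(String rowKey) {
        if (rowKey.compareTo(endRow) >= 0) {
            pastEnd = true;
        }
        return pastEnd;
    }

    @Override
    public boolean filterAllRemaining() {
        return pastEnd;
    }
}

/** Stand-in for a scanner loop; a sorted map plays the role of the region. */
class SketchScanner {
    int rowsExamined = 0;

    List<String> scan(SortedMap<String, String> region, String startRow,
                      StoppableRowFilter filter) {
        List<String> result = new ArrayList<>();
        for (String rowKey : region.tailMap(startRow).keySet()) {
            rowsExamined++;
            boolean skip = filter.filter(rowKey);
            if (filter.filterAllRemaining()) {
                break; // stop instead of filtering out the rest of the region
            }
            if (!skip) {
                result.add(rowKey);
            }
        }
        return result;
    }
}
```

With rows `a` through `e`, a start row of `b` and an end row of `d`, this loop examines three rows (`b`, `c`, `d`) and returns `b` and `c`, rather than walking on through `e`.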

WRT simple restrictions like start/end/page parameters, there will still be a (perhaps small)
trade-off between performance and generality, depending on whether we implement them independently
or via RowFilterInterface.


> Add endRow parameter to HClient#obtainScanner
> ---------------------------------------------
>
>                 Key: HADOOP-1439
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1439
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>            Priority: Minor
>
> Currently the HClient#obtainScanner looks like this:
> {code}
> public synchronized HScannerInterface obtainScanner(Text[] columns, Text startRow) throws IOException;
> {code}
> Add an overload that allows specification of endRow:
> {code}
> public synchronized HScannerInterface obtainScanner(Text[] columns, Text startRow, Text endRow) throws IOException;
> {code}
> Use Case: Table contains the whole web.  Client just wants to scan google's pages.  Currently,
> client could cut off the scanner as soon as the row key leaves the google domain but cleaner
> if {{HScannerInterface#next()}} returns false

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

