hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13099) Scans as in DynamoDB
Date Wed, 25 Feb 2015 19:19:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337017#comment-14337017 ]

Enis Soztutar commented on HBASE-13099:

I think we may have to keep at least some state in the server, even if we do a cell-based
scanner. Our contract is per-row atomicity, so we have to keep track of: 
1. read point while scanning inside a row. 
2. low watermark for the read points across all "open" scanners for the region. 

(1) can even be extended to a region-based contract if we consider atomic cross-row updates
using the MultiRowMutationEndpoint. (2) is needed to effectively get rid of the seqIds of
cells in hfiles.
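
As a rough illustration of (2), here is a minimal sketch of tracking the low watermark across
open scanners. It is a toy model, not HBase code: ReadPointTracker, strip_seq_ids and the
(row, value, seq_id) tuple layout are hypothetical names made up for this example.

```python
class ReadPointTracker:
    """Hypothetical per-region registry of the read points of open scanners."""

    def __init__(self):
        self._open = {}      # scanner id -> read point taken when the scanner opened
        self._next_id = 0

    def open_scanner(self, read_point):
        scanner_id = self._next_id
        self._next_id += 1
        self._open[scanner_id] = read_point
        return scanner_id

    def close_scanner(self, scanner_id):
        self._open.pop(scanner_id, None)

    def low_watermark(self, current_read_point):
        # With no open scanners, only the region's current read point matters.
        return min(self._open.values(), default=current_read_point)


def strip_seq_ids(cells, watermark):
    """Zero the seqId of every cell that all open scanners can already see."""
    return [(row, value, 0 if seq_id <= watermark else seq_id)
            for row, value, seq_id in cells]
```

Cells at or below the watermark are visible to every open scanner, so their seqId no longer
affects visibility and can be dropped when files are rewritten.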

We keep (1) on the server side right now, and we use the row-based scanner contract for (1).
The client either gets the whole row, or not. The scanner can be restarted across rows, which
changes the scanner read point, but that is fine since there are no guarantees across rows for
visibility (excluding single-region multi-row transactions).

From a semantics point of view, (1) can be achieved by sending the read point to the
client every time a scan is started within a region. The client will keep track of 1 read point
per region. Any subsequent scans performed from the client in the region will also send this
read point to the server so that the scan does not see partial data. (2) can be solved by
either not deleting the seqIds of cells in hfiles (which we do to optimize disk usage), or keeping
track of all open scanners' read points, which still requires some state (even though very
small) in the server.
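
To make the client-held read point concrete, here is a minimal sketch under simplifying
assumptions. Region and Client are hypothetical toy classes, not the HBase API; a cell is
modelled as (row, qualifier, value, seq_id), and all cells of one row mutation share a seq_id.

```python
class Region:
    """Toy in-memory region; each cell is (row, qualifier, value, seq_id)."""

    def __init__(self, name):
        self.name = name
        self.cells = []
        self.read_point = 0   # highest committed sequence id

    def put_row(self, row, columns):
        # One sequence id per row mutation models per-row atomicity.
        seq_id = self.read_point + 1
        for qualifier, value in sorted(columns.items()):
            self.cells.append((row, qualifier, value, seq_id))
        self.read_point = seq_id

    def scan(self, start_after=None, limit=1, read_point=None):
        # Stateless call: the caller supplies both the position and the read point.
        if read_point is None:
            read_point = self.read_point
        latest = {}
        for row, qualifier, value, seq_id in self.cells:
            if seq_id <= read_point:
                latest[(row, qualifier)] = value   # later writes override earlier ones
        keys = sorted(k for k in latest if start_after is None or k > start_after)
        page = [(k[0], k[1], latest[k]) for k in keys[:limit]]
        return page, read_point


class Client:
    """Keeps one read point per region and resends it on every later scan."""

    def __init__(self):
        self.read_points = {}   # region name -> read point from the first scan

    def scan(self, region, start_after=None, limit=1):
        page, read_point = region.scan(start_after, limit,
                                       self.read_points.get(region.name))
        self.read_points.setdefault(region.name, read_point)
        return page
```

If a row is rewritten between two cell-based calls, the second call still returns the old
version of the remaining cells, because the client resends the read point it received on the
first call.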

> Scans as in DynamoDB
> --------------------
>                 Key: HBASE-13099
>                 URL: https://issues.apache.org/jira/browse/HBASE-13099
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: Client, regionserver
>            Reporter: Nicolas Liochon
> cc: [~saint.ack@gmail.com] - as discussed offline.
> DynamoDB has a very simple way to manage scans server side:
> ??citation??
> The data returned from a Query or Scan operation is limited to 1 MB; this means that
> if you scan a table that has more than 1 MB of data, you'll need to perform another Scan operation
> to continue to the next 1 MB of data in the table.
> If you query or scan for specific attributes that match values that amount to more than
> 1 MB of data, you'll need to perform another Query or Scan request for the next 1 MB of data.
> To do this, take the LastEvaluatedKey value from the previous request, and use that value
> as the ExclusiveStartKey in the next request. This will let you progressively query or scan
> for new data in 1 MB increments.
> When the entire result set from a Query or Scan has been processed, the LastEvaluatedKey
> is null. This indicates that the result set is complete (i.e. the operation processed the
> “last page” of data).
> ??citation??
> This means that there is no state server side: the work is done client side.
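
The paging loop described in the citation can be sketched as follows. This is a toy model over
an in-memory dict, not the DynamoDB SDK: scan_page and scan_all are hypothetical helpers, and
page_size (an item count) stands in for the 1 MB limit.

```python
def scan_page(table, exclusive_start_key=None, page_size=2):
    """Toy stateless scan: return one page plus the key to resume from."""
    keys = sorted(k for k in table
                  if exclusive_start_key is None or k > exclusive_start_key)
    page_keys = keys[:page_size]
    items = [(k, table[k]) for k in page_keys]
    # A null LastEvaluatedKey signals that the result set is complete.
    last_evaluated_key = page_keys[-1] if len(keys) > page_size else None
    return items, last_evaluated_key


def scan_all(table, page_size=2):
    """Client-side loop: feed each LastEvaluatedKey back as ExclusiveStartKey."""
    items, start_key = [], None
    while True:
        page, start_key = scan_page(table, start_key, page_size)
        items.extend(page)
        if start_key is None:
            return items
```

All the paging state lives in the key the client passes back, so the server side of this
sketch holds nothing between calls.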
