hbase-user mailing list archives

From Pankaj Gupta <pankaj.ro...@gmail.com>
Subject querying data on the basis of timestamp
Date Thu, 14 Mar 2013 22:53:16 GMT
Hi,

I have a question about the performance of queries that select rows by timestamp. The use case
is this:
I want to find all the rows in a key range that changed after one timestamp and up to
another, i.e. exactly what this Scan API does:
    Scan setTimeRange(long minStamp, long maxStamp)
        Get versions of columns only within the specified timestamp range, [minStamp, maxStamp)

Would this query go through all the rows in the key range, or is there an optimization that
makes it faster?

I ask because I read about such an optimization in the following paper:
http://oss.csie.fju.edu.tw/~tzu98/Apache%20Hadoop%20Goes%20Realtime%20at%20Facebook.pdf

Here is the excerpt:
"For data stored in HBase that is time-series or contains a specific, 
known timestamp, a special timestamp file selection algorithm 
was added. Since time moves forward and data is rarely inserted 
at a significantly later time than its timestamp, each HFile will 
generally contain values for a fixed range of time. This 
information is stored as metadata in each HFile and queries that 
ask for a specific timestamp or range of timestamps will check if 
the request intersects with the ranges of each file, skipping those 
which do not overlap. "
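
To make the excerpt concrete, here is a minimal Java sketch of the kind of file-skipping
check it describes. It is not HBase's implementation: the class FileTimeRange, its fields,
and the methods mayContain and selectFiles are hypothetical names chosen for illustration.
It assumes each file records the smallest and largest cell timestamp it holds (both
inclusive) and that the query range is half-open, [minStamp, maxStamp), as in setTimeRange.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TimeRangeFileSelection {

    /** Hypothetical per-file metadata: smallest and largest cell timestamp, both inclusive. */
    public static final class FileTimeRange {
        final String name;
        final long minTimestamp;
        final long maxTimestamp;

        public FileTimeRange(String name, long minTimestamp, long maxTimestamp) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /** True if the file could hold a cell with a timestamp in [minStamp, maxStamp). */
    public static boolean mayContain(FileTimeRange file, long minStamp, long maxStamp) {
        // The file's inclusive range overlaps the half-open query range
        // when it ends at or after minStamp and starts before maxStamp.
        return file.maxTimestamp >= minStamp && file.minTimestamp < maxStamp;
    }

    /** Returns the names of the files that must be read; all others can be skipped. */
    public static List<String> selectFiles(List<FileTimeRange> files, long minStamp, long maxStamp) {
        List<String> selected = new ArrayList<>();
        for (FileTimeRange file : files) {
            if (mayContain(file, minStamp, maxStamp)) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileTimeRange> files = Arrays.asList(
                new FileTimeRange("hfile-A", 100, 199),
                new FileTimeRange("hfile-B", 200, 299),
                new FileTimeRange("hfile-C", 300, 399),
                new FileTimeRange("hfile-D", 400, 499));

        // Query range [250, 400): hfile-A ends too early, hfile-D starts at the excluded bound.
        System.out.println(selectFiles(files, 250, 400)); // prints [hfile-B, hfile-C]
    }
}
```

With a check like this, the cost of a time-range scan depends on how many files overlap the
requested range rather than on the total number of files, provided writes arrive roughly in
timestamp order as the excerpt assumes.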


This would work perfectly for my use case, but I don't know whether this optimization, or any
other for this use case, exists in Apache HBase. We are currently running Apache HBase 0.92.1
but are considering moving to 0.94.

Thanks,
Pankaj