hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: TIMERANGE performance on uniformly distributed keyspace
Date Sat, 14 Apr 2012 18:04:54 GMT

Thanks N!  That's a good point.  I'll update the RefGuide with that.

So if the data is evenly distributed (and equally old in every HFile) you
still have the same problem, but it's conceivable that this is not the case.
This is one situation where monotonically increasing keys would actually
help you.
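
To make the two points concrete (whole HFiles are skipped when their time range does not overlap the requested one, but every cell in a file that does overlap still gets examined), here is a toy model in Python. It is only a sketch and not HBase code: HFileModel, select_files and scan_timerange are made-up names, the row keys and timestamps are invented, and the half-open query range [t_min, t_max) is an assumption about how TIMERANGE bounds are treated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HFileModel:
    """Hypothetical stand-in for an HFile: a name plus (row_key, timestamp) cells."""
    name: str
    cells: List[Tuple[str, int]]

    @property
    def min_ts(self) -> int:
        # Oldest timestamp in the file (part of the per-file metadata).
        return min(ts for _, ts in self.cells)

    @property
    def max_ts(self) -> int:
        # Newest timestamp in the file (part of the per-file metadata).
        return max(ts for _, ts in self.cells)


def select_files(files, t_min, t_max):
    """Keep only files whose [min_ts, max_ts] overlaps the query range [t_min, t_max)."""
    return [f for f in files if f.min_ts < t_max and f.max_ts >= t_min]


def scan_timerange(files, t_min, t_max):
    """Return the matching cells and how many cells had to be examined."""
    examined = 0
    hits = []
    for f in select_files(files, t_min, t_max):
        for row, ts in f.cells:  # no start/stop row, so every cell is checked
            examined += 1
            if t_min <= ts < t_max:
                hits.append((row, ts))
    return sorted(hits), examined


# Two flushed files: random-looking keys, but each file covers its own time window.
flushed = [
    HFileModel("f1", [("c9", 1), ("1a", 2), ("7e", 3)]),
    HFileModel("f2", [("04", 4), ("e2", 5), ("9b", 6)]),
]
# After a major compaction: one file spanning the whole time range.
compacted = [HFileModel("big", sorted(flushed[0].cells + flushed[1].cells))]

hits_flushed, examined_flushed = scan_timerange(flushed, 4, 7)
hits_compacted, examined_compacted = scan_timerange(compacted, 4, 7)
```

In this model both scans return the same three cells, but the flushed layout examines 3 cells (f1 is skipped outright) while the compacted layout examines all 6, because the single file's time range overlaps every query.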





On 4/14/12 11:57 AM, "N Keywal" <nkeywal@gmail.com> wrote:

>Hi,
>
>For the filtering part, every HFile is associated with a set of metadata.
>This metadata includes the time range. So if there is no overlap between
>the time range you want and the time range of the file, the HFile is
>skipped entirely.
>
>This work is done in StoreScanner#selectScannersFrom
>
>Cheers,
>
>N.
>
>
>On Sat, Apr 14, 2012 at 5:11 PM, Doug Meil
><doug.meil@explorysmedical.com> wrote:
>
>> Hi there-
>>
>> With respect to:
>>
>> "* Does it need to hit every memstore and HFile to determine if there
>> is data available? And if so does it need to do a full scan of that
>> file to determine the records qualifying to the timerange, since keys
>> are stored lexicographically?"
>>
>> And...
>>
>> "Using "scan 'table', {TIMERANGE => [t, t+x]}" :"
>> See...
>>
>>
>> http://hbase.apache.org/book.html#regions.arch
>> 8.7.5.4. KeyValue
>>
>>
>>
>> The timestamp is an attribute of the KeyValue, but unless you restrict
>> the scan using start/stop row it has to process every row.
>>
>> Major compactions don't change this fact, they just change the number of
>> HFiles that have to get processed.
>>
>>
>>
>> On 4/14/12 10:38 AM, "Rob Verkuylen" <rob@verkuylen.net> wrote:
>>
>> >I'm trying to find a definitive answer to the question of whether scans
>> >on timerange alone will scale when you use uniformly distributed keys
>> >like UUIDs.
>> >
>> >Since the keys are randomly generated that would mean the keys will be
>> >spread out over all RegionServers, Regions and HFiles. In theory,
>> >assuming enough writes, that would mean that every HFile will contain
>> >the entire timerange of writes.
>> >
>> >Now before a major compaction, data is in the memstores and (non
>> >max.filesize) flushed & merged HFiles. I can imagine that a scan using
>> >a TIMERANGE can quickly serve from memstores and the smaller files, but
>> >how does it perform after a major compaction?
>> >
>> >Using "scan 'table', {TIMERANGE => [t, t+x]}" :
>> >* How does HBase handle this query in this case (UUIDs)?
>> >* Does it need to hit every memstore and HFile to determine if there is
>> >data available? And if so does it need to do a full scan of that file
>> >to determine the records qualifying to the timerange, since keys are
>> >stored lexicographically?
>> >
>> >I've run some tests on 300+ region tables, on month-old data (so after
>> >major compaction) and performance/response seems fairly quick. But I'm
>> >trying to understand why that is, because hitting every HFile on every
>> >region seems to be inefficient. Lars' book figure 9-3 seems to indicate
>> >this as well, but I can't seem to get the answer from the book or
>> >anywhere else.
>> >
>> >Thnx, Rob
>>
>>
>>


