hbase-user mailing list archives

From Frédéric Fondement <frederic.fondem...@uha.fr>
Subject Re: TTL performance
Date Mon, 25 Jun 2012 08:34:39 GMT

Thanks for your answers.

Actually, I already have control over my major compactions using a cron
job at night, which merely executes this bash code:
echo "status 'detailed'" | hbase shell | grep "<<table prefix>>" | awk 
-F, '{print $1}' | tr -d ' ' | sort | uniq -c | sort -nr | awk '{print 
"major_compact " sprintf( "%c", 39 ) $2 sprintf( "%c", 39 )}' | hbase 
shell >>$LOGFILE 2>&1
This ensures that the biggest regions are major-compacted first.

I'm not using versions.

My question was actually: given a table with millions, billions, or
however many rows, how fast is the TTL handling process? How are rows
scanned during major compaction? Are they all scanned in order to
decide whether they should be removed from the filesystem (be it HDFS
or anything else)? Or is there an optimization that quickly finds the
parts to be deleted?

Best regards,


On 21/06/2012 23:03, Andrew Purtell wrote:
>> 2012/6/21, Frédéric Fondement<frederic.fondement@uha.fr>:
>> opt3. looks the nicest (only 3-4 tables to scan when reading), but won't my daily
major compact become crazy ?
> If you want more control over the major compaction process, for
> example to lessen the load on your production cluster to a constant
> background level, the HBase shell is the JRuby irb so you have the
> full power of the HBase API and Ruby, in the worst case you can write
> a shell script that gets a list of regions and triggers major
> compaction on each region separately or according to whatever policy
> you construct. The script invocation can happen manually or out of
> crontab.
> Another performance consideration is how many expired cells might have
> to be skipped by a scan. If you have a wide area of the keyspace that
> is all expired at once, then the scan will seem to "pause" while
> traversing this area. However, you can use setTimeRange to bound your
> scan by time range and then HBase can optimize whole HFiles away just
> by examining their metadata. Therefore I would recommend using both
> TTLs for automatic background garbage collection of expired entries,
> as well as time range bounded scans for read time optimization.
> Incidentally, there was an interesting presentation at HBaseCon
> recently regarding a creative use of timestamps:
> http://www.slideshare.net/cloudera/1-serving-apparel-catalog-from-h-base-suraj-varma-gap-inc-finalupdated-last-minute
> (slide 16).
> Best regards,
>     - Andy
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
