hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-10742) Data temperature aware compaction policy
Date Thu, 13 Mar 2014 21:45:47 GMT
Andrew Purtell created HBASE-10742:

             Summary: Data temperature aware compaction policy
                 Key: HBASE-10742
                 URL: https://issues.apache.org/jira/browse/HBASE-10742
             Project: HBase
          Issue Type: Brainstorming
            Reporter: Andrew Purtell

Reading "Identifying Hot and Cold Data in Main-Memory Databases" (Levandoski, Larson, and
Stoica), it occurred to me that some of the motivation applies to HBase and some of the results
can inform a data temperature aware compaction policy implementation.

We also wish to optimize retention of cells in the working set in memory, in blockcache. 

We can also consider further and related performance optimizations in HBase that awareness
of hot and cold data can enable, even for cases where the working set does not fit in memory.
If we could partition HFiles into hot and cold (cold+lukewarm) and move cells between them
at compaction time, then we could:

- Migrate hot HFiles onto alternate storage tiers with improved read latency and throughput
characteristics. This has been discussed before on HBASE-6572. Or, migrate cold HFiles to
an archival tier.

- Preload hot HFiles into blockcache to increase cache hit rates, especially when regions
are first brought online. And/or add another LRU priority to increase the likelihood of retention
of blocks in hot HFiles. This could be sufficiently different from ARC to avoid issues there.

- Reduce the compaction priorities of cold HFiles, with proportional reduction in priority
IO and write amplification, since cold files would less frequently participate in reads.

Levandoski et. al. describe determining data temperature with low overhead using an out of
band estimation process running in the background over an access log. We could consider logging
reads along with mutations and similarly process the result in the background. The WAL could
be overloaded to carry access log records, or we could follow the approach described in the
paper and maintain an in memory access log only. 

We chose the offline approach for several reasons. First, as mentioned earlier, the overhead
of even the simplest caching scheme is very high. Second, the offline approach is generic
and requires minimum changes to the database engine. Third, logging imposes very little overhead
during normal operation. Finally, it allows flexibility in when, where, and how to analyze
the log and estimate access frequencies. For instance, the analysis can be done on a separate
machine, thus reducing overhead on the system running the transactional workloads.

Importantly, they only log a sample of all accesses.

To implement sampling, we have each worker thread flip a biased coin before starting a new
query (where bias correlates with sample rate). The thread records its accesses in log buffers
(or not) based on the outcome of the coin flip. In Section V, we report experimental results
showing that sampling 10% of the accesses reduces the accuracy by only 2.5%,

Likewise we would only record a subset of all accesses to limit overheads.

The offline process estimates access frequencies over discrete time slices using exponential
smoothing. (Markers representing time slice boundaries are interleaved with access records
in the log.) Forward and backward classification algorithms are presented. The forward algorithm
requires a full scan over the log and storage proportional to the number of unique cell addresses,
while the backward algorithm requires reading a least the tail of the log in reverse order.

If we overload the WAL to carry the access log, offline data temperature estimation can piggyback
as a WAL listener. If replication is also enabled then we might even consolidate both forward
scans through the WAL into a single sequential read. The forward algorithm would then be a
natural choice. The HBase master is fairly idle most of the time and less memory hungry as
a regionserver, at least in today's architecture. We could probably get away with considering
only row+family as a unique coordinate to minimize space overhead.  Or if instead we maintain
the access logs in memory at the RegionServer, then there is a parallel formulation and we
could benefit from the reverse algorithm's ability to terminate early once confidence bounds
are reached and backwards scanning IO wouldn't be a concern. This handwaves over a lot of

This message was sent by Atlassian JIRA

View raw message