hbase-dev mailing list archives

From Mikael Sitruk <mikael.sit...@gmail.com>
Subject Re: Major Compaction Concerns
Date Sun, 19 Feb 2012 22:05:04 GMT
A follow-up...
1. My CFs were already using bloom filters - they used ROWCOL (I didn't pay
attention to that at the time I wrote my answers).
2. I see from the logs that the BF is already at 100% - is that bad? Should
I add more memory for the BF?
3. HLog compression (HBASE-4608) is not scheduled yet - is that intentional?
4. Compaction.ratio is only in the 0.92.x releases, so I cannot use it yet.
5. All the other patches are also for 0.92/0.94, so my situation will not
improve until then, besides playing with the log rolling size and the max
number of store files.
6. I have also noticed that in a pure-insert workload (no reads, empty
regions, new keys) the store files on the RS can reach more than 4500
files, whereas in an update/read scenario the store files did not pass
1500 files per region (flush throttling was active there but not in the
insert case). Is there an explanation for that?
7. I also have a fresh 0.92 install and am checking the behavior there
(additional results soon, hopefully).


On Sat, Jan 14, 2012 at 11:30 PM, Mikael Sitruk <mikael.sitruk@gmail.com> wrote:

> Wow, thank you very much for all those precious explanations, pointers and
> examples. It's a lot to ingest... I will try them (at least what I can with
> 0.90.4 - yes, I'm upgrading from 0.90.1 to 0.90.4) and keep you informed.
> BTW I'm already using compression (GZ); the current data is randomized, so
> I don't get as much gain as you mentioned (I think I'm around 30% only).
> It seems that BF is one of the major things I need to look at, along with
> the compaction.ratio, and I need different settings for my different CFs
> (one CF has a small set of columns and each update changes 50% of them -->
> ROWCOL; the second CF always gets a new column per update --> ROW).
> I'm not keeping more than one version either, and you wrote that this is
> not a point query.
> A suggestion: perhaps take all those examples/explanations and add them
> to the book for future reference.
> Regards,
> Mikael.S
> On Sat, Jan 14, 2012 at 4:06 AM, Nicolas Spiegelberg <nspiegelberg@fb.com> wrote:
>> >I'm sorry but I don't understand. Of course I have disk and network
>> >saturation, and the flush stops flushing because it is waiting for
>> >compaction to finish. Since a major compaction was triggered, all the
>> >stores (a large number) present on the disks (7 disks per RS) will be
>> >grabbed for major compaction, and the I/O is affected. The network is
>> >also affected, since all RSs are major compacting at the same time and
>> >replicating files at the same time (1Gb network).
>> When you have an IO problem, there are multiple pieces at play that you
>> can adjust:
>> Write: HLog, Flush, Compaction
>> Read: Point Query, Scan
>> If your writes are far more than your reads, then you should relax one of
>> the write pieces.
>> - HLog: You can't really adjust HLog IO outside of key compression
>> (HBASE-4608)
>> - Flush: You can adjust your compression.  None->LZO == 5x compression.
>> LZO->GZ == 2x compression.  Both are at the expense of CPU.  HBASE-4241
>> minimizes flush IO significantly in the update-heavy use case (discussed
>> this in the last email).
>> - Compaction: You can lower the compaction ratio to minimize the amount of
>> rewrites over time.  That's why I suggested changing the ratio from 1.2 ->
>> 0.25.  This gives a ~50% IO reduction (blog post on this forthcoming @
>> http://www.facebook.com/UsingHBase ).
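Once on 0.92, the ratio change suggested above would be a one-line config edit. A hedged sketch (assuming 0.92's `hbase.hstore.compaction.ratio` property; 0.25 is the value suggested in this thread, not a universal recommendation):

```xml
<!-- hbase-site.xml: lower the compaction ratio (0.92+ only). -->
<!-- Default is 1.2; 0.25 trades higher StoreFile counts for fewer
     compaction rewrites, cutting compaction IO roughly in half. -->
<property>
  <name>hbase.hstore.compaction.ratio</name>
  <value>0.25</value>
</property>
```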
>> However, you may have a lot more reads than you think.  For example, let's
>> say your read:write ratio is 1:10, so significantly write dominated.
>> Without any of the optimizations I listed in the previous email, your real
>> read ratio is multiplied by the StoreFile count (because you naively read
>> all StoreFiles).  So let's say, during congestion, you have 20 StoreFiles.
>> 1*20:10 means that you're now 2:1 read dominated.  You need features to
>> reduce the number of StoreFiles you scan when the StoreFile count is high.
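The amplification arithmetic above can be sketched directly (a toy calculation, not HBase code):

```python
# Toy illustration of the read-amplification argument above (not HBase code).
# Without bloom filters / lazy seeks, each point read may have to check
# every StoreFile, so the disk sees logical_reads * storefile_count reads.

def effective_read_write_ratio(reads, writes, storefiles):
    """Return (effective_reads, writes) after StoreFile amplification."""
    return reads * storefiles, writes

# Nominal 1:10 read:write (write dominated), but with 20 StoreFiles
# during congestion the disk sees 20:10, i.e. 2:1 read dominated.
eff_reads, writes = effective_read_write_ratio(1, 10, 20)
print(eff_reads, writes)  # 20 10
```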
>> - Point Query: bloom filters (HBASE-1200, HBASE-2794), lazy seek
>> (HBASE-4465), and seek optimizations (HBASE-4433, HBASE-4434, HBASE-4469,
>> HBASE-4532)
>> - Scan: not as many optimizations here.  Mostly revolve around proper
>> usage & seek-next optimization when using filters. Don't have JIRA numbers
>> here, but probably half-dozen small tweaks were added to 0.92.
>> >I don't have an increment workload (the workload either updates columns
>> >on a CF or adds a column on a CF for the same key), so how will those
>> >patches help?
>> Increment & read->update workload end up roughly picking up the same
>> optimizations.  Adding a column to an existing row is no different than
>> adding a new row as far as optimizations are concerned because there's
>> nothing to de-dupe.
>> >I don't say this is a bad thing, this is just an observation from our
>> >test: HBase will slow down the flush when too many store files are
>> >present, and will add pressure on GC and memory, affecting performance.
>> >The update workload does not send the full row content for a given key,
>> >so only partial data is written. In order to get the whole row, I presume
>> >that reading the newest store is not enough ("all" stores need to be
>> >read, collecting the most up-to-date fields to rebuild a full row) - or
>> >am I missing something?
>> Reading all row columns is the same as doing a scan.  You're not doing a
>> point query if you don't specify the exact key (columns) you're looking
>> for.  Setting versions to unlimited, then getting all versions of a
>> particular ROW+COL would also be considered a scan vs a point query as far
>> as optimizations are concerned.
>> >1. If I did not set a specific property for bloom filters (BF), does
>> >that mean I'm not using them (the book only refers to BF with regard
>> >to CF)?
>> By default, bloom filters are disabled, so you need to enable them to get
>> the optimizations.  This is by design.  Bloom Filters trade off cache
>> space for low-overhead probabilistic queries.  Default is 8-bytes per
>> bloom entry (key) & 1% false positive rate.  You can use 'bin/hbase
>> org.apache.hadoop.hbase.io.hfile.HFile' (look at help, then -f to specify
>> a StoreFile and then use -m for meta info) to see your StoreFile's average
>> KV size.  If size(KV) == 100 bytes, then blooms use 8% of the space in
>> cache, which is better than loading the StoreFile block only to get a
>> miss.
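The space trade-off above works out as follows (a toy calculation using the figures quoted in this email - 8 bytes of bloom data per key):

```python
# Toy calculation of bloom-filter cache overhead, using the figures above:
# 8 bytes of bloom data per KV entry (the default cited in this thread).
BLOOM_BYTES_PER_KEY = 8

def bloom_overhead(avg_kv_size_bytes):
    """Fraction of extra cache space spent on blooms, relative to KV size."""
    return BLOOM_BYTES_PER_KEY / avg_kv_size_bytes

# With 100-byte KVs, blooms cost 8% extra cache space - usually cheaper
# than loading a whole StoreFile block only to get a miss.
print(bloom_overhead(100))  # 0.08
```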
>> Whether to use a ROW or ROWCOL bloom filter depends on your write & read
>> pattern.  If you read the entire row at a time, use a ROW bloom.  If you
>> point query, ROW or ROWCOL are both options.  If you write all columns for
>> a row at the same time, definitely use a ROW bloom.  If you have a small
>> column range and you update them at different rates/times, then a ROWCOL
>> bloom filter may be more helpful.  ROWCOL is really useful if a scan query
>> for a ROW will normally return results, but a point query for a ROWCOL may
>> have a high miss rate.  A perfect example is storing unique hash-values
>> for a user on disk.  You'd use 'user' as the row & the hash as the column.
>> In most instances, the hash won't be a duplicate, so a ROWCOL bloom would
>> be better.
>> >3. How can we ensure that compaction will not suck too much I/O if we
>> >cannot control major compaction?
>> TCP Congestion Control will ensure that a single TCP socket won't consume
>> too much bandwidth, so that part of compactions is automatically handled.
>> The part that you need to handle is the number of simultaneous TCP sockets
>> (currently 1 until multi-threaded compactions) & the aggregate data volume
>> transferred over time.  As I said, this is controlled by compaction.ratio.
>>  If temporarily high StoreFile counts cause you to bottleneck, the slight
>> latency variance is an annoyance of the current compaction algorithm, but
>> the underlying problem you should look at solving is the system's
>> inability to filter out unnecessary StoreFiles.

