hbase-user mailing list archives

From Stephen Durfey <sjdur...@gmail.com>
Subject Re: Two questions about the maximum number of versions of a column family
Date Sun, 21 Feb 2016 16:20:25 GMT
Someone please correct me if I am wrong. 
I looked into this recently because of performance problems with my tables in a production
environment. Like the book says, I don't recommend keeping this many versions around unless
you really need them. Telling HBase to keep a very large number of versions doesn't waste space
by itself; that's just a value in the table descriptor, so I wouldn't worry about that. The
problems come when you actually write out that many versions.
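
To make that distinction concrete, here is a minimal sketch of a toy in-memory model. It is
not HBase code: the class name ToyColumnFamily and its methods are made up for illustration,
and it trims old versions eagerly on every write, whereas real HBase drops excess versions
later (during compaction). It only shows that the configured maximum costs nothing by itself
and that the stored size follows the number of versions actually written.

```python
from collections import defaultdict


class ToyColumnFamily:
    """Toy model of a versioned column family (hypothetical, not the HBase API)."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        # (row, qualifier) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, qualifier, timestamp, value):
        versions = self.cells[(row, qualifier)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Simplification: trim eagerly to the newest max_versions entries.
        del versions[self.max_versions:]

    def stored_versions(self):
        return sum(len(v) for v in self.cells.values())


def demo():
    # Huge limit, few writes: only what was written is stored.
    sparse = ToyColumnFamily(max_versions=100_000)
    for ts in range(3):
        sparse.put("row1", "q", ts, f"v{ts}")

    # Huge limit, many writes: every written version is kept.
    dense = ToyColumnFamily(max_versions=100_000)
    for ts in range(5_000):
        dense.put("row1", "q", ts, f"v{ts}")

    # Small limit, many writes: only the newest three survive.
    capped = ToyColumnFamily(max_versions=3)
    for ts in range(5_000):
        capped.put("row1", "q", ts, f"v{ts}")

    return sparse.stored_versions(), dense.stored_versions(), capped.stored_versions()
```

Calling demo() returns the stored version counts for the three cases, in that order.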
My tables currently have max_versions set and roughly 40% of the data in them is historical
versions. One table in particular is around 25 TB. I don't need to keep this many versions,
so I am working on changing the max versions to the default of 3 (some cells are hundreds or
thousands of versions deep). The issue you'll run into is scan performance: the region servers
have to scan over all that data you may not need (because of the large store files). This can
lead to increased scan times and potentially scanner timeouts, depending on how large the
batch size is set on the scan.
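
As a rough illustration of why deep version histories hurt scans, the sketch below counts how
many cells a naive reader touches to return only the newest value per row. It is a simplified
worst case with made-up helper names (build_store_file, scan_latest); a real region server
can sometimes seek past old versions, so treat the numbers as an upper bound, not a
measurement.

```python
def build_store_file(rows, versions_per_row):
    """Build a sorted list of (row, timestamp, value), newest version first per row."""
    return [
        (f"row{r:04d}", ts, f"v{ts}")
        for r in range(rows)
        for ts in range(versions_per_row - 1, -1, -1)
    ]


def scan_latest(store_file):
    """Return the newest value per row and the number of cells read to find them.

    Assumption: the reader walks every cell in order and cannot skip old versions.
    """
    latest = {}
    cells_read = 0
    for row, _ts, value in store_file:
        cells_read += 1
        if row not in latest:
            # First cell seen for a row is its newest version.
            latest[row] = value
    return latest, cells_read
```

With 10 rows and 1,000 versions each, scan_latest reads 10,000 cells to return 10 values;
with 3 versions each it reads 30.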
I assume this also has some performance impact on compactions, both minor and major, and
potentially on the write path, but I didn't investigate either of those.
Changing the number of versions after the table has been created doesn't have a performance
impact, since it is just a metadata change. The table will need to be disabled, altered, and
re-enabled. If this is done through a script, the table could be offline for only a couple
of seconds. The only concern is the users of the table: any scheduled jobs that try to read
from the table while it is disabled will fail.
The only performance impact I can think of from this change would be the major compaction of
the table, but even that shouldn't be an issue.

From: Daniel <daniel@abde.me>
Sent: Sunday, February 21, 2016 9:22 AM
Subject: Two questions about the maximum number of versions of a column family
To: user <user@hbase.apache.org>

Hi, I have two questions about the maximum number of versions of a column family:

(1) Is it OK to set a very large (>100,000) maximum number of versions for a column family?

The reference guide says "It is not recommended setting the number of max versions to an exceedingly
high level (e.g., hundreds or more) unless those old values are very dear to you because this
will greatly increase StoreFile size." (Chapter 36.1)

I'm new to the Hadoop ecosystem, and have no idea about the consequences of a very large StoreFile.

Furthermore, is it OK to set a large maximum number of versions but insert only a few versions?
Does it waste space?

(2) How much performance overhead does it cause to increase the maximum number of versions
of a column family after an enormous number (e.g. billions) of rows have been inserted?


