hbase-user mailing list archives

From Stephen Durfey <sjdur...@gmail.com>
Subject Re: Two questions about the maximum number of versions of a column family
Date Sun, 21 Feb 2016 18:28:51 GMT
I personally don't deal with time series data, so I'm not going to make a statement on which
is better. I would think that, from a scanning viewpoint, putting the timestamp in the row key is
easier, but it will introduce scan performance bottlenecks because row keys are
stored lexicographically. All data from the same date range will end up in the same region
or regions (this causes hot spots), reducing the number of tasks you get for reads and thus
increasing extraction time.
One method to deal with this is salting your row keys to get an even distribution of data
around the cluster. Cloudera recently had a good post about this on their blog: http://blog.cloudera.com/blog/2015/06/how-to-scan-salted-apache-hbase-tables-with-region-specific-key-ranges-in-mapreduce/
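
A minimal sketch of the salting idea, assuming a one-byte salt derived from a hash of the original key. The class name, the bucket count of 16, and the sample keys are hypothetical and chosen only for illustration; the sketch uses plain Java and no HBase API, and it is not the scheme from the Cloudera post.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SaltedKey {
    // Hypothetical bucket count; in practice it would be chosen relative
    // to the number of regions or region servers.
    static final int BUCKETS = 16;

    /** Returns the row key prefixed with one salt byte in [0, BUCKETS). */
    static byte[] salt(byte[] rowKey) {
        // Deriving the salt from the key keeps it deterministic, so a
        // point read can recompute the same prefix.
        int bucket = Math.floorMod(Arrays.hashCode(rowKey), BUCKETS);
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = (byte) bucket;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    public static void main(String[] args) {
        // Keys that would otherwise sort next to each other can land in
        // different buckets once salted.
        for (String key : new String[] {"20160221-0001", "20160221-0002", "20160221-0003"}) {
            byte[] salted = salt(key.getBytes(StandardCharsets.UTF_8));
            System.out.println(key + " -> bucket " + salted[0]);
        }
    }
}
```

The trade-off is on the read side: a range scan over the original key order has to be issued once per bucket and the results merged, which is the problem the linked post addresses.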

On Sun, Feb 21, 2016 at 9:47 AM -0800, "Daniel" <daniel@abde.me> wrote:

Thanks for sharing, Stephen and Ted. The reference guide recommends "rows" over "versions"
concerning time series data. Are there advantages of using "reversed timestamps" in row keys
over the built-in "versions" with regard to scanning performance?
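
For readers unfamiliar with the "reversed timestamp" row-key idea mentioned above, here is a minimal sketch in plain Java with no HBase API. It assumes non-negative millisecond timestamps and a series identifier placed before the timestamp; the class and method names are hypothetical. The key stores `Long.MAX_VALUE - timestamp` in big-endian form, so that under byte-wise lexicographic ordering the newest entry of a series sorts first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ReversedTimestampKey {
    /** Builds seriesId + (Long.MAX_VALUE - timestampMillis) as a row key. */
    static byte[] rowKey(String seriesId, long timestampMillis) {
        byte[] id = seriesId.getBytes(StandardCharsets.UTF_8);
        long reversed = Long.MAX_VALUE - timestampMillis;
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(reversed) // ByteBuffer writes big-endian by default
                .array();
    }

    /** Unsigned byte-wise lexicographic comparison of two keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] older = rowKey("sensor1", 1456079331000L);
        byte[] newer = rowKey("sensor1", 1456079332000L);
        // A negative result means the newer key sorts before the older one.
        System.out.println(compare(newer, older));
    }
}
```

With keys built this way, a scan that starts at the series prefix returns the most recent rows first, without relying on multiple versions of a cell.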

------------------ Original ------------------
From: "Ted Yu"
Date: Mon, Feb 22, 2016 01:02 AM
To: "user@hbase.apache.org";
Subject: Re: Two questions about the maximum number of versions of a column family

Thanks for sharing, Stephen.

bq. scan performance on the region servers needing to scan over all that
data you may not need

When the number of versions is large, try to use Filters (where
appropriate) that implement:

  public Cell getNextCellHint(Cell currentKV) {

See MultiRowRangeFilter for example.

Please see hbase-shell/src/main/ruby/shell/commands/alter.rb for the syntax
to alter a table. When "hbase.online.schema.update.enable" is true, the table
can stay online during the change.


On Sun, Feb 21, 2016 at 8:20 AM, Stephen Durfey  wrote:

> Someone please correct me if I am wrong.
> I've looked into this recently due to some performance reasons with my
> tables in a production environment. Like the book says, I don't recommend
> keeping this many versions around unless you really need them. Telling
> HBase to keep around a very large number doesn't waste space, that's just a
> value in the table descriptor. So, I wouldn't worry about that. The
> problems are going to come in when you actually write out those versions.
> My tables currently have max_versions set and roughly 40% of the tables' size
> is due to historical versions. One table in particular is around 25
> TB. I don't have a need to keep this many versions, so I am working on
> changing the max versions to the default of 3 (some cells are hundreds or
> thousands of versions deep). The issue you'll run into is scan performance on
> the region servers needing to scan over all that data you may not need (due
> to large store files). This could lead to increased scan time and
> potentially scanner timeouts, depending upon how large your batch size is
> set on the scan.
> I assume this has some performance impact on compactions, both minor and
> major, but I didn't investigate that, and potentially on the write path,
> but also not something I looked into.
> Changing the number of versions after the table has been created doesn't
> have a performance impact due to just being a metadata change. The table
> will need to be disabled, changed, and re-enabled again. If this is done
> through a script the table could be offline for a couple of seconds. The
> only concern around that is users of the table. If they have scheduled job
> runs that hit that table, those would break if they try to read from it while
> the table is disabled. The only performance impact I can think of around
> this change would be major compaction of the table, but even that shouldn't
> be an issue.
>     _____________________________
> From: Daniel 
> Sent: Sunday, February 21, 2016 9:22 AM
> Subject: Two questions about the maximum number of versions of a column
> family
> To: user 
> Hi, I have two questions about the maximum number of versions of a column
> family:
> (1) Is it OK to set a very large (>100,000) maximum number of versions for
> a column family?
> The reference guide says "It is not recommended setting the number of max
> versions to an exceedingly high level (e.g., hundreds or more) unless those
> old values are very dear to you because this will greatly increase
> StoreFile size." (Chapter 36.1)
> I'm new to the Hadoop ecosystem, and have no idea about the consequences
> of a very large StoreFile size.
> Furthermore, is it OK to set a large maximum number of versions but insert
> only a few versions? Does it waste space?
> (2) How much performance overhead does it cause to increase the maximum
> number of versions of a column family after enormous (e.g. billions) rows
> have been inserted?
> Regards,
> Daniel
