hbase-user mailing list archives

From Billy Watson <williamrwat...@gmail.com>
Subject Re: Full table scan cost after deleting Millions of Records from HBase Table
Date Wed, 10 Feb 2016 01:16:01 GMT
If most queries are going to scan the entire table, I'm not sure HBase is
the right solution for you. One of the advantages of HBase, in my opinion,
is laying data out so that you can do skip-scans, where lots of data is
never read during a particular query.

If you're deleting so much and scanning so much, you might be better off
with Hadoop flat files or some other Hadoop tech like Hive, IMO. The
alternative, if you want to really take advantage of HBase, is to design a
row key that limits your scans.
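To make that concrete, here's a toy model in plain Python (not the HBase client; the sensor/date key scheme is made up for illustration). HBase keeps rows sorted by key, so a scan with start/stop rows only touches the matching slice instead of the whole table:

```python
import bisect

# 3 sensors x 10 days, stored sorted by key, like HBase's lexicographic
# row ordering. A prefix scan touches only the slice for one sensor.
rows = sorted(f"sensor{s:02d}#2016-02-{d:02d}"
              for s in range(3) for d in range(1, 11))

def prefix_scan(sorted_keys, prefix):
    """Return keys starting with `prefix`, skipping everything else."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\xff")
    return sorted_keys[lo:hi]

hits = prefix_scan(rows, "sensor01#")
print(len(rows), len(hits))  # 30 10 -- the scan reads 10 rows, not 30
```

The same idea applies with the real client by setting start/stop rows on a Scan: if your queries always filter on something, put that something at the front of the key.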

But, addressing your questions directly:

1. I *think* region splits and compactions would be fine, but it really
depends on your inserts and deletes. Are your inserts and deletes
randomized across the key space or are they hot-spotty? You'll definitely
have to do some thinking about what your row key should be. Major
compactions will take a while to process all the deletes, however.
2. No, I would think the query speed wouldn't be much faster: a delete in
HBase just writes a tombstone marker, and the deleted cells stay on disk
until they're compacted away, so the scan does roughly the same amount of
i/o. (I'm not familiar with every optimization in the underlying HBase
code, though.) Yes, a major compaction would speed things up after the
deletes.
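A toy model of that bookkeeping (nothing like real HBase internals, just the idea): deletes only add markers, a scan still reads every stored cell plus the markers, and only a major compaction physically drops both.

```python
# Toy model: deletes write tombstones; scans pay for them until compaction.
class Store:
    def __init__(self):
        self.cells = {}        # rowkey -> value
        self.tombstones = set()

    def put(self, key, value):
        self.cells[key] = value

    def delete(self, key):
        self.tombstones.add(key)   # no data is physically removed yet

    def scan(self):
        """Return live rows, plus the number of cells the scan touched."""
        cells_read = len(self.cells) + len(self.tombstones)
        live = {k: v for k, v in self.cells.items()
                if k not in self.tombstones}
        return live, cells_read

    def major_compact(self):
        """Drop deleted cells and their markers for real."""
        for key in self.tombstones:
            self.cells.pop(key, None)
        self.tombstones.clear()

store = Store()
for i in range(6):
    store.put(f"row{i}", i % 2)
for i in range(5):
    store.delete(f"row{i}")

live, cost = store.scan()
print(len(live), cost)  # 1 11 -- one live row, but 11 cells touched
store.major_compact()
live, cost = store.scan()
print(len(live), cost)  # 1 1  -- now the scan really is cheap
```

So with 6M rows and 5M deleted, the pre-compaction scan costs at least as much as before the deletes; the win only shows up after the major compaction.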

William Watson
Lead Software Engineer

On Tue, Feb 9, 2016 at 7:01 PM, houman <baba.opensource@gmail.com> wrote:

> Hi
> I'm thinking of creating a table that will have millions of rows; and each
> day, I would insert and delete millions of rows to/from it.
> Two questions:
> 1. I'm guessing HBase won't have any problems with this approach, but just
> wanted to check that in terms of region-splits or compaction I won't run
> into issues.  Can you think of any problems?
> 2. Let's say there are 6 million records in the table, and I do a full
> table-scan querying a column-family that has a single column whose cell
> value is either 1 or 0.  Let's say it takes N seconds.  Now I bulk delete
> 5 million records (but do not run compaction) and run the same query
> again: would I get a much faster response, or will HBase need to perform
> the same amount of i/o (as if there were still 6 million records there)?
> Once compaction is done, the query would run faster...
> Also most queries on the table would scan the entire table.
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Full-table-scan-cost-after-deleting-Millions-of-Records-from-HBase-Table-tp4077676.html
> Sent from the HBase User mailing list archive at Nabble.com.
