accumulo-dev mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: sharding via different tables
Date Mon, 17 Aug 2015 17:30:19 GMT
On Mon, Aug 17, 2015 at 11:36 AM, z11373 <z11373@outlook.com> wrote:
> Hi,
> We have requirement to shard by customer id. I see there are two options:
> 1. put the customer id as column family
> 2. create tables for each customer id
>
> The downside with option #1 is deleting rows only for specific customer id
> would be pretty expensive (for option #2, it's simply as deleting tables),
> and not sure if it'd be slower to scan too, though we can filter by column
> family and Accumulo is optimized for that.
>
> The downside with option #2 is when we have more customers later, we'll have
> so many tables. Current implementation needs 4 tables, so we'll end up at
> least (# of customers * 4) tables in Accumulo. Does Accumulo has limit on
> number of tables?
>
> I personally prefer option #2, but perhaps any of you had direct experiences
> with this kind of issue before, and able to share the learning.

First, regarding your question: No, Accumulo does not have any limits
on the number of tables. More tables means more stored per-table state
in ZooKeeper (and possibly more in the metadata tables), so that's
something to keep in mind, but you're not likely to run into problems
creating a few hundred tables.

Second, you have more options than that for table schemas. It really
depends on your goals, though.

Will you ever need to query data from multiple customers at once? If
not, separate tables might be an option. Since you need 4 tables per
customer, you could also create one namespace per customer, each
namespace with its own set of 4 (or more) tables.
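
To make the namespace-per-customer layout concrete, here is a minimal
sketch of how the fully qualified table names could be derived. Accumulo
addresses a table inside a namespace as "namespace.table". The four table
names, the "cust_" prefix, and the sanitizing rule are assumptions made
for illustration only; the original message does not name the tables.

```java
import java.util.ArrayList;
import java.util.List;

class CustomerNamespaces {
    // Hypothetical names for the four per-customer tables.
    static final String[] TABLES = {"data", "index", "meta", "stats"};

    // Derive a namespace name from a customer id. Assumption: characters
    // other than letters, digits, and underscore are replaced by '_' so
    // the result stays within the usual naming restrictions.
    static String namespaceFor(String customerId) {
        return "cust_" + customerId.replaceAll("[^A-Za-z0-9_]", "_");
    }

    // Fully qualified names ("namespace.table") for one customer.
    static List<String> tablesFor(String customerId) {
        String ns = namespaceFor(customerId);
        List<String> names = new ArrayList<>();
        for (String table : TABLES) {
            names.add(ns + "." + table);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tablesFor("acme-42"));
    }
}
```

Note that this sanitizing rule maps "acme-42" and "acme_42" to the same
namespace, so a real deployment would need customer ids that are already
unique after sanitizing, or a different encoding.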

If you expect that you will ever need to query data from multiple
customers at once (or if you find it easier to manage fewer tables),
you may want to consider putting your data in a single set of tables.
You can separate data using the visibility (one authorization per
customer data set), the column family (for performance, you can create
a locality group per customer/column family), or you can segment the
table by prefixing your rows with the customer id so that each
customer's data is logically separate. You can also combine the
visibility/authorization strategy with one of the other strategies, if
you want to both enforce some access controls and facilitate query
performance.
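
As an illustration of the row-prefix strategy, the sketch below emulates
a sorted table with a plain java.util.TreeMap rather than the Accumulo
client API, so it runs without a cluster. It relies only on the fact that
rows are kept in lexicographic order, which makes one customer's rows a
single contiguous range. The ':' separator, the class name, and the
helper names are assumptions for this example, and it assumes customer
ids never contain the separator.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class CustomerRowPrefix {
    // Assumption: customer ids never contain this separator character.
    static final char SEP = ':';

    // Row key = customer id + separator + the customer-local row id.
    static String rowKey(String customerId, String localRow) {
        return customerId + SEP + localRow;
    }

    // Exclusive upper bound of a customer's range: the same prefix with
    // the separator replaced by the next character in sort order.
    static String prefixEnd(String customerId) {
        return customerId + (char) (SEP + 1);
    }

    // All rows of one customer: [customerId + SEP, prefixEnd(customerId)).
    static SortedMap<String, String> scanCustomer(
            TreeMap<String, String> table, String customerId) {
        return table.subMap(customerId + SEP, prefixEnd(customerId));
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("cust1", "row1"), "a");
        table.put(rowKey("cust1", "row2"), "b");
        table.put(rowKey("cust10", "row1"), "c");
        table.put(rowKey("cust2", "row1"), "d");

        // Only cust1's two rows fall in the range; cust10 does not,
        // because the separator is part of the prefix.
        System.out.println(scanCustomer(table, "cust1").keySet());
    }
}
```

Including the separator in the prefix is what keeps "cust10" out of a
scan for "cust1". With the real client, a prefix range over the row
(for example via Range.prefix) should serve the same purpose, and
deleting one customer's data becomes a range delete rather than a table
drop.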

The biggest driving factors for your schema should really be: "How do
I expect to query this data?" and "How do I want to protect this
data?"
If you only ever query customer data separately, and you're okay with
protecting the data at the application layer (when it selects which
table to read from), then separate tables alone are probably
sufficient. Otherwise, there are many more options.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
