hbase-user mailing list archives

From Barney Frank <barneyfran...@gmail.com>
Subject Re: hbase evaluation questions
Date Wed, 14 Jul 2010 14:02:17 GMT
I am in the SaaS business myself and, like salesforce.com, store all clients'
data in the same tables.  The approach is to assign each client a unique id
and use it as a prefix for the row key.  That way clients cannot somehow see
other clients' data, since every query includes the client id.  Hence
everything is not mixed together; each client has dedicated rows in a
shared table.
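
To make this concrete, here is a minimal sketch with the native Java
client (the table, family, and column names are made up for
illustration; the same key-prefix idea applies whatever client API you
use):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  // Row key = <clientId>-<entityId>, so each client owns a contiguous
  // key range and queries never cross client boundaries.
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "orders");        // hypothetical table

  String clientId = "client42";
  Put put = new Put(Bytes.toBytes(clientId + "-order-0001"));
  put.add(Bytes.toBytes("d"), Bytes.toBytes("total"), Bytes.toBytes("19.99"));
  table.put(put);

  // Scan only this client's rows: start at "client42-", stop at
  // "client42." ('.' sorts immediately after '-', so the stop key is an
  // exclusive upper bound covering every "client42-..." key).
  Scan scan = new Scan(Bytes.toBytes(clientId + "-"),
                       Bytes.toBytes(clientId + "."));
  ResultScanner scanner = table.getScanner(scan);
  for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
  }
  scanner.close();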

My two cents.

On Wed, Jul 14, 2010 at 5:56 AM, Wayne <wav100@gmail.com> wrote:

> We are a SaaS provider and we want to move to a more shared model (vs. 1
> mysql server per client), but we have concerns about going to a completely
> mixed/shared model where everything is mixed together. It is a
> psychological leap we are not really ready to make, and as a paid-for SaaS
> business we need to retain some level of separation between clients' data
> contractually (separate backups etc.).
>
> As far as tall vs. wide, I see how tall can be beneficial and will work
> best. To me that means hbase is not really a column based data store, as
> there is no way to efficiently access the millions of columns within the
> billions of rows.
>
> Thanks.
>
>
> On Wed, Jul 14, 2010 at 12:32 PM, Angus He <angushe@gmail.com> wrote:
>
> > > 1) How can hbase be configured for a multi-tenancy model? What are the
> > > options to create a solid separation of data? In a relational database
> > > schemas would provide this, and in cassandra the keyspace can provide
> > > the same. Of course we can add the tenancy key to the row key and
> > > create tenant-specific tables/column families, but that does not
> > > provide the same level of confidence of separation. We could also
> > > create separate clusters for each client, but then that defeats part
> > > of the point of going to a distributed database cluster to improve
> > > overall throughput+utilization across all clients. We currently run
> > > single MySQL databases for each of our clients (1-3 TBs each).
> >
> > I was just wondering why you need to separate the analytical data into
> > different tables or hbase instances.
> > Data reliability or security?
> > By the way, in the Bigtable paper, Google mentioned that they packed
> > the data for all web sites into two tables: a raw click table for
> > end-user sessions, and a summary table for summary data.
> > We did the same, and it has worked all right so far.
> >
> > > 2) I am trying to model data within hbase and I am unable to truly
> > > model it as a column based data store due to the limitations of the
> > > API (hbase.thrift) in terms of getting back data for certain columns.
> > > I see information for defining a bloom filter, which I believe could
> > > help speed up the retrieval of certain columns within a large row,
> > > but the API does not seem to offer the ability to iterate through the
> > > columns. The API supports the ability to request a list of columns,
> > > but no way that I have seen to scan columns for a given row key based
> > > on a start/stop column. This forces us to create a tall data model
> > > vs. a wide data model, which in the end we think will hurt
> > > performance as more rows will be required.
> >
> > HBase has no built-in support for column range scans,
> > but you can roll your own implementation based on the versatile HBase
> > filter mechanism.
> > You probably do not need column range scan support at all.
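> >
> > If you do need it, something along these lines might work with the
> > Java client (an untested sketch: the filter classes are the stock
> > ones, but the table and qualifier names are made up, and the
> > qualifiers are assumed to be zero-padded so that lexicographic order
> > matches numeric order):
> >
> >   import org.apache.hadoop.hbase.client.Get;
> >   import org.apache.hadoop.hbase.client.Result;
> >   import org.apache.hadoop.hbase.filter.BinaryComparator;
> >   import org.apache.hadoop.hbase.filter.CompareFilter;
> >   import org.apache.hadoop.hbase.filter.FilterList;
> >   import org.apache.hadoop.hbase.filter.QualifierFilter;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   // Emulate a column range scan within a single row by combining two
> >   // QualifierFilters in a FilterList: qualifier >= date-0050 AND
> >   // qualifier <= date-0100.
> >   FilterList range = new FilterList(FilterList.Operator.MUST_PASS_ALL);
> >   range.addFilter(new QualifierFilter(
> >       CompareFilter.CompareOp.GREATER_OR_EQUAL,
> >       new BinaryComparator(Bytes.toBytes("date-0050"))));
> >   range.addFilter(new QualifierFilter(
> >       CompareFilter.CompareOp.LESS_OR_EQUAL,
> >       new BinaryComparator(Bytes.toBytes("date-0100"))));
> >
> >   Get get = new Get(Bytes.toBytes("foobar"));
> >   get.setFilter(range);
> >   Result result = table.get(get);   // table opened as usual
> >
> > Note that the filter still walks the skipped keyvalues on the server
> > side; it saves network traffic, not disk reads, which relates to the
> > missing seek operation mentioned below.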
> >
> > In my opinion, a tall table is more efficient:
> >
> > 1. A fat table probably needs to process more data to get the same
> > result.
> >
> > tall table:
> > row 1: foobar-date-1
> > row 2: foobar-date-2
> > ...
> > row 1000: foobar-date-1000
> >
> > fat table:
> > row 1: foobar   columns: date-1, date-2, ..., date-1000
> >
> > Assume you want to retrieve the data between date-50 and date-100.
> > In the case of the tall table, just set the scan start key to
> > foobar-date-50 and the end key to foobar-date-100; only about 50
> > keyvalue items are touched.
> > But for the fat table, you have to skip the first 49 columns, date-1
> > through date-49, and then stop at the column date-100, so 100 keyvalue
> > items are involved. This will no longer be true if HBase supports a
> > seek operation within a row some day. (See the code sketch after
> > point 2 below.)
> >
> > 2. More flexible granularity when parallel queries are employed.
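> >
> > A sketch of the tall-table scan from point 1 (Java client; row keys
> > assumed zero-padded, e.g. foobar-date-0050, so that lexicographic key
> > order matches numeric date order):
> >
> >   import org.apache.hadoop.hbase.client.Result;
> >   import org.apache.hadoop.hbase.client.ResultScanner;
> >   import org.apache.hadoop.hbase.client.Scan;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   // Tall table: one row per (entity, date). The stop key is
> >   // exclusive, so use the day after the last one you want.
> >   Scan scan = new Scan(Bytes.toBytes("foobar-date-0050"),
> >                        Bytes.toBytes("foobar-date-0101"));
> >   ResultScanner scanner = table.getScanner(scan);  // table opened as usual
> >   for (Result r : scanner) {
> >       // one Result per day; only the ~50 requested keyvalues are read
> >       System.out.println(Bytes.toString(r.getRow()));
> >   }
> >   scanner.close();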
> >
> > > The data model is a std star schema in relational terms, with a time
> > > dimension. Time is only down to daily granularity, and we would
> > > prefer to have it be part of the column key instead of the row key.
> > > From all examples I have seen, time has always been added to the end
> > > of the row key, to be accessed via row scans. In Cassandra, for
> > > example, time is modeled as a super column or a column composite
> > > index, and the API supports a range get against a set of columns
> > > within a single row.
> > >
> > > Any advice or pointers would be greatly appreciated. Thanks in advance!
> > >
> > > Wayne
> > >
> >
> >
> >
> > --
> > Regards
> > Angus
> >
>
