cassandra-user mailing list archives

From Clement Honore <honor...@gmail.com>
Subject Re: Help for creating a custom partitioner
Date Mon, 01 Oct 2012 08:45:58 GMT
Hi,

thanks for your answer.

We plan to use manual indexing too (with native C* indexing for other
cases).
So, for one index lookup, we will get plenty of foreign keys, and with
RandomPartitioner (RP) the MultiGet call that fetches all the associated
entities would then be spread across the whole cluster.
As we don't know the cluster size yet, and as it's expected to grow at an
unknown rate, we are thinking about alternatives, now, for scalability.
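
To illustrate the fan-out concern, here is a minimal Python sketch, not Cassandra code. It assumes a hypothetical 8-node ring with evenly spaced tokens, uses the full 128-bit MD5 space as a simplification of the real token range, ignores replication, and invents the key format `category:doc-N`. It only counts how many distinct nodes a 100-key MultiGet would touch when the whole row key is hashed versus when only the category is hashed.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 128   # MD5 output space; a simplification of the real token range
NODE_COUNT = 8         # hypothetical cluster size

# Evenly spaced node tokens; each node owns the range ending at its own token.
node_tokens = [(i + 1) * RING_SIZE // NODE_COUNT - 1 for i in range(NODE_COUNT)]


def md5_token(text: str) -> int:
    """MD5 digest of `text` as an unsigned 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


def owner(token: int) -> int:
    """Index of the first node whose token is >= the key token (wrapping around)."""
    return bisect_left(node_tokens, token) % NODE_COUNT


category = "books/scifi"                      # made-up category name
doc_ids = [f"doc-{i}" for i in range(100)]    # made-up document ids

# RandomPartitioner-like placement: hash the whole row key.
random_nodes = {owner(md5_token(f"{category}:{d}")) for d in doc_ids}
# Category-based placement: hash only the category part of the row key.
category_nodes = {owner(md5_token(category)) for _ in doc_ids}

print("nodes touched, whole-key hash:", len(random_nodes))
print("nodes touched, category hash: ", len(category_nodes))
```

With the category hash the MultiGet always lands on a single node; with the whole-key hash it is scattered over the ring.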

But, to tell the truth, so far, we have not done performance tests.
But as the choice of a partitioner is the first C* cornerstone, we are
already thinking about a new partitioner.
We are planning "random vs custom partitioner" tests, hence my questions
about creating a custom one first.

AFAIS, your partitioner (the higher bits of the hash from hashing the
category, and the lower bits of the hash from hashing the document id) will
put all the docs of a category on (on average) one node. Quite interesting,
thanks!
I could add such a partitioner to my test suite.
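
A minimal Python sketch of how I understand that suggestion follows. It is an illustration only, not a Cassandra partitioner: the function name `composite_token`, the 64/64 bit split and the use of MD5 for both halves are assumptions, since the suggestion does not fix any of them.

```python
import hashlib

HALF_BITS = 64  # assumed split: 64 high bits for the category, 64 low bits for the doc id


def _md5_int(value: str) -> int:
    """MD5 digest of `value` as an unsigned 128-bit integer."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


def composite_token(category: str, doc_id: str) -> int:
    """Hypothetical token: high bits from the category hash, low bits from the doc id hash."""
    high = _md5_int(category) >> (128 - HALF_BITS)    # top 64 bits of the category hash
    low = _md5_int(doc_id) & ((1 << HALF_BITS) - 1)   # bottom 64 bits of the doc id hash
    return (high << HALF_BITS) | low


# Made-up category and document ids, for illustration.
tokens = [composite_token("books/scifi", doc) for doc in ("doc-1", "doc-2", "doc-3")]

# The tokens share their high 64 bits, so they are neighbours on the ring,
# while their low 64 bits differ (barring an unlikely hash collision).
for t in tokens:
    print(f"{t:032x}")
```

Because the category occupies the most significant bits, sorting by token groups each category into one contiguous slice of the ring.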

But why not just hash the "category" part of the row key?
With such a partitioner, as said before, many rows on *one* node are going
to have the same hash value.
- if it hurts Cassandra behavior/performance => I am curious to know why.
Anyway, in that case, I see your partitioner, so far, as the best answer to
my wishes!
- if it's NOT hurting Cassandra behavior/performance => it sounds, then,
like an optimal partitioner for our needs.

Any idea about Cassandra behavior with such a category-only hash
partitioner?
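
To make the question concrete, here is a small Python sketch of what "many rows with the same hash value" means. It is not Cassandra code; the function name `category_only_token` and the sample row keys are made up for illustration.

```python
import hashlib
from collections import Counter


def category_only_token(row_key: tuple) -> int:
    """Hypothetical token: MD5 of the category part only; the doc id is ignored."""
    category, _doc_id = row_key
    return int.from_bytes(hashlib.md5(category.encode("utf-8")).digest(), "big")


# 1000 docs in one category, plus a single doc in another category.
row_keys = [("books/scifi", f"doc-{i}") for i in range(1000)]
row_keys.append(("music/jazz", "doc-0"))

rows_per_token = Counter(category_only_token(key) for key in row_keys)

print("rows:", len(row_keys))
print("distinct tokens:", len(rows_per_token))
print("rows per token:", sorted(rows_per_token.values()))
```

Every row of a category collapses onto one token, so the token alone can no longer tell two rows of the same category apart.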

Regards,
Clément

2012/9/28 Tim Wintle <timwintle@gmail.com>

> On Fri, 2012-09-28 at 18:20 +0200, Clement Honore wrote:
> > Hi,
> >
> > I have hierarchical data.
> >
> > I'm storing them in CF with rowkey somewhat like (category, doc id), and
> > plenty of columns for a doc definition.
> >
> > I have hierarchical data traversal too.
> >
> > The user just chooses one category, and then, interact with docs
> > belonging only to this category.
> >
> > 1) If I use RandomPartitioner, all docs could be spread within all nodes
> > in the cluster => bad performance.
> >
> > 2) Using RandomPartitioner, an alternative design could be
> > rowkey=category and column name=(doc id, prop name)
> >
> > I don't want it because I need fixed column names for indexing purposes,
> > and the "category" is quite a lonnnng string.
> >
> > 3) Then, I want to define a new partitioner for my rowkey (category, doc
> > id), doing MD5 only for the "category" part.
> >
> > The question is : with such partitioner, many rows on *one* node are
> > going to have the same MD5 value, as a result of this new partitioner.
>
> If you do decide that having rows on the same node is what you want,
> then you could take the higher bits of the hash from hashing the
> category, and the lower bits of the hash from hashing the document id.
>
> That would mean documents in a category would be close to each other in
> the ring - while being unlikely to share the same hash.
>
>
> However, if you're doing this then all reads/writes to the category are
> going to be to a single machine. That's not going to spread the load
> across the cluster very well as I assume a few categories are going to
> be far more popular than others.
>
> Have you tested that you actually get bad performance from
> RandomPartitioner?
>
> Tim
>
>
