cassandra-user mailing list archives

From "Hiller, Dean" <Dean.Hil...@nrel.gov>
Subject Re: Help for creating a custom partitioner
Date Mon, 01 Oct 2012 12:13:12 GMT
I would be surprised if the random partitioner hurt your performance.  In general, in performance
tests on a 6-node cluster with PlayOrm Scalable SQL, even join queries ended up faster, because
reading all the rows from disks in parallel was way faster than reading them from a single machine
(remember, one disk bottleneck can really hurt, which is why the random partitioner works out so well).

Later,
Dean

From: Clement Honore <honore.c@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, October 1, 2012 2:45 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Help for creating a custom partitioner

Hi,

thanks for your answer.

We plan to use manual indexing too (with native C* indexing for other cases).
So, for one index, we will get plenty of FKs, and a MultiGet call to get all the associated
entities would then, with RP, be spread across the whole cluster.
As we don't know the cluster size yet, and as it's expected to grow at an unknown rate, we
are already thinking about alternatives for scalability.

But, to tell the truth, we have not done any performance tests so far.
As the choice of a partitioner is the first C* cornerstone, though, we are already thinking about
a new partitioner.
We are planning "random vs custom partitioner" tests, hence my questions about creating
a custom one first.

AFAIS, your partitioner (the higher bits of the hash from hashing the category, and the lower
bits of the hash from hashing the document id) will put all the docs of a category on (on
average) one node. Quite interesting, thanks!
I could add such a partitioner to my test suite.

But why not just hash the "category" part of the row key?
With such a partitioner, as said before, many rows on *one* node are going to have the same
hash value.
- if that hurts Cassandra behavior/performance => I am curious to know why. Anyway, in that
case, I see your partitioner, so far, as the best answer to my wishes!
- if it does NOT hurt Cassandra behavior/performance => it sounds, then, like an optimal partitioner
for our needs.

Any idea about Cassandra behavior with such a hash (category-only) partitioner?
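
To make the question concrete, here is a minimal Python sketch of the category-only idea. It is an illustration only, not a Cassandra partitioner: the function name `category_only_token` and the sample keys are hypothetical, and the row key is assumed to be a (category, doc id) pair. It shows that every row of a category ends up with exactly the same token; it does not show how Cassandra behaves with such tokens, which is the open question above.

```python
import hashlib

def category_only_token(row_key):
    """Token derived from the category alone; the doc id is ignored (hypothetical helper)."""
    category, _doc_id = row_key
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

# Hypothetical sample keys: two docs in one category, one doc in another.
keys = [("books/scifi", "doc-1"), ("books/scifi", "doc-2"), ("music/jazz", "doc-1")]
tokens = [category_only_token(k) for k in keys]

# Rows in the same category collide on one token value.
print(tokens[0] == tokens[1])
print(tokens[0] == tokens[2])
```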

Regards,
Clément

2012/9/28 Tim Wintle <timwintle@gmail.com>
On Fri, 2012-09-28 at 18:20 +0200, Clement Honore wrote:
> Hi,
>
> I have hierarchical data.
>
> I'm storing them in CF with rowkey somewhat like (category, doc id), and
> plenty of columns for a doc definition.
>
> I have hierarchical data traversal too.
>
> The user just chooses one category, and then, interact with docs belonging
> only to this category.
>
> 1) If I use RandomPartitioner, all docs could be spread within all nodes in
> the cluster => bad performance.
>
> 2) Using RandomPartitioner, an alternative design could be rowkey=category
> and column name=(doc id, prop name)
>
> I don't want it because I need fixed column names for indexing purposes,
> and the "category" is quite a lonnnng string.
>
> 3) Then, I want to define a new partitioner for my rowkey (category, doc
> id), doing MD5 only for the "category" part.
>
> The question is : with such partitioner, many rows on *one* node are going
> to have the same MD5 value, as a result of this new partitioner.

If you do decide that having rows on the same node is what you want,
then you could take the higher bits of the hash from hashing the
category, and the lower bits of the hash from hashing the document id.

That would mean documents in a category would be close to each other in
the ring - while being unlikely to share the same hash.
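
A rough Python sketch of that composite token, under stated assumptions: MD5 is used for both parts, the 128-bit token is split evenly (64 bits from the category, 64 from the document id), and the names `md5_int` and `composite_token` are made up for this illustration. It is not a Cassandra partitioner implementation and says nothing about Cassandra's actual token range.

```python
import hashlib

TOKEN_BITS = 128     # size of an MD5 digest
CATEGORY_BITS = 64   # assumption: even split between category and doc id
DOC_BITS = TOKEN_BITS - CATEGORY_BITS

def md5_int(text):
    """MD5 digest of the text as an unsigned 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

def composite_token(category, doc_id):
    """Higher bits come from the category hash, lower bits from the doc id hash."""
    high = md5_int(category) >> (TOKEN_BITS - CATEGORY_BITS)  # top bits of the category hash
    low = md5_int(doc_id) & ((1 << DOC_BITS) - 1)             # bottom bits of the doc id hash
    return (high << DOC_BITS) | low

a = composite_token("books/scifi", "doc-1")
b = composite_token("books/scifi", "doc-2")

# Same category: identical high bits (adjacent on the ring), different tokens.
print(a >> DOC_BITS == b >> DOC_BITS)
print(a != b)
```

Shifting `CATEGORY_BITS` up or down trades how tightly a category is packed on the ring against how many distinct tokens its documents can take.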


However, if you're doing this then all reads/writes to the category are
going to be to a single machine. That's not going to spread the load
across the cluster very well as I assume a few categories are going to
be far more popular than others.
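
The load concern can be illustrated with a small simulation. Everything in it is an assumption made for the sake of the example: 6 nodes, 20 categories whose request counts fall off as 1000/rank, and node placement reduced to "hash modulo node count" instead of real ring ranges. It only compares placing by category against placing by full row key.

```python
import hashlib
from collections import Counter

NODES = 6         # assumption: cluster size
CATEGORIES = 20   # assumption: number of categories

def node_for(key):
    """Simplified placement: MD5 of the key modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES

by_category = Counter()  # requests per node when only the category is hashed
by_row = Counter()       # requests per node when the whole row key is hashed

for rank in range(1, CATEGORIES + 1):
    category = f"category-{rank}"
    requests = round(1000 / rank)  # assumed popularity: a few categories dominate
    for i in range(requests):
        by_category[node_for(category)] += 1
        by_row[node_for(f"{category}:doc-{i}")] += 1

print("category placement:", sorted(by_category.values(), reverse=True))
print("row placement:     ", sorted(by_row.values(), reverse=True))
```

With category placement, the node holding the most popular category receives at least that category's whole request count, while row-level placement spreads the same requests across all nodes.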

Have you tested that you actually get bad performance from
RandomPartitioner?

Tim


