incubator-cassandra-user mailing list archives

From Safdar Kureishy <>
Subject Re: RandomPartitioner is providing a very skewed distribution of keys across a 5-node Solandra cluster
Date Sun, 24 Jun 2012 19:12:25 GMT
Oh, I forgot to mention that I'm using cassandra case that
question comes up.
Hoping someone can offer some more feedback on the likelihood of this
behavior ...
Thanks again,
On Jun 24, 2012 9:22 PM, "Dave Brosius" <> wrote:

>  Well it sounds like this doesn't apply to you.
> If you had set up your column family in CQL as .... PRIMARY KEY
> (domain_name, path).... or something like that, and were looking at lots
> and lots of URL pages (domain_name + path) but from a very small number of
> domain_names, then the partition key being just the domain_name could account
> for an uneven distribution.
> But it sounds like your key is just a URL so that should (in theory) be
> fine.
> On 06/24/2012 01:53 PM, Safdar Kureishy wrote:
> Hi Dave,
>  Would you mind elaborating a bit more on that, preferably with an
> example? AFAIK, Solandra uses the unique id of the Solr document as the
> input for calculating the md5 hash for shard/node assignment. In this case
> the ids are just millions of varied web URLs that do *not* adhere to any
> regular expression. I'm not sure if that answers your question below?
>  Thanks,
> Safdar
> On Sun, Jun 24, 2012 at 8:38 PM, Dave Brosius <> wrote:
>>  If I read what you are saying, you are _not_ using composite keys?
>> That's one thing that could do it, if the first part of the composite key
>> had a very very low cardinality.
>> On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
>>  Hi,
>>  I've searched online but was unable to find any leads for the problem
>> below. This mailing list seemed the most appropriate place. Apologies in
>> advance if that isn't the case.
>>  I'm running a 5-node Solandra cluster (Solr + Cassandra). I've set up
>> the nodes with tokens *evenly distributed across the token space*, for a
>> 5-node cluster (as evidenced below under the "effective-ownership" column
>> of the "nodetool ring" output). My data is a set of a few million crawled
>> web pages, crawled using Nutch, and also indexed using the "solrindex"
>> command available through Nutch. AFAIK, the key for each document generated
>> from the crawled data is the URL.
>>  Based on the "load" values for the nodes below, despite adding about 3
>> million web pages to this index via the HTTP Rest API (e.g.:
>> http://9.9.9.x:8983/solandra/index/update....), some nodes are still
>> "empty". Specifically, nodes and have just a few kilobytes
>> (shown in *bold* below) of the index, while the remaining 3 nodes are
>> consistently getting hammered by all the data. If the RandomPartitioner
>> (which is what I'm using for this cluster) is supposed to achieve an even
>> distribution of keys across the token space, why is it that the data below
>> is skewed in this fashion? Literally, no key has yet been hashed to the
>> nodes and below. Could someone possibly shed some light on
>> this absurdity?
>>  [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
>> Address         DC          Rack        Status State   Load        Effective-Owership  Token
>>                                                                                        136112946768375385385349842972707284580
>>                 datacenter1 rack1       Up     Normal  7.57 GB     20.00%              0
>>                 datacenter1 rack1       Up     Normal  *21.44 KB*  20.00%              34028236692093846346337460743176821145
>>                 datacenter1 rack1       Up     Normal  14.99 GB    20.00%              68056473384187692692674921486353642290
>>                 datacenter1 rack1       Up     Normal  *50.79 KB*  20.00%              102084710076281539039012382229530463435
>>                 datacenter1 rack1       Up     Normal  15.22 GB    20.00%              136112946768375385385349842972707284580
>>  Thanks in advance.
>>  Regards,
>>  Safdar
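
For anyone wanting to sanity-check the expectation discussed in this thread, here is a minimal Python sketch of how an MD5-based partitioner would spread keys over the five tokens from the ring output. It assumes the token is the absolute value of the MD5 digest read as a signed big-endian integer, and that each node owns the range ending at its own token. The URLs and domain names are synthetic, and the helper names (md5_token, owner_index, distribution) are made up for illustration. It also assumes the row key is hashed directly, which may not match how Solandra actually builds its row keys.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# Tokens copied from the "nodetool ring" output quoted in the thread.
RING_TOKENS = [
    0,
    34028236692093846346337460743176821145,
    68056473384187692692674921486353642290,
    102084710076281539039012382229530463435,
    136112946768375385385349842972707284580,
]


def md5_token(key: bytes) -> int:
    """Assumed token function: |MD5 digest as a signed big-endian integer|."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner_index(token: int, ring=RING_TOKENS) -> int:
    """Index of the node whose range (previous_token, own_token] holds token.

    Tokens above the largest ring token wrap around to the first node.
    """
    return bisect_left(ring, token) % len(ring)


def distribution(keys, ring=RING_TOKENS):
    """Number of keys landing on each node, in ring order."""
    counts = Counter(owner_index(md5_token(k), ring) for k in keys)
    return [counts.get(i, 0) for i in range(len(ring))]


# Case 1: many distinct keys (synthetic URLs used as the row key).
urls = [f"http://example{i % 1000}.com/page/{i}".encode() for i in range(50_000)]
url_counts = distribution(urls)

# Case 2: low-cardinality keys (only three distinct domain names hashed).
domains = [f"example{i % 3}.com".encode() for i in range(50_000)]
domain_counts = distribution(domains)

print("distinct URL keys:   ", url_counts)
print("three distinct keys: ", domain_counts)
```

Under these assumptions the first case should come out close to even across all five nodes, while the second can touch at most three nodes, which is the low-cardinality effect Dave describes. If real keys behave like the first case but the load still looks like the ring output, the keys being hashed are probably not the plain URLs.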
