cassandra-user mailing list archives

From Dave Brosius <>
Subject Re: RandomPartitioner is providing a very skewed distribution of keys across a 5-node Solandra cluster
Date Sun, 24 Jun 2012 18:21:15 GMT
Well it sounds like this doesn't apply to you.

If you had set up your column family in CQL as .... PRIMARY KEY 
(domain_name, path).... or something like that, and were looking at lots 
and lots of URL pages (domain_name + path) but from a very small number 
of domain_names, then the partition key being just the domain_name could 
account for an uneven distribution.
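
To make that concrete, here is a rough Python sketch of the effect. It is a simplified model and not Cassandra's actual code: it assumes RandomPartitioner derives the token from the absolute value of the key's MD5 digest read as a signed integer, and that a node owns the range ending at its own token. The token list matches the 5-node ring quoted below; the site names, paths, and helper names (token_for, owner_of, load_per_node) are made up for illustration.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# Five evenly spaced tokens, as in the ring output quoted in this thread.
NODE_TOKENS = [i * (2**127 // 5) for i in range(5)]

def token_for(partition_key: str) -> int:
    """Assumed model: token = abs(MD5 digest as a signed big-endian integer)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owner_of(token: int) -> int:
    """Index of the first node whose token is >= the key token, wrapping around."""
    return bisect_left(NODE_TOKENS, token) % len(NODE_TOKENS)

def load_per_node(partition_keys):
    """Count how many keys land on each of the five nodes."""
    counts = Counter(owner_of(token_for(k)) for k in partition_keys)
    return [counts.get(n, 0) for n in range(len(NODE_TOKENS))]

# Hypothetical data: 10,000 pages spread over only two domains.
rows = [(f"site{i % 2}.example.com", f"/page/{i}") for i in range(10_000)]

# Partitioning on domain_name alone: at most two distinct tokens exist.
by_domain = load_per_node(domain for domain, _ in rows)

# Partitioning on the full URL: every row gets its own token.
by_url = load_per_node(domain + path for domain, path in rows)

print("partition key = domain_name:", by_domain)
print("partition key = full URL:   ", by_url)
```

With only two distinct domain_names, at most two nodes can ever receive rows, however many paths there are. With the full URL as the key, the counts should come out roughly equal across all five nodes.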

But it sounds like your key is just a URL so that should (in theory) be 

On 06/24/2012 01:53 PM, Safdar Kureishy wrote:
> Hi Dave,
> Would you mind elaborating a bit more on that, preferably with an 
> example? AFAIK, Solandra uses the unique id of the Solr document as 
> the input for calculating the md5 hash for shard/node assignment. In 
> this case the ids are just millions of varied web URLs that do /not/ 
> adhere to any regular expression. I'm not sure if that answers your 
> question below?
> Thanks,
> Safdar
> On Sun, Jun 24, 2012 at 8:38 PM, Dave Brosius 
> < <>> wrote:
>     If i read what you are saying, you are _not_ using composite keys?
>     That's one thing that could do it, if the first part of the
>     composite key had a very very low cardinality.
>     On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
>>     Hi,
>>     I've searched online but was unable to find any leads for the
>>     problem below. This mailing list seemed the most appropriate
>>     place. Apologies in advance if that isn't the case.
>>     I'm running a 5-node Solandra cluster (Solr + Cassandra). I've
>>     setup the nodes with tokens /evenly distributed across the token
>>     space/, for a 5-node cluster (as evidenced below under the
>>     "effective-ownership" column of the "nodetool ring" output). My
>>     data is a set of a few million crawled web pages, crawled using
>>     Nutch, and also indexed using the "solrindex" command available
>>     through Nutch. AFAIK, the key for each document generated from
>>     the crawled data is the URL.
>>     Based on the "load" values for the nodes below, despite adding
>>     about 3 million web pages to this index via the HTTP Rest API
>>     (e.g.: http://9.9.9.x:8983/solandra/index/update....), some nodes
>>     are still "empty". Specifically, nodes and have
>>     just a few kilobytes (shown in *bold* below) of the index, while
>>     the remaining 3 nodes are consistently getting hammered by all
>>     the data. If the RandomPartitioner (which is what I'm using for
>>     this cluster) is supposed to achieve an even distribution of keys
>>     across the token space, why is it that the data below is skewed
>>     in this fashion? Literally, no key has yet been hashed to the
>>     nodes and below. Could someone possibly shed some
>>     light on this absurdity?
>>     [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
>>     Address  DC           Rack   Status  State   Load        Effective-Owership  Token
>>                                                                                  136112946768375385385349842972707284580
>>              datacenter1  rack1  Up      Normal  7.57 GB     20.00%              0
>>              datacenter1  rack1  Up      Normal  *21.44 KB*  20.00%              34028236692093846346337460743176821145
>>              datacenter1  rack1  Up      Normal  14.99 GB    20.00%              68056473384187692692674921486353642290
>>              datacenter1  rack1  Up      Normal  *50.79 KB*  20.00%              102084710076281539039012382229530463435
>>              datacenter1  rack1  Up      Normal  15.22 GB    20.00%              136112946768375385385349842972707284580
>>     Thanks in advance.
>>     Regards,
>>     Safdar
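
For reference, the five tokens in the ring output above are consistent with the usual even-spacing rule for RandomPartitioner, token_i = i * (2**127 // N). A short Python check, assuming that rule (the variable names are only illustrative):

```python
# Evenly spaced initial tokens for an N-node ring over the range 0 .. 2**127.
NODE_COUNT = 5
step = 2**127 // NODE_COUNT
tokens = [i * step for i in range(NODE_COUNT)]

for i, token in enumerate(tokens):
    print(f"node {i}: {token}")
```

The values it produces match the Token column above, so the ring itself is balanced and the skew has to come from how the keys map to tokens.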
