incubator-cassandra-user mailing list archives

From Safdar Kureishy <safdar.kurei...@gmail.com>
Subject Re: RandomPartitioner is providing a very skewed distribution of keys across a 5-node Solandra cluster
Date Sun, 24 Jun 2012 17:53:43 GMT
Hi Dave,

Would you mind elaborating a bit more on that, preferably with an example?
AFAIK, Solandra uses the unique id of the Solr document as the input for
calculating the md5 hash for shard/node assignment. In this case the ids
are just millions of varied web URLs that do *not* adhere to any regular
expression. I'm not sure if that answers your question below?
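
To make the question concrete, here is a rough Python sketch of how I understand key placement to work. It assumes RandomPartitioner derives the token as the absolute value of the MD5 digest read as a signed big-endian integer, and that each node owns the range from the previous token (exclusive) up to its own token (inclusive). The example.com URLs and the "index~N" keys are made up for illustration and are not Solandra's actual row key format.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# Tokens copied from the "nodetool ring" output quoted below.
RING = [
    (0, "9.9.9.0"),
    (34028236692093846346337460743176821145, "9.9.9.1"),
    (68056473384187692692674921486353642290, "9.9.9.2"),
    (102084710076281539039012382229530463435, "9.9.9.3"),
    (136112946768375385385349842972707284580, "9.9.9.4"),
]
TOKENS = [token for token, _ in RING]


def token_for(key: bytes) -> int:
    """Assumed token rule: abs() of the MD5 digest as a signed integer."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner(key: bytes) -> str:
    """Return the first node whose token is >= the key's token, wrapping around."""
    i = bisect_left(TOKENS, token_for(key))
    return RING[i % len(RING)][1]


def distribution(keys) -> Counter:
    """Count how many keys land on each node."""
    return Counter(owner(k) for k in keys)


# Case 1: one row key per document (hypothetical URLs).
urls = [f"http://example.com/page/{i}".encode() for i in range(100_000)]
by_url = distribution(urls)

# Case 2: only three distinct row keys (hypothetical low-cardinality keys).
shard_keys = [f"index~{i % 3}".encode() for i in range(100_000)]
by_shard = distribution(shard_keys)

print("per-URL keys:        ", dict(by_url))
print("low-cardinality keys:", dict(by_shard))
```

If every URL were hashed individually like this, I would expect each of the five nodes to receive roughly a fifth of the keys. With only a handful of distinct row keys, at most that many nodes can hold data no matter how evenly the tokens are spaced, which is how I read the low-cardinality concern.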

Thanks,
Safdar

On Sun, Jun 24, 2012 at 8:38 PM, Dave Brosius <dbrosius@mebigfatguy.com> wrote:

>  If I read what you are saying correctly, you are _not_ using composite
> keys? That's one thing that could do it, if the first part of the composite
> key had a very low cardinality.
>
>
> On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
>
>  Hi,
>
>  I've searched online but was unable to find any leads for the problem
> below. This mailing list seemed the most appropriate place. Apologies in
> advance if that isn't the case.
>
>  I'm running a 5-node Solandra cluster (Solr + Cassandra). I've set up the
> nodes with tokens *evenly distributed across the token space*, for a
> 5-node cluster (as evidenced below under the "effective-ownership" column
> of the "nodetool ring" output). My data is a set of a few million crawled
> web pages, crawled using Nutch, and also indexed using the "solrindex"
> command available through Nutch. AFAIK, the key for each document generated
> from the crawled data is the URL.
>
>  Based on the "load" values for the nodes below, despite adding about 3
> million web pages to this index via the HTTP Rest API (e.g.:
> http://9.9.9.x:8983/solandra/index/update....), some nodes are still
> "empty". Specifically, nodes 9.9.9.1 and 9.9.9.3 have just a few kilobytes
> (shown in *bold* below) of the index, while the remaining 3 nodes are
> consistently getting hammered by all the data. If the RandomPartitioner
> (which is what I'm using for this cluster) is supposed to achieve an even
> distribution of keys across the token space, why is it that the data below
> is skewed in this fashion? Literally, no key has yet been hashed to
> nodes 9.9.9.1 and 9.9.9.3 below. Could someone possibly shed some light on
> this absurdity?
>
>  [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
> Address  DC           Rack   Status  State   Load         Effective-Owership  Token
>                                                                               136112946768375385385349842972707284580
> 9.9.9.0  datacenter1  rack1  Up      Normal  7.57 GB      20.00%              0
> 9.9.9.1  datacenter1  rack1  Up      Normal  *21.44 KB*   20.00%              34028236692093846346337460743176821145
> 9.9.9.2  datacenter1  rack1  Up      Normal  14.99 GB     20.00%              68056473384187692692674921486353642290
> 9.9.9.3  datacenter1  rack1  Up      Normal  *50.79 KB*   20.00%              102084710076281539039012382229530463435
> 9.9.9.4  datacenter1  rack1  Up      Normal  15.22 GB     20.00%              136112946768375385385349842972707284580
>
>  Thanks in advance.
>
>  Regards,
>  Safdar
>
>
>
