cassandra-user mailing list archives

From Safdar Kureishy <safdar.kurei...@gmail.com>
Subject Re: RandomPartitioner is providing a very skewed distribution of keys across a 5-node Solandra cluster
Date Sun, 24 Jun 2012 17:34:29 GMT
An additional detail: CPU utilization on those nodes is proportional to the
load shown below, so machines 9.9.9.1 and 9.9.9.3 see only a fraction of the
CPU load of the remaining 3 nodes. This further suggests that very few keys
are hashing to the token ranges owned by those nodes. I'm no expert in
cryptography, but is it possible that web URLs are not evenly distributed by
MD5 hashing because of the common prefixes they contain (such as the
"http://" prefix, or perhaps a domain name)? What's also interesting is that
the distribution is more or less even across *alternating* nodes (0, 2, 4
vs. 1, 3).
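
The prefix hypothesis can be checked directly. The following is a minimal
sketch, not Solandra's actual code path: it assumes the row key is the raw
URL (which, per the message below, is only a guess), uses made-up
example.com URLs, and models RandomPartitioner as the absolute value of the
MD5 digest read as a signed big-endian integer. The node tokens are copied
from the ring output quoted below.

```python
import hashlib
from bisect import bisect_left

# Node tokens copied from the "nodetool ring" output quoted in this thread.
TOKENS = [
    0,
    34028236692093846346337460743176821145,
    68056473384187692692674921486353642290,
    102084710076281539039012382229530463435,
    136112946768375385385349842972707284580,
]


def md5_token(key: bytes) -> int:
    """Model of a RandomPartitioner token: |MD5(key)| as a signed integer."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner_index(token: int) -> int:
    """Index of the node owning `token`.

    A node owns the range (previous token, its own token]; anything above
    the largest token wraps around to the first node.
    """
    return bisect_left(TOKENS, token) % len(TOKENS)


def distribution(n: int) -> list[int]:
    """Count how many of n hypothetical URLs land on each node."""
    counts = [0] * len(TOKENS)
    for i in range(n):
        # Hypothetical URLs sharing a long common prefix.
        url = f"http://www.example.com/page/{i}".encode()
        counts[owner_index(md5_token(url))] += 1
    return counts


if __name__ == "__main__":
    for node, count in enumerate(distribution(50_000)):
        print(f"node {node}: {count} keys")
```

Under these assumptions each node should receive close to a fifth of the
keys, shared prefix or not, which would suggest that the row keys actually
being written are something other than raw URLs.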

Thanks,
Safdar


On Sun, Jun 24, 2012 at 6:00 PM, Safdar Kureishy
<safdar.kureishy@gmail.com> wrote:

> Hi,
>
> I've searched online but was unable to find any leads for the problem
> below. This mailing list seemed the most appropriate place. Apologies in
> advance if that isn't the case.
>
> I'm running a 5-node Solandra cluster (Solr + Cassandra). I've set up the
> nodes with tokens *evenly distributed across the token space*, for a
> 5-node cluster (as evidenced below under the "effective-ownership" column
> of the "nodetool ring" output). My data is a set of a few million crawled
> web pages, crawled using Nutch, and also indexed using the "solrindex"
> command available through Nutch. AFAIK, the key for each document generated
> from the crawled data is the URL.
>
> Based on the "load" values for the nodes below, despite adding about 3
> million web pages to this index via the HTTP REST API (e.g.:
> http://9.9.9.x:8983/solandra/index/update....), some nodes are still
> "empty". Specifically, nodes 9.9.9.1 and 9.9.9.3 hold just a few kilobytes
> (shown in *bold* below) of the index, while the remaining 3 nodes are
> consistently getting hammered by all the data. If the RandomPartitioner
> (which is what I'm using for this cluster) is supposed to achieve an even
> distribution of keys across the token space, why is the data below skewed
> in this fashion? Apparently, no key has yet been hashed to nodes 9.9.9.1
> and 9.9.9.3. Could someone possibly shed some light on this absurdity?
>
> [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
> Address  DC           Rack   Status  State   Load        Effective-Owership  Token
>                                                                              136112946768375385385349842972707284580
> 9.9.9.0  datacenter1  rack1  Up      Normal  7.57 GB     20.00%              0
> 9.9.9.1  datacenter1  rack1  Up      Normal  *21.44 KB*  20.00%              34028236692093846346337460743176821145
> 9.9.9.2  datacenter1  rack1  Up      Normal  14.99 GB    20.00%              68056473384187692692674921486353642290
> 9.9.9.3  datacenter1  rack1  Up      Normal  *50.79 KB*  20.00%              102084710076281539039012382229530463435
> 9.9.9.4  datacenter1  rack1  Up      Normal  15.22 GB    20.00%              136112946768375385385349842972707284580
>
> Thanks in advance.
>
> Regards,
> Safdar
>
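
For reference, the token values in the quoted ring output are consistent with
splitting a 2**127 token space into five equal steps. A minimal sketch of that
arithmetic follows; the function name `even_tokens` is made up for
illustration, and the 2**127 ring size is the assumption being checked against
the listed tokens.

```python
RING_SIZE = 2**127  # assumed size of the RandomPartitioner token space


def even_tokens(node_count: int) -> list[int]:
    """Evenly spaced tokens: node i gets i * floor(RING_SIZE / node_count)."""
    step = RING_SIZE // node_count
    return [i * step for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(even_tokens(5)):
        print(f"node {node}: {token}")
```

The five values this produces match the Token column above, so the token
assignment itself does not look like the source of the skew.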
