Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
Message-ID: <4FE75097.7040204@mebigfatguy.com>
Date: Sun, 24 Jun 2012 13:38:31 -0400
From: Dave Brosius <dbrosius@mebigfatguy.com>
User-Agent: Mozilla/5.0 (X11; Linux i686;
 rv:12.0) Gecko/20120430 Thunderbird/12.0.1
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Re: RandomPartitioner is providing a very skewed distribution of
 keys across a 5-node Solandra cluster
References: 
 <CAHvtZbY7Sqg1itVd=hhw3zpPBz-QAFK=QbTBobjURQDEpiDndw@mail.gmail.com>
In-Reply-To: 
 <CAHvtZbY7Sqg1itVd=hhw3zpPBz-QAFK=QbTBobjURQDEpiDndw@mail.gmail.com>
Content-Type: multipart/alternative;
 boundary="------------090307000707070200080501"

This is a multi-part message in MIME format.
--------------090307000707070200080501
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

If I am reading this correctly, you are _not_ using composite keys? 
That is one thing that could cause this kind of skew, if the first part 
of the composite key had very low cardinality.
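
To illustrate the cardinality point, here is a minimal sketch of how a key ends up on a node. It assumes the usual RandomPartitioner scheme (token = absolute value of the MD5 digest read as a signed big-endian integer, owner = first node whose token is >= the key's token, wrapping around). The tokens are copied from the ring output quoted below; the key names are hypothetical, and replication and Solandra's own key layout are not modeled.

```python
import bisect
import hashlib
from collections import Counter

# Tokens and addresses copied from the "nodetool ring" output quoted below.
RING = [
    (0, "9.9.9.0"),
    (34028236692093846346337460743176821145, "9.9.9.1"),
    (68056473384187692692674921486353642290, "9.9.9.2"),
    (102084710076281539039012382229530463435, "9.9.9.3"),
    (136112946768375385385349842972707284580, "9.9.9.4"),
]
TOKENS = [token for token, _ in RING]


def token_for(key: bytes) -> int:
    """Assumed RandomPartitioner token: abs(MD5 digest as a signed integer)."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner(key: bytes) -> str:
    """First node whose token is >= the key's token, wrapping to the first node."""
    index = bisect.bisect_left(TOKENS, token_for(key))
    return RING[index % len(RING)][1]


def distribution(keys) -> Counter:
    """Count how many keys each node would own (primary replica only)."""
    return Counter(owner(key) for key in keys)


# Hypothetical high-cardinality keys: one distinct row key per document URL.
many_keys = [f"http://example.com/page/{i}".encode() for i in range(50000)]

# Hypothetical low-cardinality keys: only three distinct row keys in total.
few_keys = [f"shard-{i % 3}".encode() for i in range(50000)]

print("high cardinality:", dict(distribution(many_keys)))
print("low cardinality: ", dict(distribution(few_keys)))
```

With many distinct row keys every node should receive roughly a fifth of them. With only three distinct row keys, at most three nodes can receive anything, no matter how evenly the tokens are spaced.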

On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
> Hi,
>
> I've searched online but was unable to find any leads for the problem 
> below. This mailing list seemed the most appropriate place. Apologies 
> in advance if that isn't the case.
>
> I'm running a 5-node Solandra cluster (Solr + Cassandra). I've set up 
> the nodes with tokens /evenly distributed across the token space/, for 
> a 5-node cluster (as evidenced below under the "effective-ownership" 
> column of the "nodetool ring" output). My data is a set of a few 
> million web pages, crawled using Nutch and indexed using 
> the "solrindex" command available through Nutch. AFAIK, the key for 
> each document generated from the crawled data is the URL.
>
> Based on the "load" values for the nodes below, despite adding about 3 
> million web pages to this index via the HTTP REST API (e.g.: 
> http://9.9.9.x:8983/solandra/index/update....), some nodes are still 
> "empty". Specifically, nodes 9.9.9.1 and 9.9.9.3 have just a few 
> kilobytes (shown in *bold* below) of the index, while the remaining 3 
> nodes are consistently getting hammered by all the data. If the 
> RandomPartitioner (which is what I'm using for this cluster) is supposed 
> to achieve an even distribution of keys across the token space, why is 
> the data below skewed in this fashion? Literally no key 
> has yet been hashed to nodes 9.9.9.1 and 9.9.9.3 below. Could 
> someone possibly shed some light on this absurdity?
>
> [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
> Address  DC           Rack   Status  State   Load        Effective-Owership  Token
>                                                                              136112946768375385385349842972707284580
> 9.9.9.0  datacenter1  rack1  Up      Normal  7.57 GB     20.00%              0
> 9.9.9.1  datacenter1  rack1  Up      Normal  *21.44 KB*  20.00%              34028236692093846346337460743176821145
> 9.9.9.2  datacenter1  rack1  Up      Normal  14.99 GB    20.00%              68056473384187692692674921486353642290
> 9.9.9.3  datacenter1  rack1  Up      Normal  *50.79 KB*  20.00%              102084710076281539039012382229530463435
> 9.9.9.4  datacenter1  rack1  Up      Normal  15.22 GB    20.00%              136112946768375385385349842972707284580
>
> Thanks in advance.
>
> Regards,
> Safdar

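For what it's worth, the tokens in the ring output above do look evenly spaced, so the token assignment itself is probably not the cause. A quick check, assuming the RandomPartitioner token range runs from 0 to 2**127 and that each node owns the range ending at its own token (the helper names are only for illustration):

```python
RING_SIZE = 2 ** 127  # assumed size of the RandomPartitioner token space

# Tokens copied from the "nodetool ring" output above.
OBSERVED_TOKENS = [
    0,
    34028236692093846346337460743176821145,
    68056473384187692692674921486353642290,
    102084710076281539039012382229530463435,
    136112946768375385385349842972707284580,
]


def even_tokens(node_count: int) -> list:
    """Evenly spaced initial tokens: i * (RING_SIZE // node_count)."""
    step = RING_SIZE // node_count
    return [i * step for i in range(node_count)]


def ownership(tokens: list) -> list:
    """Fraction of the ring owned by each node: (previous token, own token]."""
    ordered = sorted(tokens)
    fractions = []
    for i, token in enumerate(ordered):
        previous = ordered[i - 1]  # for i == 0 this wraps to the last token
        fractions.append(((token - previous) % RING_SIZE) / RING_SIZE)
    return fractions


print(even_tokens(5) == OBSERVED_TOKENS)
print([f"{fraction:.2%}" for fraction in ownership(OBSERVED_TOKENS)])
```

If the computed tokens match and each fraction is about 20%, the skew has to come from how the row keys hash, not from the ring layout.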


--------------090307000707070200080501--