Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of feestend@gmail.com designates
 209.85.214.44 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <4FB3FA15.60204@mebigfatguy.com>
References: 
 <CAPjXCuw7+maeHmf4aQ93thdy3xPs4JbKE6VKFomiNChV4NQZ8Q@mail.gmail.com>
	<63CCA5D3F3175843B5C153AD218C2FBF08E498@MSEXCHM83.morningstar.com>
	<CAPjXCux9FXU89P6OcQGesiL83eaFiUbwR-FoFoY6j5A-AgSF3g@mail.gmail.com>
	<4FB3FA15.60204@mebigfatguy.com>
Date: Wed, 16 May 2012 21:26:57 +0200
Message-ID: 
 <CAPjXCuyUa5YPNnxRBqe9AeSr21jKsT4wco-aZNE6b9v_TMYn1A@mail.gmail.com>
Subject: Re: understanding of native indexes: limitations,
 potential side effects,...
From: David Vanderfeesten <feestend@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=00151759338c78ff3204c02c4da5

--00151759338c78ff3204c02c4da5
Content-Type: text/plain; charset=ISO-8859-1

That matches my understanding, but I still don't see the issue with
high-cardinality columns. In the worst case you get as many rows in the
index CF as in the indexed CF (each index row having one column).
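
To make the two extremes concrete, here is a minimal sketch of how I picture one internal index CF, based only on Dave's description below. It is a toy in-memory model, not Cassandra code: build_index, the "country" column and the sample rows are names I made up for illustration.

```python
from collections import defaultdict

def build_index(source_cf, column):
    """Toy model of one index CF: indexed value -> set of source row keys."""
    index_cf = defaultdict(set)
    for row_key, columns in source_cf.items():
        if column in columns:
            # The index row key is the indexed value; each source row key
            # becomes one column in that index row.
            index_cf[columns[column]].add(row_key)
    return dict(index_cf)

# Low cardinality: every source row has the same value.
same = {f"row{i}": {"country": "BE"} for i in range(5)}
low = build_index(same, "country")

# High cardinality: every source row has a unique value.
unique = {f"row{i}": {"serialNbr": f"SN-{i}"} for i in range(5)}
high = build_index(unique, "serialNbr")

print(len(low), max(len(keys) for keys in low.values()))    # 1 5
print(len(high), max(len(keys) for keys in high.values()))  # 5 1
```

So the low-cardinality case gives one wide index row with N columns, and the high-cardinality case gives N index rows with one column each.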

On Wed, May 16, 2012 at 9:03 PM, Dave Brosius <dbrosius@mebigfatguy.com> wrote:

> Each index you define on the source CF is created using an internal CF
> that has as its key the value of the column it's indexing, and as its
> columns, the keys of all the rows in the source CF that have that
> value. So if all the rows in your source CF have the same value, then your
> index CF will have one row with N columns, one for each of the N rows in
> the original CF.
>
>
>
>
> On 05/16/2012 02:58 PM, David Vanderfeesten wrote:
>
> Txs Jeremiah,
> But I am not sure I follow "number of columns could be equal to number
> of rows". Is a native index implemented as one CF shared over all the
> indexes (one row in the index CF corresponding to one index), or is there
> an internal index CF per index? My (potentially wrong) assumption was the
> latter. In that case, if you index a column with very high cardinality,
> for example serialNbr, the corresponding internal index CF will just end
> up with almost the same number of rows as the original CF containing the
> serialNbr. I can't match that with what you are explaining...
>
> - David
>
> On Wed, May 16, 2012 at 6:23 PM, Jeremiah Jordan <
> JEREMIAH.JORDAN@morningstar.com> wrote:
>
>> The limitation is because the number of columns could be equal to the
>> number of rows. If the number of rows is large this can become an issue.
>>
>> -Jeremiah
>>
>>  ------------------------------
>> *From:* David Vanderfeesten [feestend@gmail.com]
>> *Sent:* Wednesday, May 16, 2012 6:58 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* understanding of native indexes: limitations, potential side
>> effects,...
>>
>>   Hi
>>
>> I would like to better understand the limitations of native indexes,
>> their potential side effects, and the scenarios where they are required.
>>
>> My understanding so far:
>> - Each node stores the index entries for the data held locally on that
>> node.
>> - Indexes do not return values in sorted order (the hashes of the indexed
>> row keys define the order).
>> - Given the design in the first bullet, a coordinator node receiving a
>> read on a native index needs to send the read to multiple nodes (a set of
>> nodes that together cover at least the complete key space, plus
>> potentially more to satisfy the read consistency level).
>> - Each write to an indexed column leads to an additional local read of
>> the index in order to update it (kind of obvious, but easily forgotten
>> when tuning your system for a write-only workload).
>> - When using a WHERE clause in CQL you need to specify at least one
>> equality condition on a natively indexed column. Additional conditions in
>> the WHERE clause are applied as filters by the coordinator node receiving
>> the CQL query.
>> - Native indexes do not handle columns with a high number of discrete
>> values across the entire CF very well.
>>
>> Is the above understanding correct and complete?
>> Some doubts:
>> - About the limitation on indexing columns with a high number of discrete
>> values:
>> I assume native indexes are implemented with an internally managed CF
>> per index. With high-cardinality values, in the worst case, the number of
>> rows in the index is identical to the number of rows in the indexed CF. Or
>> are there other reasons for the limitation and, if so, is there a
>> guideline on the maximum cardinality that is still reasonable?
>> - Are column updates and the corresponding index updates (read + write
>> action) atomic and isolated from concurrent updates?
>>
>> Txs!
>>
>> David
>>
>>
>>
>>
>>
>
>
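
On the write-path bullet in my original post quoted above: the sketch below shows why I expect a read before each write to an indexed column. It assumes the read is needed to find the previous value so that the stale index entry can be removed. That is my reading, not something confirmed in this thread, and write_indexed and the "state" column are hypothetical names on a toy in-memory model.

```python
def write_indexed(source_cf, index_cf, row_key, column, new_value):
    """Update one indexed column and keep a toy index CF in sync."""
    row = source_cf.setdefault(row_key, {})
    old_value = row.get(column)  # the extra local read (assumed purpose)
    if old_value is not None and old_value != new_value:
        keys = index_cf.get(old_value, set())
        keys.discard(row_key)    # drop the stale index entry
        if not keys:
            index_cf.pop(old_value, None)
    row[column] = new_value
    index_cf.setdefault(new_value, set()).add(row_key)

source, index = {}, {}
write_indexed(source, index, "row1", "state", "NEW")
write_indexed(source, index, "row1", "state", "SHIPPED")
print(index)  # {'SHIPPED': {'row1'}}
```

Without the read of the old value, the index would still list row1 under "NEW" after the second write.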

--00151759338c78ff3204c02c4da5--