incubator-cassandra-user mailing list archives

From Kaj Magnus Lindberg <kajmagnu...@gmail.com>
Subject Re: Why no need to query all nodes on secondary index lookup?
Date Tue, 06 Sep 2011 09:31:37 GMT
Hi Martin

Yes that was helpful, thanks

(I had no idea you were reading the Cassandra users list!  :-)  )

Thanks, (Kaj) Magnus L


On Mon, Sep 5, 2011 at 10:57 PM, Martin von Zweigbergk
<martin.von.zweigbergk@gmail.com> wrote:
> Hi Magnus,
>
> I think the answer might be on
> https://issues.apache.org/jira/browse/CASSANDRA-749. For example,
> Jonathan writes:
>
> <quote>
>> Is it worth creating a secondary index that only contains local data, versus a distributed
>> secondary index (a normal ColumnFamily?)
>
> I think my initial reasoning was wrong here. I was anti-local-indexes
> because "we have to query the full cluster for any index lookup, since
> we are throwing away our usual partitioning scheme."
>
> Which is true, but it ignores the fact that, in most cases, you will
> have to "query the full cluster" to get the actual matching rows, b/c
> the indexed rows will be spread across all machines. So, having local
> indexes is better in the common case, since it actually saves a round
> trip from querying the index to querying the rows.
>
> Also, having each node index the rows it has locally means you don't
> have to worry about sharding a very large index since it happens
> automatically.
>
> Finally, it lets us use the local commitlog to keep index + data in sync.
> </quote>
>
> Hope that helps,
> Martin
>
> On Mon, Sep 5, 2011 at 1:52 AM, Kaj Magnus Lindberg
> <kajmagnus79@gmail.com> wrote:
>> Hi,
>>
>> (This is the 2nd time I'm sending this message. I sent it the first
>> time a few days ago but it does not appear in the archives.)
>>
>> I have a follow up question on a question from February 2011. In
>> short, I wonder why one won't have to query all Cassandra nodes when
>> doing a secondary index lookup -- although each node only indexes data
>> that it holds locally.
>>
>> The question and answer were:
>> (http://www.mail-archive.com/user@cassandra.apache.org/msg10506.html)
>> === Question ===
>> As far as I understand, automatic secondary indexes are generated for
>> node-local data.
>> In this case, does a query by secondary index involve all nodes storing part
>> of the column family to get results? So (if I am right), if data is spread
>> across 50 nodes, are all 50 nodes involved in a single query?
>> [...]
>> === Answer ===
>> In practice, local secondary indexes scale to {RF * the limit of a single
>> machine} for -low cardinality- values (ex: users living in a certain state)
>> since the first node is likely to be able to answer your question. This also
>> means they are good for performing filtering for analytics.
>> [...]
>>
>> === Now I wonder ===
>> Why would the first node be likely to be able to answer the question?
>> It stores only index entries for users on that particular machine,
>>     (says http://wiki.apache.org/cassandra/SecondaryIndexes:
>>     "Each node only indexes data that it holds locally" )
>> but users might be stored by user name? And would thus be stored on
>> many different machines? Even if they happen to live in the same
>> state?
>>
>> Why won't the client need to query the indexes of [all servers that
>> store info on users] to find all relevant users, when doing a user
>> property lookup?
>>
>>
>> Best regards, KajMagnus
>>
>
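
For readers of the archive, the point made in the thread can be illustrated with a toy model. This is only a sketch, not Cassandra's code: the names (Node, node_for, build_cluster, indexed_query) are hypothetical, rows are placed by an MD5 hash of the key, and it is assumed that the coordinator visits nodes one at a time and stops as soon as the requested row limit is filled. Under those assumptions, a low-cardinality value (many users in one state) tends to be satisfied by the first node or two, while a rare or missing value forces a scan of more nodes.

```python
import hashlib
from collections import defaultdict


class Node:
    """Toy node: stores rows and indexes only the rows it holds locally."""

    def __init__(self):
        self.rows = {}                   # user key -> state
        self.index = defaultdict(list)   # local index: state -> user keys

    def put(self, key, state):
        self.rows[key] = state
        self.index[state].append(key)


def node_for(key, n_nodes):
    """Pick a node from a hash of the row key (assumed partitioning scheme)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes


def build_cluster(rows, n_nodes):
    """Distribute rows over n_nodes; each node builds its own local index."""
    nodes = [Node() for _ in range(n_nodes)]
    for key, state in rows.items():
        nodes[node_for(key, n_nodes)].put(key, state)
    return nodes


def indexed_query(nodes, state, limit):
    """Visit nodes in order, stopping once `limit` matches are collected.

    Returns the matching keys and the number of nodes contacted.
    """
    matches, contacted = [], 0
    for node in nodes:
        contacted += 1
        matches.extend(node.index.get(state, []))
        if len(matches) >= limit:
            break
    return matches[:limit], contacted


if __name__ == "__main__":
    # Half of the users live in "CA"; every other user has a unique state.
    rows = {f"user{i}": ("CA" if i % 2 == 0 else f"S{i}") for i in range(1000)}
    nodes = build_cluster(rows, n_nodes=10)
    for state in ("CA", "S1", "nowhere"):
        keys, contacted = indexed_query(nodes, state, limit=10)
        print(f"{state}: {len(keys)} rows, {contacted} node(s) contacted")
```

With about 500 "CA" users spread over 10 nodes, each node holds roughly 50 of them, so a query for 10 rows will typically finish early. A query for a value that no row has must contact every node before it can return an empty result.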
