I really don't see the point. Again, suppose a cluster with 3 nodes, where
there is a ColumnFamily that will hold data whose key is basically a word of
2 letters (pretty simple). With a 27-letter alphabet, that makes a total of
729 possible keys.
RandomPartitioner will then tokenize each key and assign it to a node within
the cluster, so each node will handle 243 keys (plus replication, of
course).
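A minimal sketch of the idea (not Cassandra's actual RandomPartitioner, which assigns token *ranges* on a ring; the modulo here is a simplification, and the 27-letter alphabet is an assumption):

```python
import hashlib
import string

NODES = 3

def token(key: str) -> int:
    """MD5-based token for a key, sketching RandomPartitioner's approach."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

def owner(key: str) -> int:
    # Simplification: real Cassandra maps tokens to per-node ranges.
    return token(key) % NODES

# 27-letter alphabet (e.g. Spanish: A..Z plus Ñ) -> 27 * 27 = 729 keys
alphabet = string.ascii_uppercase + "Ñ"
keys = [a + b for a in alphabet for b in alphabet]

counts = [0] * NODES
for k in keys:
    counts[owner(k)] += 1

print(len(keys), counts)  # 729 keys, roughly 243 per node
```

Because MD5 spreads the tokens uniformly, each node ends up with close to a third of the keys.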
OK, now suppose you need to look up the data for key "AG". The node you ask
will use RandomPartitioner to tokenize the key, determine which node is the
coordinator for that key, and ask that node for the data (while asking the
replicas for an MD5 digest of the data to compare). So each node only needs
to search over 1/3 of the stored keys.
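The digest comparison can be sketched like this (hypothetical in-memory replicas, not Cassandra code; the real read path sends a full-data read to one replica and digest reads to the others):

```python
import hashlib

# Hypothetical replica stores: key -> value held by each replica
replicas = [
    {"AG": "some data"},  # replica that serves the full data
    {"AG": "some data"},  # replica that only returns a digest
]

def digest(value: str) -> str:
    """MD5 digest of a stored value, as used in a digest read."""
    return hashlib.md5(value.encode()).hexdigest()

# Coordinator: fetch full data from one replica, digests from the rest.
data = replicas[0]["AG"]
consistent = all(
    digest(replicas[i]["AG"]) == digest(data)
    for i in range(1, len(replicas))
)
print(consistent)  # a mismatch here would trigger read repair
```

The point is that replicas ship a small hash instead of the full row, which keeps the consistency check cheap.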
How do you think an index is implemented? As far as I know, a simple index
is basically a HashTable with the indexed value as the key and the position
as the value. How do you think a search within the index (hashcode) is
implemented?
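That hash-table shape can be sketched in a few lines (a toy secondary index over the example rows quoted below; the field names are from that example, everything else is illustrative):

```python
from collections import defaultdict

# Rows of a ColumnFamily: row key -> columns
rows = {
    "abc": {"username": "phatduckk", "email": "phatduckk@example.com"},
    "def": {"username": "ieure", "email": "ieure@example.com"},
}

# The index: indexed value (email) -> set of row keys holding it
index = defaultdict(set)
for key, cols in rows.items():
    index[cols["email"]].add(key)

# Average O(1) lookup via the hashcode, regardless of row count
print(index["ieure@example.com"])
```

The search is just one hash computation plus a bucket probe, which is exactly why hash lookups stay fast as the table grows.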
I don't know, maybe there is some magic behind indexes (I know there are
some complex indexes built on B-Trees and the like, such as the ones used
in SQL solutions), but I think all of that would only add complexity over a
more straightforward solution. How big would the CF have to be (in terms of
keys) before searches over hashcodes show noticeable latency? And then
consider: if I need to add a new key, what is the cost to the whole
process? Now, assuming you could build the whole B-Tree up front (even for
keys that do not exist yet), how much memory would that cost? There should
be papers discussing this problem somewhere.
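A quick back-of-envelope helps frame the memory question (a plain Python dict standing in for the index, not a B-Tree, and ignoring the key strings themselves; the numbers are only indicative):

```python
import sys

# Rough per-entry overhead of a hash table holding N short keys
N = 100_000
table = {f"k{i:07d}": i for i in range(N)}

per_entry = sys.getsizeof(table) / N  # table structure only, not keys/values
print(f"hash table overhead: ~{per_entry:.0f} bytes per entry")
```

Multiply that by billions of keys and the structure alone runs into tens of gigabytes, before counting the keys themselves, which is why I'd measure before building anything.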
I would definitely do some volume calculations and run some stress tests
over this, at least to be sure there is a problem before attempting any
kind of solution.
PS: This feels like the problem I raised about TTL values, saying
basically that a TTL value past the year 2050 would throw an exception. Who
will be alive after the 2012 doomsday? :)
On Thu, Feb 24, 2011 at 3:18 PM, mcasandra <mohitanchlia@gmail.com> wrote:
>
> What I am trying to ask is: what if there are billions of row keys (e.g.
> abc, def, xyz below) and the client does a lookup/query on one row, say
> xyz (get all columns for row xyz)? Since there are billions of rows to
> look up via the hash mechanism, is it going to be slow? What algorithm
> will be used to retrieve row xyz, which could be anywhere in those
> billion rows on a particular node?
>
> Is it going to help if there is an index on row keys (eg: abc, xyz)?
>
> > UserProfile = { // this is a ColumnFamily
> >     abc: { // this is the key to this Row inside the CF
> >         // now we have an infinite # of columns in this row
> >         username: "phatduckk",
> >         email: "phatduckk@example.com",
> >         phone: "(900) 9766666"
> >     }, // end row
> >     def: { // this is the key to another row in the CF
> >         // now we have another infinite # of columns in this row
> >         username: "ieure",
> >         email: "ieure@example.com",
> >         phone: "(888) 5551212",
> >         age: "66",
> >         gender: "undecided"
> >     },
> > }
> 
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061356.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>
