incubator-cassandra-user mailing list archives

From Günter Ladwig <guenter.lad...@kit.edu>
Subject Storing single rows on multiple nodes
Date Sat, 09 Jul 2011 20:27:37 GMT
Hi all,

We are currently looking at using Cassandra to store highly skewed RDF data. With the indexes
we use, a single row may contain up to 20% of the whole dataset, meaning that it can grow
larger than the disk space available on a single node. According to [1], this limitation is
not likely to change in the future, but I was wondering whether anybody has looked at this problem.


One thing that comes to mind is a simple approach to DHT load balancing [2], where each key is
assigned to one node out of several random alternatives (which means that for reading, all these
nodes have to be queried). This is a bit similar to replication, except, of course, that only
one copy of the data is stored. As this would require changes to the Cassandra code base,
we could instead "simulate" it by randomly choosing one of several predefined suffixes and
appending it to a key before storing it. By modifying a key this way, we could be somewhat sure
that it will be stored at a different node. The first solution (doing this inside Cassandra
itself) would certainly be preferable.
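
The suffix workaround could look roughly like the following Python sketch. It is only an
illustration under stated assumptions: the SUFFIXES list, the function names, and the plain
dict standing in for a column family are all hypothetical, and no Cassandra client API is used.

```python
import random

# Hypothetical, predefined set of suffixes; the number of shards is an assumption.
SUFFIXES = ["#0", "#1", "#2", "#3"]


def shard_keys(row_key):
    """Return every physical key that may hold part of the logical row."""
    return [row_key + suffix for suffix in SUFFIXES]


def write_column(store, row_key, column, value, rng=random):
    """Store one column under a randomly chosen suffixed key.

    `store` is a dict of dicts standing in for a column family (assumption).
    """
    physical_key = row_key + rng.choice(SUFFIXES)
    store.setdefault(physical_key, {})[column] = value
    return physical_key


def read_row(store, row_key):
    """Read the logical row by querying all suffixed keys and merging the results."""
    merged = {}
    for physical_key in shard_keys(row_key):
        merged.update(store.get(physical_key, {}))
    return merged
```

Note that with a purely random choice, writing the same column twice may leave copies under
two different suffixed keys, so this sketch only fits data where each column is written once.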

Any thoughts or experiences? Failing that, maybe someone can give me a pointer into the Cassandra
code base, to the place where something like [2] would have to be implemented.

Cheers,
Günter

[1] http://wiki.apache.org/cassandra/CassandraLimitations
[2] Byers et al.: Simple Load Balancing for Distributed Hash Tables, http://www.springerlink.com/content/r9r4qcqxc2bmfqmr/

--  

Dipl.-Inform. Günter Ladwig

Karlsruhe Institute of Technology (KIT)
Institute AIFB

Englerstraße 11 (Building 11.40, Room 250)
76131 Karlsruhe, Germany
Phone: +49 721 608-47946
Email: guenter.ladwig@kit.edu
Web: www.aifb.kit.edu

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center
of the Helmholtz Association

