incubator-cassandra-user mailing list archives

From aaron morton <>
Subject Re: Data Modeling: How to keep track of arbitrarily inserted column names?
Date Tue, 09 Apr 2013 04:36:16 GMT
If you create a reverse index on all column names, where a single row has a key something
like "the_index" and each column name in it is a column name that has been used elsewhere, you
are approaching the "twitter global timeline anti pattern"(™).

Basically you will end up with a hot row that has to handle 100k inserts a second. It would
be a good idea to do some tests if that is your target throughput. Your design options are
to shard the index, using something simple like hash and mod, or consistent hashing
like C* does.
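
As a rough illustration of the hash and mod option, here is a minimal sketch. It is not tied to any Cassandra client: the index CF is stood in for by an in-memory dict, and the shard count, the "the_index_<n>" row key format, and all function names are assumptions made up for the example.

```python
import zlib

NUM_SHARDS = 16  # assumed shard count; pick it based on load tests


def index_row_key(column_name, num_shards=NUM_SHARDS):
    """Map a column name to one of the index rows using hash and mod."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str
    shard = zlib.crc32(column_name.encode("utf-8")) % num_shards
    return "the_index_%d" % shard


def all_index_row_keys(num_shards=NUM_SHARDS):
    """Every index row that has to be read to list all column names."""
    return ["the_index_%d" % i for i in range(num_shards)]


# In-memory stand-in for the index CF: row key -> set of column names
index_cf = {}


def record_column_name(column_name):
    """Idempotent write: re-recording a known name changes nothing."""
    index_cf.setdefault(index_row_key(column_name), set()).add(column_name)


def distinct_column_names():
    """Read every shard row and merge the results."""
    names = set()
    for row_key in all_index_row_keys():
        names |= index_cf.get(row_key, set())
    return names
```

Note that hashing the column name sends every write for a given name to the same shard row, so one very popular name still concentrates on one row. Hashing the data row key instead would spread those writes, at the cost of the same name appearing in several shards; the merge on read handles that either way.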

Hope that helps. 
Aaron Morton
Freelance Cassandra Consultant
New Zealand


On 6/04/2013, at 7:37 AM, Drew Kutcharian <> wrote:

> One thing I can do is to have a client-side cache of the keys to reduce the number of
> On Apr 5, 2013, at 6:14 AM, Edward Capriolo <> wrote:
>> Since there are few column names, what you can do is this: make a reverse index with a low
>> read repair chance, and be aggressive with compaction. It will be many extra writes but that is
>> Another option is to turn on the row cache and try read before write. It is a good case for
>> the row cache because it is a very small data set.
>> On Thursday, April 4, 2013, Drew Kutcharian <> wrote:
>> > I don't really need to answer "what rows contain column named X", so no need
>> > for a reverse index here. All I want is a distinct set of all the column names, so I can answer
>> > "what are all the available column names"
>> >
>> > On Apr 4, 2013, at 4:20 PM, Edward Capriolo <> wrote:
>> >
>> > Your reverse index of "which rows contain a column named X" will have very wide
>> > rows. You could look at Cassandra's secondary indexing, or possibly look at a Solandra/Solr
>> > approach. Another option is to shift the problem slightly: "which rows have column X
>> > that was added between time y and time z". Remember that with few distinct column names, the reverse
>> > index of column to row is going to be a very big list.
>> >
>> >
>> > On Thu, Apr 4, 2013 at 5:45 PM, Drew Kutcharian <> wrote:
>> >>
>> >> Hi Edward,
>> >> I anticipate that the column names will be reused a lot. For example, key1
>> >> will be in many rows. So I think the number of distinct column names will be much
>> >> smaller than the number of rows. Is there a way to have a separate CF that keeps track of the column
>> >> What I was thinking was to have a separate CF where I write only the column
>> >> name with a null value every time I write a key/value to the main CF. In this case,
>> >> if that column name exists, it will just be overwritten. Now if I wanted to get all the
>> >> column names, I can just query that CF. Not sure if that's the best approach at high
>> >> load (100k inserts a second).
>> >> -- Drew
>> >>
>> >> On Apr 4, 2013, at 12:02 PM, Edward Capriolo <>
>> >>
>> >> You cannot get only the column names (which you are calling keys); you can
>> >> use get_range_slices, which returns all the columns. When you specify an empty byte array (new
>> >> byte[0]) as the start and finish you get back all the columns. From there you can return
>> >> only the column names to the user in a format that you like.
>> >>
>> >>
>> >> On Thu, Apr 4, 2013 at 2:18 PM, Drew Kutcharian <>
>> >>>
>> >>> Hey Guys,
>> >>>
>> >>> I'm working on a project and one of the requirements is to have a schema-free
>> >>> CF where end users can insert arbitrary key/value pairs per row. What would be the best
>> >>> way to know all the "keys" that were inserted (preferably w/o any locking)? For example,
>> >>>
>> >>> Row1 => key1 -> XXX, key2 -> XXX
>> >>> Row2 => key1 -> XXX, key3 -> XXX
>> >>> Row3 => key4 -> XXX, key5 -> XXX
>> >>> Row4 => key2 -> XXX, key5 -> XXX
>> >>> …
>> >>>
>> >>> The query would be "give me all the inserted keys" and the response would
>> >>> be {key1, key2, key3, key4, key5}
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Drew
>> >>>
>> >>
>> >>
>> >
>> >
>> >
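
The client-side cache mentioned in the quoted thread could look roughly like the sketch below. The quoted sentence is cut off, so the goal here (skipping index writes for names this client has already recorded) is an assumption, as are the class name and the `write_index` callable, which stands in for whatever performs the actual write to the index CF.

```python
class ColumnNameTracker:
    """Remembers column names this client has already written to the index."""

    def __init__(self, write_index):
        # write_index: hypothetical callable that writes one column name
        # (with an empty value) to the index CF
        self._write_index = write_index
        self._seen = set()

    def on_insert(self, column_name):
        """Call once per key/value insert; returns True if an index write happened."""
        if column_name in self._seen:
            return False  # already recorded by this client, skip the write
        self._write_index(column_name)
        # Mark as seen only after the write succeeds, so a failed write is retried
        self._seen.add(column_name)
        return True


# Usage: collect the index writes in a list instead of a real CF
writes = []
tracker = ColumnNameTracker(writes.append)
for name in ["key1", "key2", "key1", "key3", "key2"]:
    tracker.on_insert(name)
```

The cache is per client process, so each client still writes every name at least once. That should be harmless because the index write just overwrites the same column, and with few distinct names the steady-state write rate on the index drops to near zero.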
