If you create a reverse index on all column names, where a single row has a key something like "the_index" and each column in it is named after a column name that has been used elsewhere, you are approaching the "Twitter global timeline anti-pattern".

Basically, you will end up with a hot row that has to handle 100k inserts a second. If that is your target throughput, it would be a good idea to test it. Your design options include sharding the index, using either something simple like hash and mod or a consistent hashing scheme like the one C* uses.
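
As a rough illustration of the "hash and mod" option, here is a minimal Python sketch. The shard count, the "the_index:N" row key format, and both function names are assumptions made up for this example; they are not part of Cassandra or any client library. Each column name is hashed to pick one of a fixed number of index rows, so writes for different names are spread over several rows instead of one.

```python
import zlib

# Hypothetical shard count; in practice it would be sized from load tests.
NUM_SHARDS = 16


def index_row_key(column_name, num_shards=NUM_SHARDS):
    """Return the index row key that a column name is written to.

    CRC32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings.
    """
    shard = zlib.crc32(column_name.encode("utf-8")) % num_shards
    return "the_index:%d" % shard


def all_index_row_keys(num_shards=NUM_SHARDS):
    """Return every index row key; a reader fetches all of them and merges."""
    return ["the_index:%d" % i for i in range(num_shards)]
```

To answer "what are all the column names", a reader would fetch every row returned by all_index_row_keys() and take the union of their columns. Note that hashing on the column name still sends every write for one very popular name to the same row, so this spreads load across names, not within a name.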

Hope that helps. 
 
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 6/04/2013, at 7:37 AM, Drew Kutcharian <drew@venarc.com> wrote:

One thing I can do is to have a client-side cache of the keys to reduce the number of updates.
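
A minimal Python sketch of that idea, assuming a hypothetical write_to_index callable that performs the actual insert into the index CF (the class and method names here are invented for illustration):

```python
class SeenColumnNames:
    """Per-client cache of column names already written to the index."""

    def __init__(self, write_to_index):
        # write_to_index is a hypothetical callable that inserts one
        # column name into the index CF.
        self._seen = set()
        self._write_to_index = write_to_index

    def record(self, column_name):
        """Write the name to the index only the first time it is seen.

        Returns True if an index write was issued, False if it was skipped.
        """
        if column_name in self._seen:
            return False
        self._write_to_index(column_name)
        # Cache only after the write returns, so a failed write is retried
        # the next time this name is recorded.
        self._seen.add(column_name)
        return True
```

This sketch is not thread-safe and the set grows without bound, which should be acceptable only if the number of distinct column names stays small. Because writing the same column name again just overwrites it, a cache miss on another client or after a restart costs an extra write but does not affect correctness.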


On Apr 5, 2013, at 6:14 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

Since there are few column names, what you can do is this: make a reverse index, set a low read repair chance, and be aggressive with compaction. It will mean many extra writes, but that is OK.

The other option is to turn on the row cache and try a read before each write. It is a good case for the row cache because it is a very small data set.

On Thursday, April 4, 2013, Drew Kutcharian <drew@venarc.com> wrote:
> I don't really need to answer "what rows contain column named X", so no need for a reverse index here. All I want is a distinct set of all the column names, so I can answer "what are all the available column names"
>
> On Apr 4, 2013, at 4:20 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
> Your reverse index of "which rows contain a column named X" will have very wide rows. You could look at Cassandra's secondary indexing, or possibly at a Solandra/Solr approach. Another option is to shift the problem slightly: "which rows have column X that was added between time y and time z". Remember that with few distinct column names, a reverse index of column to row is going to be a very big list.
>
>
> On Thu, Apr 4, 2013 at 5:45 PM, Drew Kutcharian <drew@venarc.com> wrote:
>>
>> Hi Edward,
>> I anticipate that the column names will be reused a lot. For example, key1 will be in many rows. So I think the number of distinct column names will be much, much smaller than the number of rows. Is there a way to have a separate CF that keeps track of the column names?
>> What I was thinking was to have a separate CF where I write only the column name, with a null value, every time I write a key/value to the main CF. In this case, if that column name already exists, it will just be overwritten. Then, if I wanted to get all the column names, I could just query that CF. I'm not sure if that's the best approach at high load (100k inserts a second).
>> -- Drew
>>
>> On Apr 4, 2013, at 12:02 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>
>> You cannot get only the column name (which you are calling a key). You can use get_range_slices, which returns all the columns. When you specify an empty byte array (new byte[0]) as the start and finish, you get back all the columns. From there you can return only the column names to the user in a format that you like.
>>
>>
>> On Thu, Apr 4, 2013 at 2:18 PM, Drew Kutcharian <drew@venarc.com> wrote:
>>>
>>> Hey Guys,
>>>
>>> I'm working on a project and one of the requirements is to have a schema-free CF where end users can insert arbitrary key/value pairs per row. What would be the best way to know all the "keys" that were inserted (preferably w/o any locking)? For example,
>>>
>>> Row1 => key1 -> XXX, key2 -> XXX
>>> Row2 => key1 -> XXX, key3 -> XXX
>>> Row3 => key4 -> XXX, key5 -> XXX
>>> Row4 => key2 -> XXX, key5 -> XXX
>>>
>>>
>>> The query would be "give me all the inserted keys" and the response would be {key1, key2, key3, key4, key5}
>>>
>>> Thanks,
>>>
>>> Drew
>>>
>>
>>
>
>
>