From: aaron morton
Subject: Re: Data Modeling: How to keep track of arbitrarily inserted column names?
Date: Tue, 9 Apr 2013 16:36:16 +1200
To: user@cassandra.apache.org

If you create a reverse index on all column names, where the single row has a key something like "the_index" and each column name is a column name that has been used elsewhere, you are approaching the "twitter global timeline anti-pattern"(™).

Basically you will end up with a hot row that has to handle 100k inserts a second. It would be a good idea to do some tests if that is your target throughput. Your design options are to consider sharding the index using something simple like hash and mod, or consistent sharding like C* does.
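To make the "hash and mod" option concrete, here is a minimal sketch in Python. It is not Cassandra client code: the function names, the shard count of 16, and the "the_index:N" row key format are assumptions for illustration only. It just shows how a column name could be mapped to one of several index rows so that no single row takes every insert.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count; pick it from your own load tests


def index_row_key(column_name: str, num_shards: int = NUM_SHARDS) -> str:
    """Pick the index row for a column name using hash and mod."""
    # A stable digest is used instead of built-in hash(), which is
    # randomized per process for strings.
    digest = hashlib.sha256(column_name.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"the_index:{shard}"


def all_index_row_keys(num_shards: int = NUM_SHARDS) -> list[str]:
    """Row keys a reader must fetch to rebuild the full set of column names."""
    return [f"the_index:{shard}" for shard in range(num_shards)]
```

The trade-off is on the read side: answering "what are all the column names" now means reading every shard row and merging the results, which should be acceptable if the set of distinct names is small.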

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 6/04/2013, at 7:37 AM, Drew Kutcharian <drew@venarc.com> wrote:

One thing I can do is to have a client-side cache of the keys to reduce the number of updates.
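A rough sketch of that client-side cache, in Python. The class name and the write_to_index callable are hypothetical stand-ins for whatever actually performs the index insert; nothing here is a real Cassandra client API.

```python
class ColumnNameIndexWriter:
    """Skips index writes for column names this client has already recorded."""

    def __init__(self, write_to_index):
        # write_to_index is a hypothetical callable that performs the real index insert.
        self._write_to_index = write_to_index
        self._seen = set()

    def record(self, column_name: str) -> bool:
        """Return True if an index write was issued, False if the cache absorbed it."""
        if column_name in self._seen:
            return False
        self._write_to_index(column_name)
        # Cache only after the write returns, so a failed write is retried next time.
        self._seen.add(column_name)
        return True
```

The cache is per client process, so each client would still write every distinct name once. The set is unbounded and not thread-safe as written, which is only reasonable under the assumption stated in the thread that distinct column names are few.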


On Apr 5, 2013, at 6:14 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

Since there are few column names, what you can do is this: make a reverse index, set a low read repair chance, and be aggressive with compaction. It will be many extra writes, but that is OK.

The other option is to turn on the row cache and try a read before write. It is a good case for the row cache because it is a very small data set.
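The read-before-write idea can be sketched as follows, in Python. A plain dict stands in for the index row, and the function name is made up for illustration; in a real deployment the membership check would be a read that is expected to be served from the row cache.

```python
def record_column_name(index_row: dict, column_name: str) -> bool:
    """Write column_name to the index only if a read shows it is missing."""
    if column_name in index_row:   # the read, ideally answered by the row cache
        return False
    index_row[column_name] = None  # the write: column name with an empty value
    return True
```

Two clients can still race and both write the same name. That should be harmless here, since the write only stores the name with an empty value and a second write simply overwrites the first.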

On Thursday, April 4, 2013, Drew Kutcharian <drew@venarc.com> wrote:
> I don't really need to answer "what rows contain column named X", so no need for a reverse index here. All I want is a distinct set of all the column names, so I can answer "what are all the available column names".
>
> On Apr 4, 2013, at 4:20 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
> Your reverse index of "which rows contain a column named X" will have very wide rows. You could look at Cassandra's secondary indexing, or possibly look at a Solandra/Solr approach. Another option is to shift the problem slightly: "which rows have column X that was added between time y and time z". Remember that with few distinct column names, the reverse index of column to row is going to be a very big list.
>
>
> On Thu, Apr 4, 2013 at 5:45 PM, Drew Kutcharian <drew@venarc.com> wrote:
>>
>> Hi Edward,
>> I anticipate that the column names will be reused a lot. For example, key1 will be in many rows. So I think the number of distinct column names will be much, much smaller than the number of rows. Is there a way to have a separate CF that keeps track of the column names?
>> What I was thinking was to have a separate CF where I write only the column name, with a null value, every time I write a key/value to the main CF. In this case, if that column name already exists, it will just be overwritten. Then if I want to get all the column names, I can just query that CF. Not sure if that's the best approach at high load (100k inserts a second).
>> -- Drew
>>
>> On Apr 4, 2013, at 12:02 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>
>> You cannot get only the column name (which you are calling a key); you can use get_range_slice, which returns all the columns. When you specify an empty byte array (new byte[0]) as the start and finish, you get back all the columns. From there you can return only the column names to the user in a format that you like.
>>
>>
>> On Thu, Apr 4, 2013 at 2:18 PM, Drew Kutcharian <drew@venarc.com> wrote:
>>>
>>> Hey Guys,
>>>
>>> I'm working on a project and one of the requirements is to have a schema-free CF where end users can insert arbitrary key/value pairs per row. What would be the best way to know all the "keys" that were inserted (preferably w/o any locking)? For example,
>>>
>>> Row1 => key1 -> XXX, key2 -> XXX
>>> Row2 => key1 -> XXX, key3 -> XXX
>>> Row3 => key4 -> XXX, key5 -> XXX
>>> Row4 => key2 -> XXX, key5 -> XXX
>>> …
>>>
>>> The query would be "give me all the inserted keys" and the response would be {key1, key2, key3, key4, key5}
>>>
>>> Thanks,
>>>
>>> Drew
>>>
>>
>>
>
>