From: Naresh Yadav
Date: Fri, 10 Jan 2014 17:48:49 +0530
Subject: Re: Help on Designing Cassandra table for my usecase
To: user@cassandra.apache.org

@Thunder
I just came to know about CASSANDRA-4511, which allows indexes on collections and will be part of release 2.1.
I hope that in that case my problem will be solved by changing the table you designed so that the tag column is a set<text> with a secondary index defined on it. Is there any risk of performance problems with this design, keeping in mind the huge data volume?


Naresh
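
For intuition only, here is a small Python sketch of what any index over a set-valued tag column has to do conceptually: map each tag to the rows whose tag set contains it. It is not a description of how Cassandra implements secondary indexes, and the names `rows`, `tag_index`, and `rows_with_all_tags` are made up for this illustration.

```python
from collections import defaultdict

# Each row carries a set of tags, as a set<text> column would.
rows = {
    1: {"metric": "Sales", "period": "Jan-10", "tags": {"U.S.A", "Pen"}, "value": 10},
    2: {"metric": "Sales", "period": "Jan-10", "tags": {"U.S.A", "Pencil"}, "value": 20},
    3: {"metric": "Sales", "period": "Feb-10", "tags": {"India"}, "value": 90},
}

# Inverted index: tag -> ids of the rows whose tag set contains it.
tag_index = defaultdict(set)
for row_id, row in rows.items():
    for tag in row["tags"]:
        tag_index[tag].add(row_id)


def rows_with_all_tags(*tags):
    """Return the ids of rows that contain every requested tag."""
    ids = set(rows)
    for tag in tags:
        ids &= tag_index.get(tag, set())
    return sorted(ids)


print(rows_with_all_tags("U.S.A"))         # [1, 2]
print(rows_with_all_tags("U.S.A", "Pen"))  # [1]
```

The index holds one entry per (tag, row) pair, so in this toy model its size grows with the total number of tags stored, not just with the number of rows.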

On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <nyadav.ait@gmail.com> wrote:
@Thunder thanks for suggesting a design, but my main problem is indexing/querying the dynamic Tag on each row. It is the main context of each row, and most queries will include it.

As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am able to store exactly the same data, and all the queries worked, so Blur allows dynamic indexing of the tag column. BUT by moving away from Cassandra I am losing its strengths, so I am not confident in this decision, as the data will be huge in my case.

Please guide me on this with better suggestions.

Thanks
Naresh

On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <thunder.stumpges@gmail.com> wrote:
Well, I think you have essentially time-series data, which C* should handle well; however, I think your "Tag" column is going to cause trouble. C* does have collection columns, but they are neither indexable nor usable in a WHERE clause. Your example has both the uniqueness of the data (primary key) and query filtering on potentially multiple "Tag" columns. That is not supported in C* AFAIK. If it were a single Tag, it could possibly be an indexed column.

Ignoring that issue with the many different Tags, you could model the table as:

CREATE TABLE metric_data (
  metric text,
  time text,
  period text,
  tag text,
  value int,
  PRIMARY KEY ((metric, time), period, tag)
);
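
To make the layout concrete, here is a minimal in-memory Python sketch of how the table above groups its rows: one partition per (metric, time), with rows inside a partition keyed and sorted by (period, tag). `MetricTable`, `upsert`, and `read` are hypothetical names for this toy model, not Cassandra API, and the toy simply filters on period and tag, while a real cluster applies stricter rules about which clustering columns a query may restrict.

```python
from collections import defaultdict


class MetricTable:
    """Toy model of PRIMARY KEY ((metric, time), period, tag)."""

    def __init__(self):
        # partition key (metric, time) -> {clustering key (period, tag): value}
        self.partitions = defaultdict(dict)

    def upsert(self, metric, time, period, tag, value):
        # Writing the same full primary key again replaces the old value.
        self.partitions[(metric, time)][(period, tag)] = value

    def read(self, metric, time, period=None, tag=None):
        # The partition key is mandatory; period and tag only narrow the result.
        rows = self.partitions.get((metric, time), {})
        return [
            (p, t, v)
            for (p, t), v in sorted(rows.items())
            if (period is None or p == period) and (tag is None or t == tag)
        ]


table = MetricTable()
table.upsert("Sales", "Month", "Jan-10", "U.S.A", 10)
table.upsert("Sales", "Month", "Jan-10", "U.S.A", 15)  # same key: overwrite
table.upsert("Sales", "Month", "Feb-10", "India", 90)
print(table.read("Sales", "Month"))
# [('Feb-10', 'India', 90), ('Jan-10', 'U.S.A', 15)]
```

The second write to the same key replaces the first, which matches the "overwrite" requirement in the original question. Because period is stored as text, rows sort as strings, which is why Feb-10 comes before Jan-10 here.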

That would make a composite partition key on metric and time, meaning you'd always have to pass those (or else randomly page via TOKEN through all rows). After specifying metric and time, you could optionally also specify period and/or tag, and results would be ordered (clustered) by period. This would satisfy your queries a, b, and d but not c (as you did not specify time). If Time is a granularity column, does it even make sense to return records across differing time values? What does it mean to return the 4 month rows and 1 year row in your example? Could you issue N queries in this case (where N is the small number of your time granularities)?
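
The "issue N queries" idea could look roughly like the Python sketch below: one single-partition read per time granularity, with the results filtered by tag and merged on the client. `fetch_partition` is a hypothetical stand-in for a real query against the table, and the granularity list (Week, Month, Year) is an assumption taken from the sample data.

```python
# Sample rows as (metric, time, period, tag, value); one row per tag,
# matching the single `tag text` column in the table above.
ROWS = [
    ("Sales", "Month", "Jan-10", "U.S.A", 10),
    ("Sales", "Month", "Feb-10", "U.S.A", 30),
    ("Sales", "Month", "Feb-10", "India", 90),
    ("Sales", "Year", "2010", "U.S.A", 70),
    ("Cost", "Year", "2010", "CPU", 8000),
]


def fetch_partition(metric, time):
    """Stand-in for one single-partition query (hypothetical helper)."""
    return [r for r in ROWS if r[0] == metric and r[1] == time]


def query_by_tag(metric, tag, granularities=("Week", "Month", "Year")):
    """Query c: one read per time granularity, then filter and merge."""
    result = []
    for time in granularities:
        result.extend(r for r in fetch_partition(metric, time) if r[3] == tag)
    return result


print(query_by_tag("Sales", "U.S.A"))
```

In this sketch the tag filter runs on the client after each partition has been read, so the cost is tied to the size of the partitions rather than to the number of matching rows.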

I'm not sure how close that gets you, or if you can rework your concept of Tag at all.
Good luck.
Thunder



On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <hkroger@gmail.com> wrote:
To my eye that looks something like what traditional analytics systems do. You can check out e.g. Acunu Analytics, which uses Cassandra as a backend.

Cheers,
Hannu


2014/1/9 Naresh Yadav <nyadav.ait@gmail.com>
Hi all,

I have a use case with huge data which I am not able to design in Cassandra.

Table name : MetricResult

Sample Data :

Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100

So in the above case I have:
        Time-series data, i.e. the Time and Period columns
        Dynamic columns, i.e. the Tag column
        Indexing on dynamic columns, i.e. the Tag column
        Aggregations: SUM, AVERAGE
        If the same value comes again for a Metric, Time, Period, Tag, then overwrite it

Queries I need to support :
--------------------------------------
a) Give data for Metric=Sales AND Time=Month
       O/P : 5 rows
b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
       O/P : 2 rows
c) Give data for Metric=Sales AND Tag=U.S.A
       O/P : 5 rows
d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
       O/P : 1 row
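
As a cross-check of the expected row counts, the four queries can be run over the ten sample rows with a plain Python filter. This is only an in-memory sketch of the query semantics (all listed tags must be present on a row), not a Cassandra query; `SAMPLE` and `select` are names made up for this illustration.

```python
# The ten sample rows: (metric, time, period, tags, value).
SAMPLE = [
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pen"}, 10),
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pencil"}, 20),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pen"}, 30),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pencil"}, 10),
    ("Sales", "Month", "Feb-10", {"India"}, 90),
    ("Sales", "Year", "2010", {"U.S.A"}, 70),
    ("Cost", "Year", "2010", {"CPU"}, 8000),
    ("Cost", "Year", "2010", {"RAM"}, 4000),
    ("Cost", "Year", "2011", {"CPU"}, 9000),
    ("Resource", "Week", "Week1-2013", set(), 100),
]


def select(metric, time=None, period=None, tags=()):
    """Filter by metric plus any optional predicates; all tags must match."""
    return [
        row for row in SAMPLE
        if row[0] == metric
        and (time is None or row[1] == time)
        and (period is None or row[2] == period)
        and set(tags) <= row[3]
    ]


print(len(select("Sales", time="Month")))                          # a) 5
print(len(select("Sales", time="Month", period="Jan-10")))         # b) 2
print(len(select("Sales", tags=["U.S.A"])))                        # c) 5
print(len(select("Sales", period="Jan-10", tags=["U.S.A", "Pen"])))  # d) 1
print(sum(row[4] for row in select("Sales", time="Month")))        # SUM for a) 160
```

The counts agree with the O/P values listed for a) through d), and the last line shows the SUM aggregation applied to the result of query a).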


This table can have TBs of data, and a single Metric, Period combination can have millions of rows.

Please suggest how to design/model this table in Cassandra. If there is some limitation in Cassandra, then suggest the best technology to handle this.


Thanks
Naresh




