cassandra-user mailing list archives

From Thunder Stumpges <thunder.stump...@gmail.com>
Subject Re: Help on Designing Cassandra table for my usecase
Date Fri, 10 Jan 2014 15:01:23 GMT
It does sound like that could work for you. From the sample data it doesn't look like tag will
be high cardinality (relative to the number of rows), so as long as you won't have rows with too
many tags it should be workable (collections are best kept small; they claim a collection can hold
hundreds of elements, but must not exceed 64k). That said, I don't have any experience with
secondary indexes under load, and definitely not with indexes on collections.

Looks promising though!
Good luck,
Thunder
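
For reference, a minimal in-memory sketch (plain Python, not Cassandra) of how a set-valued tag column would answer the four queries from the original post. The tuple layout and the `query` helper are assumptions made up for illustration; a real table would rely on the collection index from CASSANDRA-4511, and this says nothing about primary keys or performance.

```python
# Sample rows from the original post: (metric, time, period, tags, value).
# Tags are modeled as one set per row, mirroring the proposed set<text> column.
ROWS = [
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pen"}, 10),
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pencil"}, 20),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pen"}, 30),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pencil"}, 10),
    ("Sales", "Month", "Feb-10", {"India"}, 90),
    ("Sales", "Year", "2010", {"U.S.A"}, 70),
    ("Cost", "Year", "2010", {"CPU"}, 8000),
    ("Cost", "Year", "2010", {"RAM"}, 4000),
    ("Cost", "Year", "2011", {"CPU"}, 9000),
    ("Resource", "Week", "Week1-2013", set(), 100),
]


def query(metric, time=None, period=None, tags=()):
    """Return rows matching the metric, the optional time/period, and ALL given tags."""
    wanted = set(tags)
    return [
        row for row in ROWS
        if row[0] == metric
        and (time is None or row[1] == time)
        and (period is None or row[2] == period)
        and wanted <= row[3]  # every requested tag must be in the row's tag set
    ]


print(len(query("Sales", time="Month")))                         # query a
print(len(query("Sales", time="Month", period="Jan-10")))        # query b
print(len(query("Sales", tags=["U.S.A"])))                       # query c
print(len(query("Sales", period="Jan-10", tags=["U.S.A", "Pen"])))  # query d
```

The row counts should line up with the O/P values listed for queries a to d in the original post.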



> On Jan 10, 2014, at 5:02 AM, Naresh Yadav <nyadav.ait@gmail.com> wrote:
> 
> @vivek thanks for pointing that out. Other than the primary key, I am defining only one secondary
> index, on tags. In my case the same tags will certainly repeat across periods for a given
> metric (e.g. Sales), and to some extent, though not always, the same set of tags can also
> repeat across metrics such as Sales and Cost.
> 
> 
> Thanks
> Naresh
> 
> 
>> On Fri, Jan 10, 2014 at 6:05 PM, Vivek Mishra <mishra.vivs@gmail.com> wrote:
>> @Naresh
>> Too many indices, or indices with high cardinality, should be discouraged: they are always
>> performance issues. A set will not contain duplicate values.
>> 
>> -Vivek
>> 
>> 
>>> On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <nyadav.ait@gmail.com> wrote:
>>> @Thunder
>>> I just came to know about CASSANDRA-4511, which allows indexes on collections
>>> and will be part of release 2.1.
>>> I hope that in that case my problem will be solved by changing the table you designed
>>> so that the tag column is a set<text> with a secondary index defined on it. Is there any risk of
>>> performance problems with this design, keeping in mind the huge data volume?
>>> 
>>> 
>>> Naresh
>>> 
>>>> On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <nyadav.ait@gmail.com> wrote:
>>>> @Thunder thanks for suggesting a design, but my main problem is indexing/querying
>>>> the dynamic Tag on each row. That is the main context of each row, and most of the queries
>>>> will include it.
>>>> 
>>>> As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am able
>>>> to store exactly the same data, and all the queries also worked, so Blur allows dynamic indexing
>>>> of the tag column. BUT by moving away from Cassandra I am losing its strengths, and because of
>>>> that I am not confident in this decision, as the data will be huge in my case.
>>>> 
>>>> Please guide me on this with better suggestions.
>>>> 
>>>> Thanks
>>>> Naresh
>>>> 
>>>>> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <thunder.stumpges@gmail.com> wrote:
>>>>> Well, I think you have essentially time-series data, which C* should handle
>>>>> well; however, I think your "Tag" column is going to cause trouble. C* does have collection
>>>>> columns, but they are not indexable, nor usable in a WHERE clause. Your example has both the
>>>>> uniqueness of the data (primary key) and query filtering on potentially multiple "Tag" columns.
>>>>> That is not supported in C* AFAIK. If it were a single Tag, that could possibly be a column
>>>>> that is indexed.
>>>>> 
>>>>> Ignoring that issue with the many different Tags, you could model the table as:
>>>>> 
>>>>> CREATE TABLE metric_data (
>>>>>   metric text,
>>>>>   time text,
>>>>>   period text,
>>>>>   tag text,
>>>>>   value int,
>>>>>   PRIMARY KEY ((metric, time), period, tag)
>>>>> );
>>>>> 
>>>>> That would make a composite partition key on metric and time, meaning
>>>>> you'd always have to pass those (or else randomly page via TOKEN through all rows). After
>>>>> specifying metric and time, you could optionally also specify period and/or tag, and results
>>>>> would be ordered (clustered) by period. This would satisfy your queries a, b, and d but not
>>>>> c (as you did not specify time). If Time is a granularity column, does it even make sense
>>>>> to return records across differing time values? What does it mean to return the 4 month rows
>>>>> and 1 year row in your example? Could you issue N queries in this case (where N is the small
>>>>> number of your time granularities)?
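
A rough sketch of the "issue N queries" idea above, in plain Python rather than against a real cluster. Rows are grouped by the (metric, time) partition key, and query c is answered by reading one partition per time granularity. The `read_partition` and `rows_for_tag` helpers, the list of granularities, and keeping the tags as a set per row are all assumptions for illustration; the tag filter here runs client-side.

```python
from collections import defaultdict

# Sales rows from the sample data: (metric, time, period, tags, value).
# Keeping tags as a set per row is an assumption made only for this sketch.
SAMPLE = [
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pen"}, 10),
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pencil"}, 20),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pen"}, 30),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pencil"}, 10),
    ("Sales", "Month", "Feb-10", {"India"}, 90),
    ("Sales", "Year", "2010", {"U.S.A"}, 70),
]

# Group rows by the composite partition key (metric, time).
partitions = defaultdict(list)
for metric, time, period, tags, value in SAMPLE:
    partitions[(metric, time)].append((period, tags, value))


def read_partition(metric, time):
    """Stand-in for one query that names the full partition key."""
    return partitions.get((metric, time), [])


def rows_for_tag(metric, tag, granularities=("Month", "Year", "Week")):
    """Answer query c with N partition reads, one per time granularity."""
    hits = []
    for time in granularities:
        for period, tags, value in read_partition(metric, time):
            if tag in tags:  # client-side filter on the tag
                hits.append((time, period, value))
    return hits


for hit in rows_for_tag("Sales", "U.S.A"):
    print(hit)
```

With the sample data this yields the month rows and the year row for U.S.A, each tagged with the granularity it came from, so the caller can decide whether mixing them makes sense.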
>>>>> 
>>>>> I'm not sure how close that gets you, or if you can re-work your concept of Tag at all.
>>>>> Good luck.
>>>>> Thunder
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <hkroger@gmail.com> wrote:
>>>>>> To my eye that looks like something the traditional analytics systems
>>>>>> do. You can check out e.g. Acunu Analytics, which uses Cassandra as a backend.
>>>>>> 
>>>>>> Cheers,
>>>>>> Hannu
>>>>>> 
>>>>>> 
>>>>>> 2014/1/9 Naresh Yadav <nyadav.ait@gmail.com>
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I have a use case with huge data which I am not able to design in Cassandra.
>>>>>>> 
>>>>>>> Table name : MetricResult      
>>>>>>> 
>>>>>>> Sample Data :
>>>>>>> 
>>>>>>> Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
>>>>>>> Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
>>>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
>>>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
>>>>>>> Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
>>>>>>> Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
>>>>>>> Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
>>>>>>> Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
>>>>>>> Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
>>>>>>> Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100
>>>>>>> 
>>>>>>> So in the above case I have:
>>>>>>>          Time-series data, i.e. the Time and Period columns
>>>>>>>          Dynamic columns, i.e. the Tag column
>>>>>>>          Indexing on dynamic columns, i.e. the Tag column
>>>>>>>          Aggregations: SUM, AVERAGE
>>>>>>>          Overwrites: if a value arrives again for the same Metric, Time, Period and Tag, it replaces the old one
>>>>>>> 
>>>>>>> Queries i need to support :
>>>>>>> --------------------------------------
>>>>>>> a)Give data for Metric=Sales AND Time=Month
>>>>>>>        O/P : 5 rows
>>>>>>> b)Give data for Metric=Sales AND Time=Month AND Period=Jan-10
>>>>>>>        O/P : 2 rows
>>>>>>> c)Give data for Metric=Sales AND Tag=U.S.A
>>>>>>>        O/P : 5 rows
>>>>>>> d)Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>>>>>>        O/P :1 row
>>>>>>> 
>>>>>>> 
>>>>>>> This table can have TBs of data, and a given Metric, Period can have millions of rows.
>>>>>>> 
>>>>>>> Please suggest how to design/model this table in Cassandra.
>>>>>>> If there is some limitation in Cassandra, then please suggest the best technology to handle this.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Naresh
> 
> 
> 
