incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apoorva Gaurav <apoorva.gau...@myntra.com>
Subject Re: Read performance in map data type
Date Sat, 29 Mar 2014 11:43:11 GMT
Hello Sourabh,

I'd prefer to do query like select * from marks_table where studentID = ?
and subjectID in (?, ?, ?....?) but if its costly then can happily delegate
the responsibility to the application layer.

Haven't tried 2.x java driver for this specific issue but tried it once
earlier and found the performance slower than 1.x; isn't so?


On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal <iitr.sourabh@gmail.com>wrote:

> Hi Apoorva,
>
> Do you always query on studentID only or do you need to query on both
> studentID and subjectID?
>
> Also, I think using the latest driver (2.x) can make querying large number
> of rows efficient.
> http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
>
>
>
>
> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <apoorva.gaurav@myntra.com
> > wrote:
>
>> Hello Shrikar,
>>
>> Yes primary key is (studentID, subjectID). I had dropped the test table,
>> recreating and populating it post which will share the cfhistogram. In such
>> case is there any practical limit on the rows I should fetch, for e.g.
>> should I do
>>        select * form marks_table where studentID = ? limit 500;
>> instead of doing
>>        select * form marks_table where studentID = ?;
>>
>>
>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com>wrote:
>>
>>> Hi Apoorva,
>>>
>>> I assume this is the table with studentId and subjectId  as primary keys
>>> and not other like like marks in that.
>>>
>>> create table marks_table(studentId int, subjectId int, marks int,
>>> PRIMARY KEY(studentId,subjectId));
>>>
>>> Also could you give the cfhistogram stats?
>>>
>>> nodetool cfhistograms <your keyspace> marks_table;
>>>
>>>
>>>
>>> Thanks,
>>> Shrikar
>>>
>>>
>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>> apoorva.gaurav@myntra.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> We've a schema which can be modeled as (studentID, subjectID, marks)
>>>> where combination of studentID and subjectID is unique. Number of studentID
>>>> can go up to 100 million and for each studentID we can have up to  10k
>>>> subjectIDs.
>>>>
>>>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
>>>> are using a four node cluster, each having 24 cores and 32GB memory. I'm
>>>> sure that the machines are not underperformant as on same test bed we've
>>>> consistently received <5ms response times for ~1b documents when queried
>>>> via primary key.
>>>>
>>>> I've tried three approaches, all of which result in significant
>>>> deterioration (>500 ms response time) in read query performance once number
>>>> of subjectIDs goes past ~100 for a studentID. Approaches are :-
>>>>
>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>> int>) and query by subjectID
>>>>
>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID) and query as select * from marks_table where
>>>> studentID = ?
>>>>
>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID) and query as select * from marks_table where
>>>> studentID = ? and subjectID in (?, ?, ?....?)  number of subjectIDs in
>>>> query being ~1K.
>>>>
>>>> What can be the bottlenecks. Is it better if we model as (studentID
>>>> int, subjct_marks_json text) and query by studentID.
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Apoorva
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>
>
> --
> Sourabh Agrawal
> Bangalore
> +91 9945657973
>



-- 
Thanks & Regards,
Apoorva

Mime
View raw message