incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apoorva Gaurav <apoorva.gau...@myntra.com>
Subject Re: Read performance in map data type
Date Sat, 29 Mar 2014 02:41:50 GMT
Hello Shrikar,

Yes primary key is (studentID, subjectID). I had dropped the test table,
recreating and populating it post which will share the cfhistogram. In such
case is there any practical limit on the rows I should fetch, for e.g.
should I do
       select * form marks_table where studentID = ? limit 500;
instead of doing
       select * form marks_table where studentID = ?;


On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:

> Hi Apoorva,
>
> I assume this is the table with studentId and subjectId  as primary keys
> and not other like like marks in that.
>
> create table marks_table(studentId int, subjectId int, marks int, PRIMARY
> KEY(studentId,subjectId));
>
> Also could you give the cfhistogram stats?
>
> nodetool cfhistograms <your keyspace> marks_table;
>
>
>
> Thanks,
> Shrikar
>
>
> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com
> > wrote:
>
>> Hello All,
>>
>> We've a schema which can be modeled as (studentID, subjectID, marks)
>> where combination of studentID and subjectID is unique. Number of studentID
>> can go up to 100 million and for each studentID we can have up to  10k
>> subjectIDs.
>>
>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
>> are using a four node cluster, each having 24 cores and 32GB memory. I'm
>> sure that the machines are not underperformant as on same test bed we've
>> consistently received <5ms response times for ~1b documents when queried
>> via primary key.
>>
>> I've tried three approaches, all of which result in significant
>> deterioration (>500 ms response time) in read query performance once number
>> of subjectIDs goes past ~100 for a studentID. Approaches are :-
>>
>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>> int>) and query by subjectID
>>
>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>> KEY(studentID, subjectID) and query as select * from marks_table where
>> studentID = ?
>>
>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>> KEY(studentID, subjectID) and query as select * from marks_table where
>> studentID = ? and subjectID in (?, ?, ?....?)  number of subjectIDs in
>> query being ~1K.
>>
>> What can be the bottlenecks. Is it better if we model as (studentID int,
>> subjct_marks_json text) and query by studentID.
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>


-- 
Thanks & Regards,
Apoorva

Mime
View raw message