incubator-cassandra-user mailing list archives

From Apoorva Gaurav <apoorva.gau...@myntra.com>
Subject Re: Read performance in map data type
Date Thu, 03 Apr 2014 07:20:11 GMT
On the client side we are seeing a latency of ~350ms. We are using
DataStax driver 2.0.0 with the fetch size set to 500, and this latency
occurs while reading rows that have ~200 columns.
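
For reference, the clustering-key model quoted further down and a bounded
read of one student's rows can be sketched in CQL roughly as follows. This is
only a minimal sketch under assumptions: the table and column names come from
the thread, while the LIMIT of 500 and the "subjectID > ?" predicate used for
manual paging are illustrative choices, not something measured here.

```cql
-- Schema as quoted in the thread: studentID is the partition key,
-- subjectID the clustering column, so one student's rows are stored together.
CREATE TABLE marks_table (
    studentID int,
    subjectID int,
    marks     int,
    PRIMARY KEY (studentID, subjectID)
);

-- Bounded read: at most 500 rows for one student (500 is an assumed value).
SELECT subjectID, marks FROM marks_table WHERE studentID = ? LIMIT 500;

-- Assumed manual paging: continue from the last subjectID already seen.
SELECT subjectID, marks FROM marks_table
 WHERE studentID = ? AND subjectID > ? LIMIT 500;
```

Selecting only the needed columns and bounding each read keeps the size of a
single response predictable; whether it helps the latency reported here would
still need to be measured.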


On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak <shrikar84@gmail.com> wrote:

> Hi Apoorva,
> As per the cfhistogram there are some rows which have more than 75k
> columns and around 150k reads hit 2 SStables.
>
> Are you sure that you are seeing more than 500ms latency?  The cfhistogram
> shows that the worst read performance was around 51ms,
> which looks reasonable with many reads hitting 2 sstables.
>
> Thanks,
> Shrikar
>
>
> On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com>
> wrote:
>
>> Hello Shrikar,
>>
>> We are still facing the read latency issue; here is the histogram:
>> http://pastebin.com/yEvMuHYh
>>
>>
>> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <
>> apoorva.gaurav@myntra.com> wrote:
>>
>>> Hello Shrikar,
>>>
>>> Yes, the primary key is (studentID, subjectID). I had dropped the test table;
>>> I am recreating and populating it, after which I will share the cfhistogram.
>>> In that case, is there any practical limit on the rows I should fetch? E.g.,
>>> should I do
>>>        select * from marks_table where studentID = ? limit 500;
>>> instead of doing
>>>        select * from marks_table where studentID = ?;
>>>
>>>
>>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
>>>
>>>> Hi Apoorva,
>>>>
>>>> I assume this is the table with studentId and subjectId as the primary
>>>> key, and that other columns such as marks are not part of it.
>>>>
>>>> create table marks_table(studentId int, subjectId int, marks int,
>>>> PRIMARY KEY(studentId,subjectId));
>>>>
>>>> Also could you give the cfhistogram stats?
>>>>
>>>> nodetool cfhistograms <your keyspace> marks_table;
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Shrikar
>>>>
>>>>
>>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>>> apoorva.gaurav@myntra.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> We have a schema which can be modeled as (studentID, subjectID, marks),
>>>>> where the combination of studentID and subjectID is unique. The number of
>>>>> studentIDs can go up to 100 million, and for each studentID we can have up
>>>>> to 10k subjectIDs.
>>>>>
>>>>> We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4 on a
>>>>> four-node cluster, each node having 24 cores and 32GB memory. I'm
>>>>> sure that the machines are not underperformant, as on the same test bed we've
>>>>> consistently received <5ms response times for ~1b documents when queried
>>>>> via primary key.
>>>>>
>>>>> I've tried three approaches, all of which result in significant
>>>>> deterioration (>500 ms response time) in read query performance once the
>>>>> number of subjectIDs goes past ~100 for a studentID. The approaches are:
>>>>>
>>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>>> int>) and query by subjectID
>>>>>
>>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>>> studentID = ?
>>>>>
>>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>>> studentID = ? and subjectID in (?, ?, ?....?), with the number of
>>>>> subjectIDs in the query being ~1K.
>>>>>
>>>>> What could the bottlenecks be? Would it be better to model this as
>>>>> (studentID int, subject_marks_json text) and query by studentID?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Apoorva
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Apoorva
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>


-- 
Thanks & Regards,
Apoorva
