What about the client-side socket limits? What are the Cassandra client-side maximum connections per host and the read consistency level?

~Shrikar


On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
On the client side we are seeing a latency of ~350 ms. We are using DataStax driver 2.0.0 and have set the fetch size to 500. These latencies occur while reading rows that have ~200 columns.


On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak <shrikar84@gmail.com> wrote:
Hi Apoorva,
As per the cfhistograms output, some rows have more than 75k columns, and around 150k reads hit 2 SSTables.

Are you sure that you are seeing more than 500 ms latency? The cfhistograms output shows that the worst read latency was around 51 ms,
which looks reasonable with many reads hitting 2 SSTables.

Thanks,
Shrikar


On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello Shrikar,

We are still facing the read latency issue; here is the histogram: http://pastebin.com/yEvMuHYh


On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello Shrikar,

Yes, the primary key is (studentID, subjectID). I had dropped the test table; I am recreating and populating it, after which I will share the cfhistograms output. In that case, is there any practical limit on the number of rows I should fetch? For example,
should I do
       select * from marks_table where studentID = ? limit 500;
instead of doing
       select * from marks_table where studentID = ?;
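
One way to bound each read, rather than choosing between a single limit and an unbounded select, is to page through the partition by clustering key: ask for the next 500 subjectIDs after the last one seen. Below is a minimal Python sketch of that loop. It is not the driver's own paging; `iter_marks`, `make_fake_fetch` and `fetch_page` are hypothetical names, and the in-memory fetch only stands in for running the CQL shown in the comment against a real cluster.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[int, int]  # (subjectID, marks)

# CQL a real client would run for every page after the first (reference only;
# the first page would drop the "subjectID > ?" restriction):
#   SELECT subjectID, marks FROM marks_table
#   WHERE studentID = ? AND subjectID > ? LIMIT ?

FetchPage = Callable[[int, Optional[int], int], List[Row]]


def iter_marks(fetch_page: FetchPage, student_id: int,
               page_size: int = 500) -> Iterator[Row]:
    """Yield all rows of one studentID partition, one bounded page at a time."""
    last_subject: Optional[int] = None
    while True:
        page = fetch_page(student_id, last_subject, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short (or empty) page means the partition is exhausted
        last_subject = page[-1][0]  # resume after the last subjectID seen


def make_fake_fetch(partition: List[Row]) -> FetchPage:
    """Hypothetical in-memory stand-in for the query; rows sorted by subjectID."""
    def fetch_page(student_id: int, after: Optional[int], limit: int) -> List[Row]:
        rows = [r for r in partition if after is None or r[0] > after]
        return rows[:limit]
    return fetch_page


if __name__ == "__main__":
    partition = [(subject, subject % 100) for subject in range(1200)]
    rows = list(iter_marks(make_fake_fetch(partition), student_id=42))
    print(len(rows))
```

Whether this helps the latency discussed in this thread is not established here; it only keeps each individual request to at most 500 rows.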


On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
Hi Apoorva,

I assume this is the table with studentId and subjectId as the primary key, and no other column (such as marks) in the key.

create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId));

Also, could you share the cfhistograms stats?

nodetool cfhistograms <your keyspace> marks_table;



Thanks,
Shrikar


On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello All,

We have a schema that can be modeled as (studentID, subjectID, marks), where the combination of studentID and subjectID is unique. The number of studentIDs can go up to 100 million, and for each studentID we can have up to 10k subjectIDs.
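
For a rough sense of scale, the upper bounds above can be multiplied out. The short Python sketch below assumes one CQL row per (studentID, subjectID) pair, and borrows the fetch size of 500 that is mentioned elsewhere in this thread; the constant names are illustrative only.

```python
import math

MAX_STUDENTS = 100_000_000        # "up to 100 million" studentIDs
MAX_SUBJECTS_PER_STUDENT = 10_000  # "up to 10k" subjectIDs per studentID
FETCH_SIZE = 500                   # fetch size quoted elsewhere in the thread

# Upper bound on the total number of (studentID, subjectID) rows.
max_rows = MAX_STUDENTS * MAX_SUBJECTS_PER_STUDENT

# Pages needed to read one fully populated studentID at that fetch size.
pages_per_full_partition = math.ceil(MAX_SUBJECTS_PER_STUDENT / FETCH_SIZE)

print(max_rows)                   # 1000000000000
print(pages_per_full_partition)   # 20
```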

We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4. We are using a four-node cluster, each node having 24 cores and 32 GB of memory. I'm sure that the machines are not underpowered, as on the same test bed we've consistently received <5 ms response times for ~1b documents when queried via primary key.

I've tried three approaches, all of which result in significant deterioration (>500 ms response time) in read query performance once the number of subjectIDs goes past ~100 for a studentID. The approaches are:

1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>) and query by subjectID

2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)) and query as select * from marks_table where studentID = ?

3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ?....?), with the number of subjectIDs in the query being ~1K.
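
For approach 3, one variation that could be tried is splitting the ~1K subjectIDs into several smaller IN queries instead of sending one very large one. The Python sketch below only builds the statement text and bind values; `chunked` and `in_queries` are hypothetical helpers, the chunk size of 100 is an arbitrary assumption, and nothing in this thread confirms that smaller IN lists would reduce the latency.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of ids, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def in_queries(student_id: int, subject_ids: Sequence[int],
               chunk_size: int = 100) -> List[Tuple[str, Tuple[int, ...]]]:
    """Build one (cql, bind_values) pair per chunk of subjectIDs."""
    statements = []
    for chunk in chunked(subject_ids, chunk_size):
        placeholders = ", ".join("?" for _ in chunk)
        cql = (
            "SELECT * FROM marks_table "
            f"WHERE studentID = ? AND subjectID IN ({placeholders})"
        )
        # The first bind value is the studentID, followed by the chunk's IDs.
        statements.append((cql, (student_id, *chunk)))
    return statements


if __name__ == "__main__":
    for cql, values in in_queries(7, list(range(250)), chunk_size=100):
        print(len(values) - 1, "subjectIDs in this statement")
```

Each pair could then be executed as a separate prepared statement by whatever client is in use, and the results merged on the client side.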

What could the bottlenecks be? Would it be better to model this as (studentID int, subject_marks_json text) and query by studentID?

--
Thanks & Regards,
Apoorva



