Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CBE11109FF for ; Sat, 29 Mar 2014 11:44:01 +0000 (UTC) Received: (qmail 41029 invoked by uid 500); 29 Mar 2014 11:43:58 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 40927 invoked by uid 500); 29 Mar 2014 11:43:57 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 40919 invoked by uid 99); 29 Mar 2014 11:43:56 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 29 Mar 2014 11:43:56 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of apoorva.gaurav@myntra.com designates 209.85.223.177 as permitted sender) Received: from [209.85.223.177] (HELO mail-ie0-f177.google.com) (209.85.223.177) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 29 Mar 2014 11:43:52 +0000 Received: by mail-ie0-f177.google.com with SMTP id rl12so6005301iec.22 for ; Sat, 29 Mar 2014 04:43:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=myntra.com; s=myntra; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=qgMZ67QlzrrTv1OYtOtniZdE1MXTp/6Rznj5i1U6Psk=; b=OonuglVAxp8bT3pfJMtrgk/GVe+OxfXFN36NEAotIUaH+JZ1OUb4abeolGJbo6vXPG GEaP59wKfiASHcUpIp8twPYnwMfJRhUg5STmVxNRQLWIN2tUEjrzsOYWvbnMbX9mStqD /Ppq/NETYYB5tKK3TpDN8Bqjp5icITyU2fLE4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type; bh=qgMZ67QlzrrTv1OYtOtniZdE1MXTp/6Rznj5i1U6Psk=; b=GSYy/BVuWDOiNNsHSPxUu6pYJxg6N+DWOt2+arLxYmVIDiL/nhiWQVS+n6ytAF0cEu VLtHdTS+0f23PvyQCvObLWSa4GMpqGOoqMMPHLSNV4bAYlTN1WD6YKbM+Ds9JHRs0aaq 883trypF80+64RDKKVg3OZ0yBTfmV3S6EOBc8xn+KL/hreOeRz/Dhp4sacTtht5l6cGl g0kt2ujLLPICOrFxCh+94+ZqQ3sqBzK2WgAcPifJVcLSPpAxIJTieXwwcmtJhKgXUUZ6 jGaJTagUiuyyIAYd1N1m3V46leOIDDvIUI4JM2ds7SDjfKGrwsnGezeu8Lnx/wj7PN7I U62w== X-Gm-Message-State: ALoCoQkL3/uw3gdnAr5o8DJuVc/4T2fg4gLHInpLuLSyOW1vqPf0+fkbNILLkAb0/n3VJquoS8of X-Received: by 10.42.247.132 with SMTP id mc4mr13290695icb.44.1396093411420; Sat, 29 Mar 2014 04:43:31 -0700 (PDT) MIME-Version: 1.0 Received: by 10.64.223.20 with HTTP; Sat, 29 Mar 2014 04:43:11 -0700 (PDT) In-Reply-To: References: From: Apoorva Gaurav Date: Sat, 29 Mar 2014 17:13:11 +0530 Message-ID: Subject: Re: Read performance in map data type To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=20cf3005dba6dc8d5d04f5bd5322 X-Virus-Checked: Checked by ClamAV on apache.org --20cf3005dba6dc8d5d04f5bd5322 Content-Type: text/plain; charset=ISO-8859-1 Hello Sourabh, I'd prefer to do query like select * from marks_table where studentID = ? and subjectID in (?, ?, ?....?) but if its costly then can happily delegate the responsibility to the application layer. Haven't tried 2.x java driver for this specific issue but tried it once earlier and found the performance slower than 1.x; isn't so? On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal wrote: > Hi Apoorva, > > Do you always query on studentID only or do you need to query on both > studentID and subjectID? > > Also, I think using the latest driver (2.x) can make querying large number > of rows efficient. > http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0 > > > > > On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav > wrote: > >> Hello Shrikar, >> >> Yes primary key is (studentID, subjectID). I had dropped the test table, >> recreating and populating it post which will share the cfhistogram. In such >> case is there any practical limit on the rows I should fetch, for e.g. >> should I do >> select * form marks_table where studentID = ? limit 500; >> instead of doing >> select * form marks_table where studentID = ?; >> >> >> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak wrote: >> >>> Hi Apoorva, >>> >>> I assume this is the table with studentId and subjectId as primary keys >>> and not other like like marks in that. >>> >>> create table marks_table(studentId int, subjectId int, marks int, >>> PRIMARY KEY(studentId,subjectId)); >>> >>> Also could you give the cfhistogram stats? >>> >>> nodetool cfhistograms marks_table; >>> >>> >>> >>> Thanks, >>> Shrikar >>> >>> >>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav < >>> apoorva.gaurav@myntra.com> wrote: >>> >>>> Hello All, >>>> >>>> We've a schema which can be modeled as (studentID, subjectID, marks) >>>> where combination of studentID and subjectID is unique. Number of studentID >>>> can go up to 100 million and for each studentID we can have up to 10k >>>> subjectIDs. >>>> >>>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We >>>> are using a four node cluster, each having 24 cores and 32GB memory. I'm >>>> sure that the machines are not underperformant as on same test bed we've >>>> consistently received <5ms response times for ~1b documents when queried >>>> via primary key. >>>> >>>> I've tried three approaches, all of which result in significant >>>> deterioration (>500 ms response time) in read query performance once number >>>> of subjectIDs goes past ~100 for a studentID. Approaches are :- >>>> >>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map>>> int>) and query by subjectID >>>> >>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY >>>> KEY(studentID, subjectID) and query as select * from marks_table where >>>> studentID = ? >>>> >>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY >>>> KEY(studentID, subjectID) and query as select * from marks_table where >>>> studentID = ? and subjectID in (?, ?, ?....?) number of subjectIDs in >>>> query being ~1K. >>>> >>>> What can be the bottlenecks. Is it better if we model as (studentID >>>> int, subjct_marks_json text) and query by studentID. >>>> >>>> -- >>>> Thanks & Regards, >>>> Apoorva >>>> >>> >>> >> >> >> -- >> Thanks & Regards, >> Apoorva >> > > > > -- > Sourabh Agrawal > Bangalore > +91 9945657973 > -- Thanks & Regards, Apoorva --20cf3005dba6dc8d5d04f5bd5322 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Hello Sourabh,

I'd prefer to = do query like=A0select * from marks_table where studentID =3D ? and subjectID in (?, ?, ?.= ...?) but if its costly then can happily delegate the responsibility to the= application layer.

Haven= 9;t tried 2.x java driver for this specific issue but tried it once earlier= and found the performance slower than 1.x; isn't so?


On Sat,= Mar 29, 2014 at 3:30 PM, Sourabh Agrawal <iitr.sourabh@gmail.com= > wrote:
Hi Apoorva,

<= div>Do you always query on studentID only or do you need to query on both s= tudentID and subjectID? =A0

Also, I think using the latest driver (2.x) can make qu= erying large number of rows efficient.=A0




On Sat, Mar 29, 2014 at 8:11 AM, Apoorva= Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello Shrikar,

Yes primary key is (stud= entID, subjectID). I had dropped the test table, recreating and populating = it post which will share the cfhistogram. In such case is there any practic= al limit on the rows I should fetch, for e.g.
should I do
=A0 =A0 =A0 =A0select * form marks_table where s= tudentID =3D ? limit 500;
instead of doing=A0
=A0 =A0 = =A0 =A0select * form marks_table where studentID =3D ?;


On Sat, Mar 29, 2014 at 5:20 AM, Shrikar= archak <shrikar84@gmail.com> wrote:
Hi Apoorva,

I assume this is the table = with studentId and subjectId =A0as primary keys and not other like like mar= ks in that.

create table marks_table(studentId int= , subjectId int, marks int, PRIMARY KEY(studentId,subjectId));

Also could you give the cfhistogram stats?
<= div>
nodetool cfhistograms <your keyspace> marks_table;=



Thanks,
Shrikar


On Fri, Mar 28, 2014 at 3:53 PM, Apoorva= Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello All,

We've a schema which can= be modeled as (studentID, subjectID, marks) where combination of studentID= and subjectID is unique. Number of studentID can go up to 100 million and = for each studentID we can have up to =A010k subjectIDs.=A0

We are using apahce cassandra 2.0.4 and datastax java d= river 1.0.4.=A0We are using a four node cluster, each having 24 cores and 3= 2GB memory.=A0I'm sure that the machines are not underperformant as on = same test bed we've consistently received <5ms response times for ~1= b documents when queried via primary key.=A0

I've tried three approaches, all of which result in= significant deterioration (>500 ms response time) in read query perform= ance once number of subjectIDs goes past ~100 for a studentID. Approaches a= re :-

1. model as (studentID int PRIMARY KEY, subjectID_marks= _map map<int, int>) and query by subjectID

2= . model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID,= subjectID) and query as select * from marks_table where studentID =3D ?

3. model as (studentID int, subjectID int, marks i= nt, PRIMARY KEY(studentID, subjectID) and query as select * from marks_tabl= e where studentID =3D ? and subjectID in (?, ?, ?....?) =A0number of subjec= tIDs in query being ~1K.

What can be the bottlenecks. Is it better if we m= odel as (studentID int, subjct_marks_json text) and query by studentID.

--
Thanks & Regards,=
Apoorva




--
= Thanks & Regards,
Apoorva



<= /div>--
Sourabh Agrawal
Bangalore
+91 9945657973



--
Thanks &= Regards,
Apoorva
--20cf3005dba6dc8d5d04f5bd5322--