Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8D87D10001 for ; Thu, 3 Apr 2014 07:21:02 +0000 (UTC) Received: (qmail 86134 invoked by uid 500); 3 Apr 2014 07:20:59 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 86106 invoked by uid 500); 3 Apr 2014 07:20:58 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 86096 invoked by uid 99); 3 Apr 2014 07:20:58 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Apr 2014 07:20:58 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of apoorva.gaurav@myntra.com designates 209.85.223.177 as permitted sender) Received: from [209.85.223.177] (HELO mail-ie0-f177.google.com) (209.85.223.177) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Apr 2014 07:20:53 +0000 Received: by mail-ie0-f177.google.com with SMTP id rl12so1358341iec.8 for ; Thu, 03 Apr 2014 00:20:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=myntra.com; s=myntra; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=uDpR34kgAGzix4fy1uW3xZ8x4qr3qlybx/gUVaSKAFE=; b=OQUAW2y5qaNmcriMXQzn9JZxQOky6r11KPhYi6JT+aMETdn+azbqfbJEAM8BRn1DwN gYz1O5nfrTsUHXleZAIIy/rOL7pdjNChkarw6S89S3Pn8we0zkXbvIkSZmQqqJIrm4nw rcwA1XgsFavWdGO8atLS4PcHnI74NBBympuBY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type; bh=uDpR34kgAGzix4fy1uW3xZ8x4qr3qlybx/gUVaSKAFE=; b=Cr++IPEF/qoGhfOrSW2bpbjmiRCWNl8HRmwEM4De7Vlt7sZv7Zsu6DvDw9UpMu0Y3V ebLPv9vd6op6HdIcBPtNE1qmia8d5eCBlckd/3DdbD5O70ZPChBOubBvO/+l/DzP6z5n WKw7/env4B7HyZnfBlwFaeJZlO/0JnDKhFIGp+OKJ6hzpvwsSRhyVVl1q/10nDO/l2c1 tGtZdFD3+hkhiJKpguTQRpPxwuAuGHrSBlOKmUpFBOsEEPX70Ltk4ML5CD+7yBfMJnKD Mdp7Gugt4pVSpDeqdr17TolTHSGiZqDhxaN+iCrnXZDf97xqias3FrPglUiWxPe/c+xq YTtg== X-Gm-Message-State: ALoCoQnJ/RHdvn6pSsdwi2Q/VXP8rhb58mZ9ELzVPGaHg6FfaX23rn+01h3m8v3cVVZ5MzMSqMP5 X-Received: by 10.42.197.129 with SMTP id ek1mr4714328icb.9.1396509631240; Thu, 03 Apr 2014 00:20:31 -0700 (PDT) MIME-Version: 1.0 Received: by 10.64.223.20 with HTTP; Thu, 3 Apr 2014 00:20:11 -0700 (PDT) In-Reply-To: References: From: Apoorva Gaurav Date: Thu, 3 Apr 2014 12:50:11 +0530 Message-ID: Subject: Re: Read performance in map data type To: user Content-Type: multipart/alternative; boundary=20cf303bf9867ef71d04f61e3c3b X-Virus-Checked: Checked by ClamAV on apache.org --20cf303bf9867ef71d04f61e3c3b Content-Type: text/plain; charset=ISO-8859-1 At the client side we are getting a latency of ~350ms, we are using datastax driver 2.0.0 and have kept the fetch size as 500. And these are coming while reading rows having ~200 columns. On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak wrote: > Hi Apoorva, > As per the cfhistogram there are some rows which have more than 75k > columns and around 150k reads hit 2 SStables. > > Are you sure that you are seeing more than 500ms latency? The cfhistogram > should the worst read performance was around 51ms > which looks reasonable with many reads hitting 2 sstables. > > Thanks, > Shrikar > > > On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav > wrote: > >> Hello Shrikar, >> >> We are still facing read latency issue, here is the histogram >> http://pastebin.com/yEvMuHYh >> >> >> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav < >> apoorva.gaurav@myntra.com> wrote: >> >>> Hello Shrikar, >>> >>> Yes primary key is (studentID, subjectID). I had dropped the test table, >>> recreating and populating it post which will share the cfhistogram. In such >>> case is there any practical limit on the rows I should fetch, for e.g. >>> should I do >>> select * form marks_table where studentID = ? limit 500; >>> instead of doing >>> select * form marks_table where studentID = ?; >>> >>> >>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak wrote: >>> >>>> Hi Apoorva, >>>> >>>> I assume this is the table with studentId and subjectId as primary >>>> keys and not other like like marks in that. >>>> >>>> create table marks_table(studentId int, subjectId int, marks int, >>>> PRIMARY KEY(studentId,subjectId)); >>>> >>>> Also could you give the cfhistogram stats? >>>> >>>> nodetool cfhistograms marks_table; >>>> >>>> >>>> >>>> Thanks, >>>> Shrikar >>>> >>>> >>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav < >>>> apoorva.gaurav@myntra.com> wrote: >>>> >>>>> Hello All, >>>>> >>>>> We've a schema which can be modeled as (studentID, subjectID, marks) >>>>> where combination of studentID and subjectID is unique. Number of studentID >>>>> can go up to 100 million and for each studentID we can have up to 10k >>>>> subjectIDs. >>>>> >>>>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We >>>>> are using a four node cluster, each having 24 cores and 32GB memory. I'm >>>>> sure that the machines are not underperformant as on same test bed we've >>>>> consistently received <5ms response times for ~1b documents when queried >>>>> via primary key. >>>>> >>>>> I've tried three approaches, all of which result in significant >>>>> deterioration (>500 ms response time) in read query performance once number >>>>> of subjectIDs goes past ~100 for a studentID. Approaches are :- >>>>> >>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map>>>> int>) and query by subjectID >>>>> >>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY >>>>> KEY(studentID, subjectID) and query as select * from marks_table where >>>>> studentID = ? >>>>> >>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY >>>>> KEY(studentID, subjectID) and query as select * from marks_table where >>>>> studentID = ? and subjectID in (?, ?, ?....?) number of subjectIDs in >>>>> query being ~1K. >>>>> >>>>> What can be the bottlenecks. Is it better if we model as (studentID >>>>> int, subjct_marks_json text) and query by studentID. >>>>> >>>>> -- >>>>> Thanks & Regards, >>>>> Apoorva >>>>> >>>> >>>> >>> >>> >>> -- >>> Thanks & Regards, >>> Apoorva >>> >> >> >> >> -- >> Thanks & Regards, >> Apoorva >> > > -- Thanks & Regards, Apoorva --20cf303bf9867ef71d04f61e3c3b Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
At the client side we are getting a latency of ~350ms, we = are using datastax driver 2.0.0 and have kept the fetch size as 500. And th= ese are coming while reading rows having ~200 columns.


On Thu, Apr 3, 2014 at 12:45 PM, Shrikar= archak <shrikar84@gmail.com> wrote:
Hi Apoorva,
As per the cfhistogram there are some rows= which have more than 75k columns and around 150k reads hit 2 SStables.

Are you sure that you are seeing more than 500ms late= ncy? =A0The cfhistogram should the worst read performance was around 51ms
which looks reasonable with many reads hitting 2 sstables.
<= br>
Thanks,
Shrikar


On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav <apoorva.gaurav@my= ntra.com> wrote:
Hello Shrikar,

We are still facing read latency issue, here is the histogram=A0http://pastebin.com/= yEvMuHYh


On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gau= rav <apoorva.gaurav@myntra.com> wrote:
Hello Shrikar,

Yes primary key is (stud= entID, subjectID). I had dropped the test table, recreating and populating = it post which will share the cfhistogram. In such case is there any practic= al limit on the rows I should fetch, for e.g.
should I do
=A0 =A0 =A0 =A0select * form marks_table where s= tudentID =3D ? limit 500;
instead of doing=A0
=A0 =A0 = =A0 =A0select * form marks_table where studentID =3D ?;


On Sat, Mar 29, 2014 at 5:20 AM, Shrikar= archak <shrikar84@gmail.com> wrote:
Hi Apoorva,

I assume this is the table = with studentId and subjectId =A0as primary keys and not other like like mar= ks in that.

create table marks_table(studentId int= , subjectId int, marks int, PRIMARY KEY(studentId,subjectId));

Also could you give the cfhistogram stats?
<= div>
nodetool cfhistograms <your keyspace> marks_table;=



Thanks,
Shrikar


On Fri, Mar 28, 2014 at 3:53 PM, Apoorva= Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello All,

We've a schema which can= be modeled as (studentID, subjectID, marks) where combination of studentID= and subjectID is unique. Number of studentID can go up to 100 million and = for each studentID we can have up to =A010k subjectIDs.=A0

We are using apahce cassandra 2.0.4 and datastax java d= river 1.0.4.=A0We are using a four node cluster, each having 24 cores and 3= 2GB memory.=A0I'm sure that the machines are not underperformant as on = same test bed we've consistently received <5ms response times for ~1= b documents when queried via primary key.=A0

I've tried three approaches, all of which result in= significant deterioration (>500 ms response time) in read query perform= ance once number of subjectIDs goes past ~100 for a studentID. Approaches a= re :-

1. model as (studentID int PRIMARY KEY, subjectID_marks= _map map<int, int>) and query by subjectID

2= . model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID,= subjectID) and query as select * from marks_table where studentID =3D ?

3. model as (studentID int, subjectID int, marks i= nt, PRIMARY KEY(studentID, subjectID) and query as select * from marks_tabl= e where studentID =3D ? and subjectID in (?, ?, ?....?) =A0number of subjec= tIDs in query being ~1K.

What can be the bottlenecks. Is it better if we m= odel as (studentID int, subjct_marks_json text) and query by studentID.

--
Thanks & Reg= ards,
Apoorva




--=
Thanks & Regards,
Apoorva
=


--
Thanks & Regards,
Apoorv= a




--
= Thanks & Regards,
Apoorva
--20cf303bf9867ef71d04f61e3c3b--