Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1023610FAF for ; Thu, 3 Apr 2014 07:16:40 +0000 (UTC) Received: (qmail 76553 invoked by uid 500); 3 Apr 2014 07:16:37 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 76017 invoked by uid 500); 3 Apr 2014 07:16:35 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 76008 invoked by uid 99); 3 Apr 2014 07:16:32 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Apr 2014 07:16:32 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of shrikar84@gmail.com designates 209.85.214.181 as permitted sender) Received: from [209.85.214.181] (HELO mail-ob0-f181.google.com) (209.85.214.181) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Apr 2014 07:16:27 +0000 Received: by mail-ob0-f181.google.com with SMTP id wp4so1518737obc.12 for ; Thu, 03 Apr 2014 00:16:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=KLG/REZZQMBHHP1wED/M7EqtkqrWTvnPH0fUCwDZjFo=; b=r9rsiS5fQlsGnfvnvg+BZZzzIgrWEFqLIs1fJDKeQWKOZ/26FcRi24zYj53hMy25kE XilDOPkAHS7gOVHQbexVxbehzUmp3c9f9tzAHKkvERg/6+Rff9aFUcV7XAjVLsZNA8SG 4nxUr8VLKfWASUfqTqQrN9iDX899hyycf6P10XnN+o11XBVRudAk4p4bbVyG9aytuIIU I2aQ+JXwOzGLmy5OlX3VH93WJF4LbHPBo7zU5myI/2uLRqb5OgyvNVzLFjbs1TFMh9Uu c6dZALb2blrH5jphmsAHGWVJECDqZnXqgpjb2aKNOZkkBrWZpBsyNGcQ223Irvxmli54 Zorw== X-Received: by 10.60.74.195 with SMTP id w3mr5727528oev.28.1396509365801; Thu, 03 Apr 2014 00:16:05 -0700 (PDT) MIME-Version: 1.0 Received: by 10.182.118.129 with HTTP; Thu, 3 Apr 2014 00:15:45 -0700 (PDT) In-Reply-To: References: From: Shrikar archak Date: Thu, 3 Apr 2014 00:15:45 -0700 Message-ID: Subject: Re: Read performance in map data type To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=001a11346e9aac97ae04f61e2cf0 X-Virus-Checked: Checked by ClamAV on apache.org --001a11346e9aac97ae04f61e2cf0 Content-Type: text/plain; charset=ISO-8859-1 Hi Apoorva, As per the cfhistogram there are some rows which have more than 75k columns and around 150k reads hit 2 SStables. Are you sure that you are seeing more than 500ms latency? The cfhistogram should the worst read performance was around 51ms which looks reasonable with many reads hitting 2 sstables. Thanks, Shrikar On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav wrote: > Hello Shrikar, > > We are still facing read latency issue, here is the histogram > http://pastebin.com/yEvMuHYh > > > On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav > wrote: > >> Hello Shrikar, >> >> Yes primary key is (studentID, subjectID). I had dropped the test table, >> recreating and populating it post which will share the cfhistogram. In such >> case is there any practical limit on the rows I should fetch, for e.g. >> should I do >> select * form marks_table where studentID = ? limit 500; >> instead of doing >> select * form marks_table where studentID = ?; >> >> >> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak wrote: >> >>> Hi Apoorva, >>> >>> I assume this is the table with studentId and subjectId as primary keys >>> and not other like like marks in that. >>> >>> create table marks_table(studentId int, subjectId int, marks int, >>> PRIMARY KEY(studentId,subjectId)); >>> >>> Also could you give the cfhistogram stats? >>> >>> nodetool cfhistograms marks_table; >>> >>> >>> >>> Thanks, >>> Shrikar >>> >>> >>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav < >>> apoorva.gaurav@myntra.com> wrote: >>> >>>> Hello All, >>>> >>>> We've a schema which can be modeled as (studentID, subjectID, marks) >>>> where combination of studentID and subjectID is unique. Number of studentID >>>> can go up to 100 million and for each studentID we can have up to 10k >>>> subjectIDs. >>>> >>>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We >>>> are using a four node cluster, each having 24 cores and 32GB memory. I'm >>>> sure that the machines are not underperformant as on same test bed we've >>>> consistently received <5ms response times for ~1b documents when queried >>>> via primary key. >>>> >>>> I've tried three approaches, all of which result in significant >>>> deterioration (>500 ms response time) in read query performance once number >>>> of subjectIDs goes past ~100 for a studentID. Approaches are :- >>>> >>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map>>> int>) and query by subjectID >>>> >>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY >>>> KEY(studentID, subjectID) and query as select * from marks_table where >>>> studentID = ? >>>> >>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY >>>> KEY(studentID, subjectID) and query as select * from marks_table where >>>> studentID = ? and subjectID in (?, ?, ?....?) number of subjectIDs in >>>> query being ~1K. >>>> >>>> What can be the bottlenecks. Is it better if we model as (studentID >>>> int, subjct_marks_json text) and query by studentID. >>>> >>>> -- >>>> Thanks & Regards, >>>> Apoorva >>>> >>> >>> >> >> >> -- >> Thanks & Regards, >> Apoorva >> > > > > -- > Thanks & Regards, > Apoorva > --001a11346e9aac97ae04f61e2cf0 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Hi Apoorva,
As per the cfhistogram there are some rows= which have more than 75k columns and around 150k reads hit 2 SStables.

Are you sure that you are seeing more than 500ms late= ncy? =A0The cfhistogram should the worst read performance was around 51ms
which looks reasonable with many reads hitting 2 sstables.
<= br>
Thanks,
Shrikar


On Wed, Apr 2, 2014 at 11:30 PM, Apoo= rva Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello Shrikar,

We are still facing read latency issue, here is the histogram=A0http://pastebin.com/= yEvMuHYh


On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gau= rav <apoorva.gaurav@myntra.com> wrote:
Hello Shrikar,

Yes primary key is (stud= entID, subjectID). I had dropped the test table, recreating and populating = it post which will share the cfhistogram. In such case is there any practic= al limit on the rows I should fetch, for e.g.
should I do
=A0 =A0 =A0 =A0select * form marks_table where s= tudentID =3D ? limit 500;
instead of doing=A0
=A0 =A0 = =A0 =A0select * form marks_table where studentID =3D ?;


On Sat, Mar 29, 2014 at 5:20 AM, Shrikar= archak <shrikar84@gmail.com> wrote:
Hi Apoorva,

I assume this is the table = with studentId and subjectId =A0as primary keys and not other like like mar= ks in that.

create table marks_table(studentId int= , subjectId int, marks int, PRIMARY KEY(studentId,subjectId));

Also could you give the cfhistogram stats?
<= div>
nodetool cfhistograms <your keyspace> marks_table;=



Thanks,
Shrikar


On Fri, Mar 28, 2014 at 3:53 PM, Apoorva= Gaurav <apoorva.gaurav@myntra.com> wrote:
Hello All,

We've a schema which can= be modeled as (studentID, subjectID, marks) where combination of studentID= and subjectID is unique. Number of studentID can go up to 100 million and = for each studentID we can have up to =A010k subjectIDs.=A0

We are using apahce cassandra 2.0.4 and datastax java d= river 1.0.4.=A0We are using a four node cluster, each having 24 cores and 3= 2GB memory.=A0I'm sure that the machines are not underperformant as on = same test bed we've consistently received <5ms response times for ~1= b documents when queried via primary key.=A0

I've tried three approaches, all of which result in= significant deterioration (>500 ms response time) in read query perform= ance once number of subjectIDs goes past ~100 for a studentID. Approaches a= re :-

1. model as (studentID int PRIMARY KEY, subjectID_marks= _map map<int, int>) and query by subjectID

2= . model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID,= subjectID) and query as select * from marks_table where studentID =3D ?

3. model as (studentID int, subjectID int, marks i= nt, PRIMARY KEY(studentID, subjectID) and query as select * from marks_tabl= e where studentID =3D ? and subjectID in (?, ?, ?....?) =A0number of subjec= tIDs in query being ~1K.

What can be the bottlenecks. Is it better if we m= odel as (studentID int, subjct_marks_json text) and query by studentID.

--
Thank= s & Regards,
Apoorva



<= br>
--
Thanks & Regards,
Apoorva



--
Thanks & R= egards,
Apoorva

--001a11346e9aac97ae04f61e2cf0--