From: Apoorva Gaurav <apoorva.gaurav@myntra.com>
Date: Thu, 3 Apr 2014 13:14:39 +0530
Subject: Re: Read performance in map data type
To: user
Reply-To: user@cassandra.apache.org

Client-side socket limit: 64K
Client-side maximum connections per host: 8
Read consistency level: QUORUM

On Thu, Apr 3, 2014 at 12:59 PM, Shrikar archak <shrikar84@gmail.com> wrote:

> How about the client-side socket limits, the Cassandra client-side maximum
> connections per host, and the read consistency level?
>
> ~Shrikar
>
> On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
>
>> At the client side we are getting a latency of ~350ms. We are using
>> DataStax driver 2.0.0 and have kept the fetch size at 500. These
>> latencies occur while reading rows having ~200 columns.
>>
>> On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak <shrikar84@gmail.com> wrote:
>>
>>> Hi Apoorva,
>>>
>>> As per the cfhistograms output, there are some rows which have more
>>> than 75k columns, and around 150k reads hit 2 SSTables.
>>>
>>> Are you sure that you are seeing more than 500ms latency? The
>>> cfhistograms output shows the worst read latency was around 51ms,
>>> which looks reasonable with many reads hitting 2 SSTables.
>>>
>>> Thanks,
>>> Shrikar
>>>
>>> On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
>>>
>>>> Hello Shrikar,
>>>>
>>>> We are still facing the read latency issue; here is the histogram:
>>>> http://pastebin.com/yEvMuHYh
>>>>
>>>> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
>>>>
>>>>> Hello Shrikar,
>>>>>
>>>>> Yes, the primary key is (studentID, subjectID). I had dropped the
>>>>> test table; I am recreating and populating it, after which I will
>>>>> share the cfhistograms output. In such a case, is there any
>>>>> practical limit on the rows I should fetch? For example, should I do
>>>>>
>>>>>     select * from marks_table where studentID = ? limit 500;
>>>>>
>>>>> instead of doing
>>>>>
>>>>>     select * from marks_table where studentID = ?;
>>>>>
>>>>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
>>>>>
>>>>>> Hi Apoorva,
>>>>>>
>>>>>> I assume this is the table with studentId and subjectId as the
>>>>>> primary key and no other columns besides marks:
>>>>>>
>>>>>>     create table marks_table(studentId int, subjectId int, marks int,
>>>>>>     PRIMARY KEY(studentId, subjectId));
>>>>>>
>>>>>> Also, could you give the cfhistograms stats?
>>>>>>
>>>>>>     nodetool cfhistograms <your keyspace> marks_table;
>>>>>>
>>>>>> Thanks,
>>>>>> Shrikar
>>>>>>
>>>>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com> wrote:
>>>>>>
>>>>>>> Hello All,
>>>>>>>
>>>>>>> We have a schema which can be modeled as (studentID, subjectID,
>>>>>>> marks), where the combination of studentID and subjectID is
>>>>>>> unique. The number of studentIDs can go up to 100 million, and for
>>>>>>> each studentID we can have up to 10k subjectIDs.
>>>>>>>
>>>>>>> We are using Apache Cassandra 2.0.4 and DataStax Java driver
>>>>>>> 1.0.4. We are using a four-node cluster, each node having 24 cores
>>>>>>> and 32GB memory. I'm sure that the machines are not
>>>>>>> underperforming, as on the same test bed we've consistently
>>>>>>> received <5ms response times for ~1b documents when queried via
>>>>>>> primary key.
>>>>>>>
>>>>>>> I've tried three approaches, all of which result in significant
>>>>>>> deterioration (>500 ms response time) in read query performance
>>>>>>> once the number of subjectIDs goes past ~100 for a studentID. The
>>>>>>> approaches are:
>>>>>>>
>>>>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map
>>>>>>>    map<int, int>) and query by subjectID
>>>>>>>
>>>>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>>>>    KEY(studentID, subjectID)) and query as select * from
>>>>>>>    marks_table where studentID = ?
>>>>>>>
>>>>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>>>>    KEY(studentID, subjectID)) and query as select * from
>>>>>>>    marks_table where studentID = ? and subjectID in (?, ?, ?....?),
>>>>>>>    the number of subjectIDs in the query being ~1K.
>>>>>>>
>>>>>>> What could the bottlenecks be? Is it better if we model as
>>>>>>> (studentID int, subjct_marks_json text) and query by studentID?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Apoorva
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Apoorva
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Apoorva
>>
>> --
>> Thanks & Regards,
>> Apoorva

--
Thanks & Regards,
Apoorva
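
A rough, back-of-envelope sketch of what the fetch size of 500 mentioned in this thread could imply. It assumes approach 2 (one CQL row per subjectID) and a driver that pages results automatically, with each page costing at least one client/server round trip. The function name `pages_needed` is hypothetical, and the result is only a lower bound on round trips; it says nothing about server-side read time or SSTable hits.

```python
import math


def pages_needed(rows_in_partition: int, fetch_size: int) -> int:
    """Lower bound on result pages (round trips) to read one partition.

    Assumption: the driver fetches at most `fetch_size` CQL rows per page
    and issues at least one request even when the partition is empty.
    """
    if rows_in_partition < 0:
        raise ValueError("rows_in_partition must be non-negative")
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    return max(1, math.ceil(rows_in_partition / fetch_size))


# Figures taken from the thread: up to 10k subjectIDs per studentID,
# fetch size 500.
full_partition_pages = pages_needed(10_000, 500)

# Assumption: the "~200 columns" reads correspond to about 200 CQL rows.
small_partition_pages = pages_needed(200, 500)
```

Under these assumptions a ~200-row read fits in a single page, so paging alone would not account for a ~350ms client-side latency on such reads, while a full 10k-row partition would need at least 20 pages.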