Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of apoorva.gaurav@myntra.com
 designates 209.85.213.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAF4_Gfj-ednyAN-Hm16yW8_nRBVToyJz3AqgJ7aOvE_295npSA@mail.gmail.com>
References: 
 <CAJRvvD_bvyMpqbAdFRB_Eabi8NUf=5pPk5v0mfim7H=x4gJHyA@mail.gmail.com>
 <CAF4_GfiPUuFeWukJGne8kEX8fFdxT_+6kKywvYkjTYtByTiM8g@mail.gmail.com>
 <CAJRvvD9kCiFeH-d9NDi35ZAuHMUGrd--F50OfY5xkkTt8LJS0g@mail.gmail.com>
 <CAJRvvD8pM4FQNP-WQxAoJHD9fmZSH05LoDvJQXA7NzKwtFmphQ@mail.gmail.com>
 <CAF4_GfhmF_y2NtO8XWS74M1zwT18MUXmC=e0=YD9tt=RjJff3g@mail.gmail.com>
 <CAJRvvD_dcADZp8smUS+tBH0Mh_tNYvoca00=vk6u0Pth0NmqGA@mail.gmail.com>
 <CAF4_Gfj-ednyAN-Hm16yW8_nRBVToyJz3AqgJ7aOvE_295npSA@mail.gmail.com>
From: Apoorva Gaurav <apoorva.gaurav@myntra.com>
Date: Thu, 3 Apr 2014 13:14:39 +0530
Message-ID: 
 <CAJRvvD8xdjHThYUJRn5hrQ1ryo5MjT83AiRnu8EPx41F0L0xyQ@mail.gmail.com>
Subject: Re: Read performance in map data type
To: user <user@cassandra.apache.org>
Content-Type: multipart/alternative; boundary=001a11c2ec320be32e04f61e9467

--001a11c2ec320be32e04f61e9467
Content-Type: text/plain; charset=ISO-8859-1

client side socket limit : 64K
client side maximum connection per host : 8
read consistency level : Quorum
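
For what it's worth, here is a rough sketch (plain Python, not our actual client code) of how many pages the driver would have to fetch to read everything for a single studentID at our fetch size of 500. It assumes one CQL row per subjectID, and it treats the 75k figure from the histogram as a row count, which is only an approximation since cfhistograms counts columns rather than CQL rows. The function name pages_needed is made up for illustration.

```python
import math


def pages_needed(rows_in_partition: int, fetch_size: int) -> int:
    """Estimate how many result pages are needed to read a whole partition.

    Assumption: each page holds at most `fetch_size` rows and every page
    costs one request, so even an empty result costs one request.
    """
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    if rows_in_partition < 0:
        raise ValueError("rows_in_partition must not be negative")
    return max(1, math.ceil(rows_in_partition / fetch_size))


FETCH_SIZE = 500  # fetch size mentioned in this thread

# ~200 rows (our current test), 10k (expected maximum), 75k (from the histogram)
for rows in (200, 10_000, 75_000):
    print(f"{rows} rows -> {pages_needed(rows, FETCH_SIZE)} page(s)")
```

By this estimate the ~200-column reads fit in a single page, so paging alone should not explain the ~350ms we see for them.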


On Thu, Apr 3, 2014 at 12:59 PM, Shrikar archak <shrikar84@gmail.com> wrote:

> How about the client side socket limits? Cassandra client side maximum
> connection per host and read consistency level?
>
> ~Shrikar
>
>
> On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav
> <apoorva.gaurav@myntra.com> wrote:
>
>> At the client side we are seeing a latency of ~350ms. We are using
>> datastax driver 2.0.0 with the fetch size set to 500. These latencies
>> occur while reading rows having ~200 columns.
>>
>>
>> On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak <shrikar84@gmail.com> wrote:
>>
>>> Hi Apoorva,
>>> As per the cfhistograms output, some rows have more than 75k
>>> columns, and around 150k reads hit 2 SSTables.
>>>
>>> Are you sure that you are seeing more than 500ms latency? The
>>> cfhistograms output shows that the worst read latency was around 51ms,
>>> which looks reasonable with many reads hitting 2 SSTables.
>>>
>>> Thanks,
>>> Shrikar
>>>
>>>
>>> On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav <
>>> apoorva.gaurav@myntra.com> wrote:
>>>
>>>> Hello Shrikar,
>>>>
>>>> We are still facing the read latency issue; here is the histogram:
>>>> http://pastebin.com/yEvMuHYh
>>>>
>>>>
>>>> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <
>>>> apoorva.gaurav@myntra.com> wrote:
>>>>
>>>>> Hello Shrikar,
>>>>>
>>>>> Yes, the primary key is (studentID, subjectID). I had dropped the test
>>>>> table; I am recreating and populating it, after which I will share the
>>>>> cfhistograms output. In that case, is there any practical limit on the
>>>>> rows I should fetch? For example, should I do
>>>>>        select * from marks_table where studentID = ? limit 500;
>>>>> instead of doing
>>>>>        select * from marks_table where studentID = ?;
>>>>>
>>>>>
>>>>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
>>>>>
>>>>>> Hi Apoorva,
>>>>>>
>>>>>> I assume this is the table with studentId and subjectId as the primary
>>>>>> key, with no other columns (such as marks) in the key.
>>>>>>
>>>>>> create table marks_table(studentId int, subjectId int, marks int,
>>>>>> PRIMARY KEY(studentId,subjectId));
>>>>>>
>>>>>> Also, could you share the cfhistograms stats?
>>>>>>
>>>>>> nodetool cfhistograms <your keyspace> marks_table;
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Shrikar
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>>>>> apoorva.gaurav@myntra.com> wrote:
>>>>>>
>>>>>>> Hello All,
>>>>>>>
>>>>>>> We have a schema which can be modeled as (studentID, subjectID, marks),
>>>>>>> where the combination of studentID and subjectID is unique. The number
>>>>>>> of studentIDs can go up to 100 million, and for each studentID we can
>>>>>>> have up to 10k subjectIDs.
>>>>>>>
>>>>>>> We are using Apache Cassandra 2.0.4 and DataStax Java driver
>>>>>>> 1.0.4. We are using a four-node cluster, each node having 24 cores and
>>>>>>> 32GB memory. I'm sure that the machines are not underpowered, as on the
>>>>>>> same test bed we've consistently received <5ms response times for ~1b
>>>>>>> documents when queried via primary key.
>>>>>>>
>>>>>>> I've tried three approaches, all of which result in significant
>>>>>>> deterioration (>500 ms response time) in read query performance once the
>>>>>>> number of subjectIDs goes past ~100 for a studentID. The approaches are:
>>>>>>>
>>>>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>>>>> int>) and query by subjectID
>>>>>>>
>>>>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>>>>> studentID = ?
>>>>>>>
>>>>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>>>>> studentID = ? and subjectID in (?, ?, ?....?), with the number of
>>>>>>> subjectIDs in the query being ~1K.
>>>>>>>
>>>>>>> What could the bottlenecks be? Would it be better to model this as
>>>>>>> (studentID int, subjct_marks_json text) and query by studentID?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Apoorva
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Apoorva
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Apoorva
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>


-- 
Thanks & Regards,
Apoorva

--001a11c2ec320be32e04f61e9467--