Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of apoorva.gaurav@myntra.com
 designates 209.85.223.177 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAF4_GfhmF_y2NtO8XWS74M1zwT18MUXmC=e0=YD9tt=RjJff3g@mail.gmail.com>
References: 
 <CAJRvvD_bvyMpqbAdFRB_Eabi8NUf=5pPk5v0mfim7H=x4gJHyA@mail.gmail.com>
 <CAF4_GfiPUuFeWukJGne8kEX8fFdxT_+6kKywvYkjTYtByTiM8g@mail.gmail.com>
 <CAJRvvD9kCiFeH-d9NDi35ZAuHMUGrd--F50OfY5xkkTt8LJS0g@mail.gmail.com>
 <CAJRvvD8pM4FQNP-WQxAoJHD9fmZSH05LoDvJQXA7NzKwtFmphQ@mail.gmail.com>
 <CAF4_GfhmF_y2NtO8XWS74M1zwT18MUXmC=e0=YD9tt=RjJff3g@mail.gmail.com>
From: Apoorva Gaurav <apoorva.gaurav@myntra.com>
Date: Thu, 3 Apr 2014 12:50:11 +0530
Message-ID: 
 <CAJRvvD_dcADZp8smUS+tBH0Mh_tNYvoca00=vk6u0Pth0NmqGA@mail.gmail.com>
Subject: Re: Read performance in map data type
To: user <user@cassandra.apache.org>
Content-Type: multipart/alternative; boundary=20cf303bf9867ef71d04f61e3c3b

--20cf303bf9867ef71d04f61e3c3b
Content-Type: text/plain; charset=ISO-8859-1

On the client side we are seeing a latency of ~350 ms. We are using
DataStax driver 2.0.0 with the fetch size set to 500. These latencies
occur while reading rows that have ~200 columns.
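
For reference, here is a rough sketch of the paging arithmetic behind those
numbers. It assumes that "columns" above means CQL rows within one partition
and that the fetch size caps the number of rows returned per page. The
function name min_pages is only illustrative and not part of any driver API,
and the result is a lower bound on round trips, not a measurement.

```python
import math


def min_pages(row_count: int, fetch_size: int) -> int:
    """Lower bound on the number of pages needed to read row_count rows."""
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    if row_count <= 0:
        # Even an empty result costs one request/response round trip.
        return 1
    return math.ceil(row_count / fetch_size)


# Partition sizes mentioned in this thread, with a fetch size of 500.
for rows in (200, 501, 10_000):
    print(rows, min_pages(rows, 500))
```

Under these assumptions a ~200-row partition fits in a single page, so paging
round trips are unlikely to be what adds the latency here, while a 10k-row
partition would need at least 20 pages.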


On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak <shrikar84@gmail.com> wrote:

> Hi Apoorva,
> As per the cfhistograms output, some rows have more than 75k columns and
> around 150k reads hit 2 SSTables.
>
> Are you sure that you are seeing more than 500 ms latency? The cfhistograms
> output shows that the worst read latency was around 51 ms, which looks
> reasonable with many reads hitting 2 SSTables.
>
> Thanks,
> Shrikar
>
>
> On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav <apoorva.gaurav@myntra.com
> > wrote:
>
>> Hello Shrikar,
>>
>> We are still facing the read latency issue. Here is the histogram:
>> http://pastebin.com/yEvMuHYh
>>
>>
>> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <
>> apoorva.gaurav@myntra.com> wrote:
>>
>>> Hello Shrikar,
>>>
>>> Yes, the primary key is (studentID, subjectID). I had dropped the test
>>> table; I am recreating and populating it, after which I will share the
>>> cfhistograms output. In that case, is there any practical limit on the
>>> number of rows I should fetch? For example, should I do
>>>        select * from marks_table where studentID = ? limit 500;
>>> instead of doing
>>>        select * from marks_table where studentID = ?;
>>>
>>>
>>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
>>>
>>>> Hi Apoorva,
>>>>
>>>> I assume this is the table with studentId and subjectId as the primary
>>>> key columns, and nothing else (such as marks) in the key.
>>>>
>>>> create table marks_table(studentId int, subjectId int, marks int,
>>>> PRIMARY KEY(studentId,subjectId));
>>>>
>>>> Also could you give the cfhistogram stats?
>>>>
>>>> nodetool cfhistograms <your keyspace> marks_table;
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Shrikar
>>>>
>>>>
>>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>>> apoorva.gaurav@myntra.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> We have a schema which can be modeled as (studentID, subjectID, marks),
>>>>> where the combination of studentID and subjectID is unique. The number of
>>>>> studentIDs can go up to 100 million, and for each studentID we can have up
>>>>> to 10k subjectIDs.
>>>>>
>>>>> We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4 on a
>>>>> four-node cluster, each node having 24 cores and 32GB memory. I'm sure
>>>>> that the machines are not underpowered, as on the same test bed we've
>>>>> consistently received <5ms response times for ~1b documents when queried
>>>>> via primary key.
>>>>>
>>>>> I've tried three approaches, all of which result in significant
>>>>> deterioration (>500 ms response time) in read query performance once the
>>>>> number of subjectIDs goes past ~100 for a studentID. The approaches are:
>>>>>
>>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>>> int>) and query by subjectID
>>>>>
>>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>>> studentID = ?
>>>>>
>>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>>> studentID = ? and subjectID in (?, ?, ?....?), with the number of
>>>>> subjectIDs in the query being ~1K.
>>>>>
>>>>> What could the bottlenecks be? Would it be better to model this as
>>>>> (studentID int, subjct_marks_json text) and query by studentID?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Apoorva
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Apoorva
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>


-- 
Thanks & Regards,
Apoorva

--20cf303bf9867ef71d04f61e3c3b--