Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of shrikar84@gmail.com designates
 209.85.214.181 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAJRvvD8pM4FQNP-WQxAoJHD9fmZSH05LoDvJQXA7NzKwtFmphQ@mail.gmail.com>
References: 
 <CAJRvvD_bvyMpqbAdFRB_Eabi8NUf=5pPk5v0mfim7H=x4gJHyA@mail.gmail.com>
 <CAF4_GfiPUuFeWukJGne8kEX8fFdxT_+6kKywvYkjTYtByTiM8g@mail.gmail.com>
 <CAJRvvD9kCiFeH-d9NDi35ZAuHMUGrd--F50OfY5xkkTt8LJS0g@mail.gmail.com>
 <CAJRvvD8pM4FQNP-WQxAoJHD9fmZSH05LoDvJQXA7NzKwtFmphQ@mail.gmail.com>
From: Shrikar archak <shrikar84@gmail.com>
Date: Thu, 3 Apr 2014 00:15:45 -0700
Message-ID: 
 <CAF4_GfhmF_y2NtO8XWS74M1zwT18MUXmC=e0=YD9tt=RjJff3g@mail.gmail.com>
Subject: Re: Read performance in map data type
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001a11346e9aac97ae04f61e2cf0

--001a11346e9aac97ae04f61e2cf0
Content-Type: text/plain; charset=ISO-8859-1

Hi Apoorva,
As per the cfhistograms output, some rows have more than 75k columns,
and around 150k reads hit 2 SSTables.

Are you sure that you are seeing more than 500 ms latency? The cfhistograms
output shows that the worst read latency was around 51 ms,
which looks reasonable with many reads hitting 2 SSTables.
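
To double-check numbers like these, here is a minimal sketch that summarizes
a bucketed read-latency histogram. It assumes the read latency column has
been copied by hand into a dict of bucket offset (taken to be microseconds)
to read count. The function name latency_summary and the sample counts are
made up for illustration; they are not the data from the pastebin.

```python
from typing import Dict, Tuple


def latency_summary(buckets: Dict[int, int]) -> Tuple[int, int, int]:
    """Return (total_reads, p99_us, max_us) for {bucket_offset_us: read_count}."""
    total = sum(buckets.values())
    if total == 0:
        raise ValueError("histogram has no reads")

    # Walk the buckets in ascending order until 99% of the reads are covered.
    target = 0.99 * total
    seen = 0
    p99 = None
    for offset in sorted(buckets):
        seen += buckets[offset]
        if p99 is None and seen >= target:
            p99 = offset

    # The worst case is the largest bucket that recorded at least one read.
    max_us = max(offset for offset, count in buckets.items() if count > 0)
    return total, p99, max_us


# Hypothetical counts for illustration only, NOT the real histogram.
sample = {310: 1000, 1597: 120000, 9887: 28000, 51012: 1000}
total, p99, worst = latency_summary(sample)
print(f"{total} reads, p99 ~{p99 / 1000:.1f} ms, worst ~{worst / 1000:.1f} ms")
```

If the worst bucket on the server side is far below what the application
measures, the extra time is probably being spent outside the read path that
this histogram covers (client, network, or coordinator).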

Thanks,
Shrikar


On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav
<apoorva.gaurav@myntra.com> wrote:

> Hello Shrikar,
>
> We are still facing the read latency issue; here is the histogram:
> http://pastebin.com/yEvMuHYh
>
>
> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav
> <apoorva.gaurav@myntra.com> wrote:
>
>> Hello Shrikar,
>>
>> Yes, the primary key is (studentID, subjectID). I had dropped the test
>> table; I am recreating and populating it, after which I will share the
>> cfhistograms output. In that case, is there any practical limit on the
>> rows I should fetch? For example, should I do
>>        select * from marks_table where studentID = ? limit 500;
>> instead of doing
>>        select * from marks_table where studentID = ?;
>>
>>
>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
>>
>>> Hi Apoorva,
>>>
>>> I assume this is the table with studentId and subjectId as the primary
>>> key, and that other columns such as marks are not part of it.
>>>
>>> create table marks_table(studentId int, subjectId int, marks int,
>>> PRIMARY KEY(studentId,subjectId));
>>>
>>> Also, could you share the cfhistograms stats?
>>>
>>> nodetool cfhistograms <your keyspace> marks_table;
>>>
>>>
>>>
>>> Thanks,
>>> Shrikar
>>>
>>>
>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>> apoorva.gaurav@myntra.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> We have a schema which can be modeled as (studentID, subjectID, marks),
>>>> where the combination of studentID and subjectID is unique. The number of
>>>> studentIDs can go up to 100 million, and each studentID can have up to
>>>> 10k subjectIDs.
>>>>
>>>> We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4 on a
>>>> four-node cluster, each node having 24 cores and 32GB memory. I'm
>>>> sure that the machines are not underperforming, as on the same test bed
>>>> we've consistently received <5ms response times for ~1b documents when
>>>> queried via primary key.
>>>>
>>>> I've tried three approaches, all of which result in significant
>>>> deterioration (>500 ms response time) in read query performance once the
>>>> number of subjectIDs goes past ~100 for a studentID. The approaches are:
>>>>
>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>> int>) and query by subjectID
>>>>
>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>> studentID = ?
>>>>
>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>> studentID = ? and subjectID in (?, ?, ?....?), with the number of
>>>> subjectIDs in the query being ~1K.
>>>>
>>>> What can be the bottlenecks? Would it be better to model this as
>>>> (studentID int, subjct_marks_json text) and query by studentID?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Apoorva
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>
>
> --
> Thanks & Regards,
> Apoorva
>

--001a11346e9aac97ae04f61e2cf0--