Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of apoorva.gaurav@myntra.com
 designates 209.85.223.177 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CANF7QJQp9nic+ABO4S9Qcf2Jnf2_xE1gzRnLFLycULQ6RgfCKA@mail.gmail.com>
References: 
 <CAJRvvD_bvyMpqbAdFRB_Eabi8NUf=5pPk5v0mfim7H=x4gJHyA@mail.gmail.com>
 <CAF4_GfiPUuFeWukJGne8kEX8fFdxT_+6kKywvYkjTYtByTiM8g@mail.gmail.com>
 <CAJRvvD9kCiFeH-d9NDi35ZAuHMUGrd--F50OfY5xkkTt8LJS0g@mail.gmail.com>
 <CANF7QJQp9nic+ABO4S9Qcf2Jnf2_xE1gzRnLFLycULQ6RgfCKA@mail.gmail.com>
From: Apoorva Gaurav <apoorva.gaurav@myntra.com>
Date: Sat, 29 Mar 2014 17:13:11 +0530
Message-ID: 
 <CAJRvvD-zFv+hYzL+2VNLo8SaoVdaOGzWZtaApTcwnKyuEPVZjA@mail.gmail.com>
Subject: Re: Read performance in map data type
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=20cf3005dba6dc8d5d04f5bd5322

--20cf3005dba6dc8d5d04f5bd5322
Content-Type: text/plain; charset=ISO-8859-1

Hello Sourabh,

I'd prefer to run a query like select * from marks_table where studentID = ?
and subjectID in (?, ?, ?....?), but if it's costly I can happily delegate
the responsibility to the application layer.
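
For illustration, here is a minimal sketch of two ways the application layer
could take over: fetch the whole partition by studentID and filter the rows
client-side, or split a long subjectID list into several smaller IN queries.
It is written in Python only for brevity and does not call any driver. The
function names, the (subjectID, marks) row shape, and the chunk size of 100
are assumptions made for this example, not measured recommendations.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def filter_marks(rows: Iterable[Tuple[int, int]],
                 wanted_subject_ids: Iterable[int]) -> Dict[int, int]:
    """Keep only the wanted subjects from rows fetched by studentID alone.

    `rows` is assumed to be an iterable of (subjectID, marks) pairs.
    """
    wanted = set(wanted_subject_ids)  # set lookup keeps the filter O(n)
    return {subject_id: marks
            for subject_id, marks in rows
            if subject_id in wanted}


def build_in_queries(student_id: int,
                     subject_ids: Sequence[int],
                     chunk_size: int = 100) -> List[Tuple[str, List[int]]]:
    """Split one large IN list into several smaller parameterised queries.

    Returns (cql, params) pairs; chunk_size=100 is an arbitrary assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    queries = []
    for start in range(0, len(subject_ids), chunk_size):
        chunk = list(subject_ids[start:start + chunk_size])
        placeholders = ", ".join("?" for _ in chunk)
        cql = ("select * from marks_table "
               f"where studentID = ? and subjectID in ({placeholders})")
        queries.append((cql, [student_id] + chunk))
    return queries
```

Each (cql, params) pair would then be prepared and executed by whichever
driver is in use, and the partial results merged in the application.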

I haven't tried the 2.x Java driver for this specific issue, but I tried it once
earlier and found its performance slower than 1.x. Is that not the case?


On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal <iitr.sourabh@gmail.com> wrote:

> Hi Apoorva,
>
> Do you always query on studentID only or do you need to query on both
> studentID and subjectID?
>
> Also, I think using the latest driver (2.x) can make querying a large number
> of rows more efficient.
> http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
>
>
>
>
> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <apoorva.gaurav@myntra.com>
> wrote:
>
>> Hello Shrikar,
>>
>> Yes, the primary key is (studentID, subjectID). I had dropped the test table;
>> I am recreating and repopulating it, after which I will share the
>> cfhistograms output. In that case, is there any practical limit on the rows
>> I should fetch? For example, should I do
>>        select * from marks_table where studentID = ? limit 500;
>> instead of
>>        select * from marks_table where studentID = ?;
>>
>>
>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrikar84@gmail.com> wrote:
>>
>>> Hi Apoorva,
>>>
>>> I assume this is the table with studentId and subjectId as the primary key
>>> columns, and no others such as marks in the key.
>>>
>>> create table marks_table(studentId int, subjectId int, marks int,
>>> PRIMARY KEY(studentId,subjectId));
>>>
>>> Also, could you share the cfhistograms stats?
>>>
>>> nodetool cfhistograms <your keyspace> marks_table;
>>>
>>>
>>>
>>> Thanks,
>>> Shrikar
>>>
>>>
>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>> apoorva.gaurav@myntra.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> We have a schema which can be modeled as (studentID, subjectID, marks),
>>>> where the combination of studentID and subjectID is unique. The number of
>>>> studentIDs can go up to 100 million, and each studentID can have up to 10k
>>>> subjectIDs.
>>>>
>>>> We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4 on a
>>>> four-node cluster, each node having 24 cores and 32GB of memory. I'm sure
>>>> that the machines are not underpowered, as on the same test bed we've
>>>> consistently received <5ms response times for ~1b documents when queried
>>>> via primary key.
>>>>
>>>> I've tried three approaches, all of which result in significant
>>>> deterioration (>500 ms response time) in read query performance once the
>>>> number of subjectIDs goes past ~100 for a studentID. The approaches are:
>>>>
>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>> int>) and query by subjectID
>>>>
>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>> studentID = ?
>>>>
>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>> studentID = ? and subjectID in (?, ?, ?....?), the number of subjectIDs in
>>>> the query being ~1K.
>>>>
>>>> What could the bottlenecks be? Would it be better to model this as
>>>> (studentID int, subjct_marks_json text) and query by studentID?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Apoorva
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>
>
> --
> Sourabh Agrawal
> Bangalore
> +91 9945657973
>


-- 
Thanks & Regards,
Apoorva

--20cf3005dba6dc8d5d04f5bd5322--