Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <CAAam9suYs5wHCrgy_WPwALjqeCijbD4fFrNeoeFTuy8x90heLQ@mail.gmail.com>
References: <CAG0vsSKyQ2=qM-UphiJeTSC0BUA2_XYE-H6WjsL67KLJDbDaPw@mail.gmail.com>
 <CAAam9suYs5wHCrgy_WPwALjqeCijbD4fFrNeoeFTuy8x90heLQ@mail.gmail.com>
From: Atul Saroha <atul.saroha@snapdeal.com>
Date: Wed, 18 May 2016 17:28:22 +0530
Message-ID: <CAG0vsS+bkfbTiS2+2NuVUMx8JVtDo0pcC0VcPJRsgFrfGVQMww@mail.gmail.com>
Subject: Re: Low cardinality secondary index behaviour
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001a1133ac902fb36105331c9436
archived-at: Wed, 18 May 2016 11:58:55 -0000

--001a1133ac902fb36105331c9436
Content-Type: text/plain; charset=UTF-8

Thanks Tyler,

A SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
3.0.6 now.

---------------------------------------------------------------------------------------------------------------------
Atul Saroha
*Lead Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs <tyler@datastax.com> wrote:

>
> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <atul.saroha@snapdeal.com>
> wrote:
>
>> I have a concern about using a secondary index on a field with low
>> cardinality. Let's say I have a few billion rows and each row can be
>> classified into one of 1000 categories. Let's say we have a 50-node cluster.
>>
>> Now we want to fetch data for a single category using a secondary index
>> on the category column. The query is also paginated, with a fetch size of,
>> say, 5000.
>>
>> A query on a secondary index works as a scatter-gather operation driven by
>> the coordinator node. Would it lead to out-of-memory errors on the
>> coordinator, or to too many timeout errors?
>>
>
> Paging will prevent the coordinator from using excessive memory.  With the
> type of data that you described, timeouts shouldn't be a huge problem because
> it will only take a few token ranges (assuming you're using vnodes) to get
> enough matching rows to hit the page size.
>
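
To put rough numbers on "a few token ranges", here is a back-of-the-envelope
sketch in Python. The inputs are assumptions, not measurements: 3 billion rows
standing in for "a few billion", categories spread evenly over rows and token
ranges, 256 vnodes per node, and replication ignored.

```python
import math


def ranges_to_fill_page(total_rows, categories, nodes, vnodes_per_node, page_size):
    """Estimate how many token ranges must be scanned to fill one page.

    Assumes rows of every category are spread evenly over all token ranges.
    """
    token_ranges = nodes * vnodes_per_node              # ranges in the ring
    rows_per_category = total_rows / categories         # rows matching one category
    matches_per_range = rows_per_category / token_ranges
    needed = math.ceil(page_size / matches_per_range)
    return token_ranges, matches_per_range, needed


# Assumed inputs: 3e9 rows, 1000 categories, 50 nodes, 256 vnodes, page of 5000.
token_ranges, matches_per_range, needed = ranges_to_fill_page(
    3_000_000_000, 1000, 50, 256, 5000
)
print(token_ranges, matches_per_range, needed)
```

Under these assumptions a page fills after roughly 22 of the 12,800 token
ranges, so only a small slice of the ring is touched per page.
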
>
>>
>> How does pagination (token-level data fetch) behave in the scatter-gather
>> approach?
>>
>
> Secondary index queries fetch token ranges in sequential order [1],
> starting with the minimum token.  When you fetch a new page, it resumes
> from the last token (and primary key) that it returned in the previous page.
>
> [1] As an optimization, multiple token ranges will be fetched in parallel
> based on estimates of how many token ranges it will take to fill the page.
>
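
The resume behaviour described above can be illustrated with a toy model in
Python. This is only a sketch: the ring is a hypothetical list of token ranges
holding (token, primary key) pairs, ranges are walked strictly one at a time
(no parallel fetch), and the paging state is simply the last row returned. It
is not how the server or any driver implements paging.

```python
def fetch_page(token_ranges, page_size, resume_after=None):
    """Return (page, paging_state) by walking token ranges in ring order.

    token_ranges: list of ranges, each a list of (token, primary_key) rows
                  sorted by token; ranges are ordered from the minimum token.
    resume_after: the (token, primary_key) that ended the previous page.
    """
    page = []
    for rows in token_ranges:
        for row in rows:
            if resume_after is not None and row <= resume_after:
                continue  # already returned by an earlier page
            page.append(row)
            if len(page) == page_size:
                return page, row  # the last row acts as the paging state
    return page, None  # ring exhausted, nothing left to resume from


# Hypothetical ring with three token ranges and five matching rows.
ring = [
    [(-90, "a"), (-50, "b")],
    [(-10, "c")],
    [(20, "d"), (70, "e")],
]

page1, state1 = fetch_page(ring, page_size=2)
page2, state2 = fetch_page(ring, page_size=2, resume_after=state1)
page3, state3 = fetch_page(ring, page_size=2, resume_after=state2)
print(page1, page2, page3)
```

Each call picks up right after the token and key that ended the previous page,
and the second page spans two token ranges because one range alone does not
fill it.
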
>
>>
>> Secondly, what if we create an inverted table with category as the
>> partition key? This will lead to lots of data on a single node. It might
>> then lead to a hot-shard issue and slow data fetches from that node, since
>> a single partition has millions of rows.
>>
>> How should we tackle such a low-cardinality index in Cassandra?
>
>
> The data distribution that you described sounds like a reasonable fit for
> secondary indexes.  However, I would also take into account how frequently
> you run this query and how fast you need it to be.  Even ignoring the
> scatter-gather aspects of secondary index queries, they are still expensive
> because they fetch many non-contiguous rows from an SSTable.  If you need
> to run this query very frequently, that may add too much load to your
> cluster, and some sort of inverted table approach may be more appropriate.
>
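
One possible shape for "some sort of inverted table approach" that also
addresses the hot-partition worry raised earlier is to add a bucket to the
partition key. This is a swapped-in technique, not something proposed in the
thread. The Python sketch below is hypothetical throughout: the bucket count of
64, the CRC32 hash, and the names `partition_key` and
`partitions_for_category` are all illustrative assumptions.

```python
import zlib

NUM_BUCKETS = 64  # hypothetical; pick so each partition stays a manageable size


def partition_key(category, item_id, num_buckets=NUM_BUCKETS):
    """Map a row to a (category, bucket) partition of the inverted table."""
    # CRC32 is deterministic across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(item_id.encode("utf-8")) % num_buckets
    return (category, bucket)


def partitions_for_category(category, num_buckets=NUM_BUCKETS):
    """List every partition a reader must query to cover one category."""
    return [(category, bucket) for bucket in range(num_buckets)]


key = partition_key("books", "item-42")
all_partitions = partitions_for_category("books")
print(key, len(all_partitions))
```

Writes for one category then spread over 64 partitions instead of one, at the
cost of readers issuing one query per bucket. If a category held about 3
million rows (an assumed figure), each partition would hold roughly 47,000.
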
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>

--001a1133ac902fb36105331c9436--