Mailing-List: user@cassandra.apache.org
From: Dmitry Saprykin
Date: Thu, 17 Aug 2017 16:01:37 -0400
Subject:
Re: Full table scan with cassandra
To: user@cassandra.apache.org

Hi Alex,

How do you generate your subrange set for running queries? It may happen that some of your ranges cross data ownership range borders (check by running 'nodetool describering [keyspace_name]'). Such range queries are highly ineffective in that case, and that could explain your results.

Also, consider using LOCAL_ONE consistency for your full scans. You may lose some consistency but will gain a lot of performance.

Kind regards,
Dmitry Saprykin

On Thu, Aug 17, 2017 at 12:36 PM, Alex Kotelnikov <alex.kotelnikov@diginetica.com> wrote:

> Dor,
>
> I believe I tried it in many ways, and the result is quite disappointing.
> I ran my scans on 3 different clusters, one of which was running on VMs,
> so I was able to scale it up and down (3-5-7 VMs, 8 to 24 cores) to see
> how this affects the performance.
>
> I also generated the load from a Spark cluster, ranging from 4 to 40
> parallel tasks, as well as from a plain multi-threaded client.
>
> The surprise is that a trivial fetch of all records using token ranges
> takes pretty much the same time in all setups.
>
> The only beneficial thing I've learned is that it is much more efficient
> to create a MATERIALIZED VIEW than to filter (even using a secondary
> index).
>
> Say, I have a typical dataset, around 3Gb of data, 1M records. And I have
> a trivial scan routine:
>
> String.format("SELECT token(user_id), user_id, events FROM user_events
> WHERE token(user_id) >= %d ", start) + (end != null ? String.format(" AND
> token(user_id) < %d ", end) : "")
>
> I split all tokens into start-end ranges (except for the last range,
> which only has a start) and query the ranges in multiple threads, up
> to 40.
>
> The whole process takes ~40s on a 3-VM cluster (2+2+4 cores, 16Gb RAM,
> 1 virtual disk each). And it takes ~30s on a real hardware cluster of
> 8 servers * 8 cores * 32Gb. The level of concurrency barely matters,
> unless it is too high or too low.
> The size of the token ranges matters, but here I see the rule "make it
> larger, but avoid cassandra timeouts".
> I also tried the Spark connector to validate that my multi-threaded test
> app is not the bottleneck. It is not.
>
> I expected some kind of elasticity, but I see none. It feels like I am
> doing something wrong...
>
> On 17 August 2017 at 00:19, Dor Laor wrote:
>
>> Hi Alex,
>>
>> You probably didn't get the parallelism right. A serial scan has a
>> parallelism of one. If the parallelism isn't large enough, performance
>> will be slow. If the parallelism is too large, Cassandra and the disk
>> will thrash and have too many context switches.
>>
>> So you need to find your cluster's sweet spot. We documented the
>> procedure to do it in this blog:
>> http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
>> and the results are here:
>> http://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/
>> The algorithm should translate to Cassandra, but you'll have to use
>> different rules of thumb.
>>
>> Best,
>> Dor
>>
>> On Wed, Aug 16, 2017 at 9:50 AM, Alex Kotelnikov <
>> alex.kotelnikov@diginetica.com> wrote:
>>
>>> Hey,
>>>
>>> we are trying Cassandra as an alternative for storing a huge stream of
>>> data coming from our customers.
>>>
>>> Storing works quite fine, and I started to validate how retrieval
>>> does. We have two types of that: fetching specific records, and bulk
>>> retrieval for general analysis.
>>> Fetching a single record works like a charm. But it is not so with
>>> bulk fetch.
>>>
>>> With a moderately small table of ~2 million records, ~10Gb raw data, I
>>> observed very slow operation (using token(partition key) ranges). It
>>> takes minutes to perform a full retrieval.
>>> We tried a couple of configurations using virtual machines and real
>>> hardware, and overall it looks like it is not possible to fetch all
>>> table data in a reasonable time (by reasonable I mean that since we
>>> have a 1Gbit network, 10Gb can be transferred in a couple of minutes
>>> from one server to another, and with 10+ Cassandra servers and 10+
>>> Spark executors the total time should be even smaller).
>>>
>>> I tried the DataStax Spark connector. I also wrote a simple test case
>>> using the DataStax Java driver and saw that a fetch of 10k records
>>> takes ~10s, so I assume that a "sequential" scan would take 200x more
>>> time, equal to ~30 minutes.
>>>
>>> Maybe we are totally wrong trying to use Cassandra this way?
>>>
>>> --
>>> Best Regards,
>>>
>>> Alexander Kotelnikov
>>> Team Lead
>>>
>>> DIGINETICA
>>> Retail Technology Company
>>>
>>> m: +7.921.915.06.28
>>> www.diginetica.com
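[Editor's note] The subrange construction Alex describes (splitting the full token space into start-end ranges, querying the last one open-ended) can be sketched as below. This assumes the default Murmur3Partitioner, whose tokens span the full signed 64-bit range; `TokenRanges` and `split` are illustrative names, not part of any driver API:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class TokenRanges {
    // Murmur3Partitioner tokens cover the full signed 64-bit range.
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /**
     * Split the full token space into n contiguous [start, end) subranges.
     * BigInteger avoids overflow: the total span (2^64) does not fit in a long.
     * The last range absorbs the rounding remainder and ends at MAX.
     */
    static List<long[]> split(int n) {
        BigInteger step = MAX.subtract(MIN).add(BigInteger.ONE)
                             .divide(BigInteger.valueOf(n));
        List<long[]> ranges = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 0; i < n; i++) {
            BigInteger end = (i == n - 1) ? MAX : start.add(step);
            ranges.add(new long[] { start.longValueExact(), end.longValueExact() });
            start = end;
        }
        return ranges;
    }
}
```

Each pair would feed the `WHERE token(user_id) >= %d AND token(user_id) < %d` clause from the message above, with the final range queried using only the lower bound (as Alex does) so the top token is not skipped.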
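[Editor's note] The "multiple threads, up to 40" fan-out, and Dor's point about tuning parallelism to the cluster's sweet spot, can be sketched with a bounded thread pool. Here `fetchRange` is a hypothetical stand-in for executing one range query against the cluster (e.g. returning a row count), so the pool size is the only knob to tune:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongBinaryOperator;

public class ParallelScan {
    /**
     * Submit one task per [start, end) token subrange to a fixed-size pool
     * and sum the per-range results. The pool size caps how many range
     * queries hit the cluster concurrently.
     */
    static long scan(List<long[]> ranges, int parallelism, LongBinaryOperator fetchRange)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (long[] r : ranges) {
                futures.add(pool.submit(() -> fetchRange.applyAsLong(r[0], r[1])));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // propagates any per-range failure
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Sweeping `parallelism` while measuring wall-clock time is the sweet-spot search the ScyllaDB posts describe: too few threads leave the cluster idle, too many cause thrashing and timeouts.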