Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of rwille@fold3.com designates
 38.101.149.73 as permitted sender)
User-Agent: Microsoft-MacOutlook/14.3.6.130613
Date: Wed, 11 Dec 2013 08:03:44 -0700
Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java
 application?
From: Robert Wille <rwille@fold3.com>
To: <user@cassandra.apache.org>
Message-ID: <CECDCBE7.C1D15%rwille@fold3.com>
Thread-Topic: What is the fastest way to get data into Cassandra 2 from a Java
 application?
In-Reply-To: 
 <CAKkz8Q1H6YWU1rbqxMZwy9UMiQJTOP46r-fCD13D-9nau2jaeQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="B_3469593829_14883644"

--B_3469593829_14883644
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable

Very good point. I've written code to do a very large number of inserts, but
I've only ever run it on a single-node cluster. I may very well find out
when I run it against a multinode cluster that the performance benefits of
large unlogged batches mostly go away.

From:  Sylvain Lebresne <sylvain@datastax.com>
Reply-To:  <user@cassandra.apache.org>
Date:  Wednesday, December 11, 2013 at 6:52 AM
To:  "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject:  Re: What is the fastest way to get data into Cassandra 2 from a
Java application?

On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille <rwille@fold3.com> wrote:
> Network latency is the reason why the batched query is fastest. One trip to
> Cassandra versus 1000. If you execute the inserts in parallel, then that
> eliminates the latency issue.

While it is true that a batch means only one client-server round trip, I'll
note that, provided you use the TokenAware load balancing policy, doing the
parallelization client side will save you intra-replica round trips, which
using a big batch won't. So it might not be all that clear which one is
faster. Very large batches also have the disadvantage that you are more
likely to get a timeout (and if you do, you have to retry the whole batch,
even though most of it has probably been inserted correctly). Overall, the
best option is probably to parallelize the inserts of reasonably sized
batches, but the right batch size is likely very use-case dependent, so
you'll have to test.
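
To make "reasonably sized batches" concrete, here is a minimal sketch of just the splitting step, using only the JDK. The class name BatchChunker and its methods are hypothetical helpers, not part of any driver, and the chunk size you pass in is something you would have to find by testing, as noted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits INSERT statements into fixed-size UNLOGGED batches. */
final class BatchChunker {

    /** Returns consecutive sublists of at most chunkSize elements each. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    /** Wraps the given CQL statements in BEGIN UNLOGGED BATCH ... APPLY BATCH. */
    static String toUnloggedBatch(List<String> statements) {
        StringBuilder b = new StringBuilder("BEGIN UNLOGGED BATCH\n");
        for (String s : statements) {
            b.append(s).append(";\n");
        }
        return b.append("APPLY BATCH").toString();
    }
}
```

Each string returned by toUnloggedBatch could then be sent as its own request, so a timeout only forces a retry of that one chunk rather than of all the inserts.
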

--
Sylvain

>
> From:  Sylvain Lebresne <sylvain@datastax.com>
> Reply-To:  <user@cassandra.apache.org>
> Date:  Wednesday, December 11, 2013 at 5:40 AM
> To:  "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject:  Re: What is the fastest way to get data into Cassandra 2 from a Java
> application?
>
> Then I suspect that this is an artifact of your test methodology. Prepared
> statements *are* faster than non-prepared ones in general. They save some
> parsing and some bytes on the wire. The savings will tend to be bigger for
> bigger queries, and it's possible that for very small queries (like the one
> you are testing) the performance difference is somewhat negligible, but
> seeing non-prepared statements being significantly faster than prepared ones
> almost surely means you're doing something wrong (of course, a bug in either
> the driver or C* is always possible, and always make sure to test recent
> versions, but I'm not aware of any such bug).
>
> For instance, are you sure you are warming up the JVMs (client and drivers)
> properly? 1000 iterations is *really small*; if you're not warming things
> up properly, you're not measuring anything relevant. Also, are you including
> the preparation of the query itself in the timing? Preparing a query is not
> particularly fast, but it's meant to be done just once at the beginning of
> the application lifetime. With only 1000 iterations, if you include the
> preparation in the timing, it's entirely possible it's eating a good chunk
> of the whole time.
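
A rough sketch of what such a measurement could look like, using only the JDK: one-time setup happens before the call, a warm-up loop runs untimed, and only the measured loop is timed. InsertBenchmark and timeNanos are hypothetical names for illustration, and this is no substitute for a real benchmarking harness.

```java
import java.util.function.IntConsumer;

/** Hypothetical timing helper: warm up first, then time only the measured loop. */
final class InsertBenchmark {

    /**
     * Calls op warmupIterations times without timing, then returns the
     * elapsed nanoseconds for measuredIterations timed calls. The argument
     * passed to op is a running index, so warm-up and measured calls can
     * write distinct rows.
     */
    static long timeNanos(IntConsumer op, int warmupIterations, int measuredIterations) {
        for (int i = 0; i < warmupIterations; i++) {
            op.accept(i); // untimed: gives the JIT a chance to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredIterations; i++) {
            op.accept(warmupIterations + i);
        }
        return System.nanoTime() - start;
    }
}
```

With the code from this thread, you would call session.prepare(...) once before timeNanos and pass something like i -> session.execute(ps.bind("" + i, "aa" + i)) as op, so the preparation cost stays outside the timed region.
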
>
> But prepared versus non-prepared aside, you won't get proper performance
> unless you parallelize your inserts. Unlogged batches are one way to do it
> (parallelizing is really all Cassandra does with an unlogged batch). But as
> John Sanda mentioned, another option is to do the parallelization client
> side, with executeAsync.
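
One way to sketch client-side parallelization is to cap the number of writes in flight with a semaphore. The code below uses only the JDK; BoundedAsyncWriter is a hypothetical name, and the asyncWrite function stands in for a call such as session.executeAsync(...). Adapting the driver's own future type to CompletableFuture is left out here and depends on the driver version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical helper: issues asynchronous writes while capping the number in flight. */
final class BoundedAsyncWriter {

    /**
     * Submits every item through asyncWrite, allowing at most maxInFlight
     * unfinished writes at a time, and blocks until all have completed.
     * Returns the number of writes that completed without an exception.
     */
    static <T> int writeAll(List<T> items,
                            Function<T, CompletableFuture<?>> asyncWrite,
                            int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<Boolean>> results = new ArrayList<>();
        for (T item : items) {
            permits.acquireUninterruptibly(); // blocks while maxInFlight writes are pending
            CompletableFuture<Boolean> done = asyncWrite.apply(item)
                    .handle((value, error) -> error == null);
            done.whenComplete((ok, error) -> permits.release());
            results.add(done);
        }
        int succeeded = 0;
        for (CompletableFuture<Boolean> r : results) {
            if (r.join()) {
                succeeded++;
            }
        }
        return succeeded;
    }
}
```

The cap matters because firing all requests at once can overload the cluster and cause timeouts; a suitable value for maxInFlight is, again, something to determine by testing.
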
>
> --
> Sylvain
>
>
>
> On Wed, Dec 11, 2013 at 11:37 AM, David Tinker <david.tinker@gmail.com> wrote:
>> Yes, that's what I found.
>>
>> This is faster:
>>
>> for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")
>>
>> Than this:
>>
>> def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
>> for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[]))
>>
>> This is the fastest option of all (hand-rolled batch):
>>
>> StringBuilder b = new StringBuilder()
>> b.append("BEGIN UNLOGGED BATCH\n")
>> for (int i = 0; i < 1000; i++) {
>>     b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','")
>>             .append("aa").append(i).append("')\n")
>> }
>> b.append("APPLY BATCH\n")
>> session.execute(b.toString())
>>
>>
>> On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne <sylvain@datastax.com>
>> wrote:
>>> >
>>>> >> This loop takes 2500ms or so on my test cluster:
>>>> >>
>>>> >> PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
>>>> >> for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));
>>>> >>
>>>> >> The same loop with the parameters inline is about 1300ms. It gets
>>>> >> worse if there are many parameters.
>>> >
>>> >
>>> > Do you mean that:
>>> >   for (int i = 0; i < 1000; i++)
>>> >       session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')");
>>> > is twice as fast as using a prepared statement? And that the difference
>>> > is even greater if you add more columns than "id" and "info"?
>>> >
>>> > That would certainly be unexpected. Are you sure you're not re-preparing
>>> > the statement every time in the loop?
>>> >
>>> > --
>>> > Sylvain
>>> >
>>>> >> I know I can use batching to
>>>> >> insert all the rows at once, but that's not the purpose of this test. I
>>>> >> also tried using session.execute(cql, params) and it is faster but
>>>> >> still doesn't match inline values.
>>>> >>
>>>> >> Composing CQL strings is certainly convenient and simple, but is there
>>>> >> a much faster way?
>>>> >>
>>>> >> Thanks
>>>> >> David
>>>> >>
>>>> >> I have also posted this on Stack Overflow if anyone wants the points:
>>>> >>
>>>> >> http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
>>> >
>>> >
>>
>>
>>
>> --
>> http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ
>> Integration
>



--B_3469593829_14883644--