Mailing-List: contact user-help@kudu.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kudu.incubator.apache.org
Date: Mon, 04 Jul 2016 17:46:24 +0800
From: "=?UTF-8?B?6KKB5bq377yI5qKT5oKg77yJ?=" <yuankang.yk@alibaba-inc.com>
To: "user" <user@kudu.incubator.apache.org>
Reply-To: "=?UTF-8?B?6KKB5bq377yI5qKT5oKg77yJ?=" <yuankang.yk@alibaba-inc.com>
Message-ID: <48ae4858-34bc-4456-b12f-4018236e53fb.yuankang.yk@alibaba-inc.com>
Subject: Re: Performance Question
MIME-Version: 1.0
References: <55B8BF95-5704-46CA-A336-64EE4D2B91B2@gmail.com> <CADXBggeOdWwM5gxyfBUc5pipEhUCaSM_Qc0eGj+rU1N6vWA2Mg@mail.gmail.com> <0A7D041A-A72D-4151-9476-BCCEC157C5E4@gmail.com> <CADY20s5NLVxV3Wdi1mSmU6Q+ip4juvc0t1yMJ6Bx51qtdei2nw@mail.gmail.com> <BDB4BC8D-0E13-4958-83B3-2BE3C1FE9979@gmail.com> <CADY20s6N037deZD7PxG1M5+S8_mV=hc3R4_J=nfWSCFXMBgeEg@mail.gmail.com> <2E7BBD97-2A48-49F8-AE0C-F7CF6D463EF6@gmail.com> <CADY20s56TShS5nV2BDN3EfefLuAjsCdDs4gjXg=9x+QOm_uRRQ@mail.gmail.com> <0175380F-7464-4CD9-BB01-77164A109592@gmail.com>,CADY20s7=O_XV9x=NPSo+4ZsbFm0bAQ5kCyAj=F0btc-c2hON=Q@mail.gmail.com
In-Reply-To: CADY20s7=O_XV9x=NPSo+4ZsbFm0bAQ5kCyAj=F0btc-c2hON=Q@mail.gmail.com
Content-Type: multipart/alternative;
  boundary="----=ALIBOUNDARY_14886_50262940_577a3070_30086"
archived-at: Mon, 04 Jul 2016 09:46:51 -0000

------=ALIBOUNDARY_14886_50262940_577a3070_30086
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

How can I delete data from a Kudu table with Spark, without dropping the =
table itself?=0A=
------------------------------------------------------------------=0A=
From: Todd Lipcon <todd@cloudera.com>=0A=
Sent: Saturday, July 2, 2016 02:44=0A=
To: user <user@kudu.incubator.apache.org>=0A=
Subject: Re: Performance Question=0AOn Thu,=
 Jun 30, 2016 at 5:39 PM, Benjamin Kim <bbuild11@gmail.com> wrote:=0AHi Todd,=0A=
I changed the key to be what you suggested, and I can=E2=80=99t tell the diffe=
rence since it was already fast. But, I did get more numbers.=0AYea, you won't=
 see a substantial difference until you're inserting billions of rows, etc, an=
d the keys and/or bloom filters no longer fit in cache.=C2=A0=0A=
> 104M rows in Kudu table=0A=
- read: 8s=0A=
- count: 16s=0A=
- aggregate: 9s=0A=
Reads took much longer, going from 0.2s to 8s; counts stayed the same =
at 16s; and aggregate queries took longer, going from 6s to 9s.=0A=
I=E2=80=99m still impressed.=0A=
We aim to please ;-) If =
you have any interest in writing up these experiments as a blog post, it =
would be cool to post them for others to learn from.=0A-Todd=0A=0A=
On Jun 15, 2016, at 12:47 AM, Todd Lipcon <todd@cloudera.com> wrote:=0A=
Hi Benjamin,=0A=
What workload are you using for benchmarks? Are you using Spark or =
something more custom? RDD, DataFrame, SQL, etc.? Maybe you can share =
the schema and some queries.=0ATodd=0A=0A=
On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:=0A=
Hi Todd,=0A=
Now that Kudu 0.9.0 is out, I have done some tests. Already, I am =
impressed. Compared to HBase, both read and write performance are =
better. Write performance shows the greatest improvement (> 4x), while =
read is > 1.5x. These are only preliminary tests, though. Do you know =
of a way to run more conclusive tests? I want to see if I can match =
your results on my 50-node cluster.=0AThanks,=0ABen=0A=
=0AOn May 30, 2016, at 10:33 AM, Todd Lipcon <todd@cloudera.com> wrote:=0A=
On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim=C2=A0<bbuild11@gmail.com>=C2=A0=
wrote:=0ATodd,=0A=
It sounds like Kudu can possibly top or match those numbers put out by =
Aerospike. Do you have any published performance statistics, or =
instructions for measuring them myself as a good way to test? In =
addition, this will be a test using Spark, so should I wait for Kudu =
version 0.9.0, where support will be built in?=0A=
We don't have a lot of benchmarks published yet, especially =
on the write side. I've found that thorough cross-system benchmarks are very d=
ifficult to do fairly and accurately, and often times users end up misguided i=
f they pay too much attention to them :) So, given a finite number of develope=
rs working on Kudu, I think we've tended to spend more time on the project its=
elf and less time focusing on "competition". I'm sure there are use cases wher=
e Kudu will beat out Aerospike, and probably use cases where Aerospike will be=
at Kudu as well.=0AFrom my perspective, it would be great if you can share som=
e details of your workload, especially if there are some areas you're finding =
Kudu lacking. Maybe we can spot some easy code changes we could make to improv=
e performance, or suggest a tuning variable you could change.=0A-Todd=0A=0AOn =
May 27, 2016, at 9:19 PM, Todd Lipcon <todd@cloudera.com> wrote:=0AOn Fri, May=
 27, 2016 at 8:20 PM, Benjamin Kim=C2=A0<bbuild11@gmail.com>=C2=A0wrote:=0AHi =
Mike,=0AFirst of all, thanks for the link. It looks like an interesting read. =
I checked that Aerospike is currently at version 3.8.2.3, and in the article, =
they are evaluating version 3.5.4. The main thing that impressed me was their =
claim that they can beat Cassandra and HBase by 8x for writing and 25x for rea=
ding. Their big claim to fame is that Aerospike can write 1M records per secon=
d with only 50 nodes. I wanted to see if this is real.=0A1M records per second=
 on 50 nodes is pretty doable by Kudu as well, depending on the size of your r=
ecords and the insertion order. I've been playing with a ~70 node cluster rece=
ntly and seen 1M+ writes/second sustained, and bursting above 4M. These are 1K=
B rows with 11 columns, and with pretty old HDD-only nodes. I think newer flas=
h-based nodes could do better.=C2=A0=0ATo answer your questions, we have a DMP=
 with user profiles with many attributes. We create segmentation information o=
ff of these attributes to classify them. Then, we can target advertising appro=
priately for our sales department. Much of the data processing applies =
models to all, or at least most, of every profile=E2=80=99s attributes =
to find similarities (nearest neighbor/clustering), either over a large =
number of rows when batch processing or over a small subset of rows for =
quick online scoring. So, our use case is a typical advanced analytics =
scenario. We have tried HBase, but it doesn=E2=80=99t work well for =
these types of analytics.=0A=
I read in the Aerospike release notes that they made many improvements =
to batch and scan operations.=0A=
I wonder what your thoughts are on using Kudu for this.=0ASounds like a=
 good Kudu use case to me. I've heard great things about Aerospike for the low=
 latency random access portion, but I've also heard that it's _very_ expensive=
, and not particularly suited to the columnar scan workload. Lastly, I think t=
he Apache license of Kudu is much more appealing than the AGPL3 used by Aerosp=
ike. But, that's not really a direct answer to the performance question :)=C2=A0=
=0AThanks,Ben=0A=0AOn May 27, 2016, at 6:21 PM, Mike Percy <mpercy@cloudera.co=
m> wrote:=0AHave you considered whether you have a scan heavy or a random acce=
ss heavy workload? Have you considered whether you always access / update a wh=
ole row vs only a partial row? Kudu is a column store so has some awesome=C2=A0=
performance characteristics when you are doing a lot of scanning of just a cou=
ple of=C2=A0columns.=0AI don't know the answer to your question but if your co=
ncern is performance then I would be interested=C2=A0in seeing comparisons fro=
m a perf perspective on certain workloads.=0AFinally, a year ago=C2=A0Aerospik=
e did quite poorly in a Jepsen test:=C2=A0https://aphyr.com/posts/324-jepsen-a=
erospike=0AI wonder if they have addressed any of those issues.=0AMike=0A=0AOn=
 Friday, May 27, 2016, Benjamin Kim <bbuild11@gmail.com> wrote:=0AI am just cu=
rious. How will Kudu compare with Aerospike (http://www.aerospike.com)? I went=
 to a Spark Roadshow and found out about this piece of software. It appears to=
 fit our use case perfectly since we are an ad-tech company trying to leverage=
 our user profiles data. Plus, it already has a Spark connector and has a SQL-=
like client. The tables can be accessed using Spark SQL DataFrames and, also, =
made into SQL tables for direct use with Spark SQL ODBC/JDBC Thriftserver. I s=
ee from the work done here=C2=A0http://gerrit.cloudera.org:8080/#/c/2992/=C2=A0=
that the Spark integration is well underway and, from the looks of it lately, =
almost complete. I would prefer to use Kudu since we are already a Cloudera sh=
op, and Kudu is easy to deploy and configure using Cloudera Manager. I also ho=
pe that some of Aerospike=E2=80=99s speed optimization techniques can make it =
into Kudu in the future, if they have not been already thought of or included.=
=0A=0AJust some thoughts=E2=80=A6=0A=0ACheers,=0ABen=0A=0A=
--=C2=A0=0AMike Percy=0ASoftware Engineer, Cloudera=0A=0A=
--=C2=A0=0ATodd Lipcon=0ASoftware Engineer, Cloudera=0A
------=ALIBOUNDARY_14886_50262940_577a3070_30086--