From: "袁康（梓悠）"
To: "user"
Reply-To: user@kudu.incubator.apache.org
Date: Mon, 04 Jul 2016 17:46:24 +0800
Subject: Re: Performance Question
Message-ID: <48ae4858-34bc-4456-b12f-4018236e53fb.yuankang.yk@alibaba-inc.com>
In-Reply-To: CADY20s7=O_XV9x=NPSo+4ZsbFm0bAQ5kCyAj=F0btc-c2hON=Q@mail.gmail.com
References: <55B8BF95-5704-46CA-A336-64EE4D2B91B2@gmail.com> <0A7D041A-A72D-4151-9476-BCCEC157C5E4@gmail.com> <2E7BBD97-2A48-49F8-AE0C-F7CF6D463EF6@gmail.com> <0175380F-7464-4CD9-BB01-77164A109592@gmail.com>,CADY20s7=O_XV9x=NPSo+4ZsbFm0bAQ5kCyAj=F0btc-c2hON=Q@mail.gmail.com
Mailing-List: contact user-help@kudu.incubator.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=ALIBOUNDARY_14886_50262940_577a3070_30086"
archived-at: Mon, 04 Jul 2016 09:46:51 -0000
------=ALIBOUNDARY_14886_50262940_577a3070_30086
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

How can I delete data in a Kudu table with Spark (without deleting the table itself)?
------------------------------------------------------------------
From: Todd Lipcon <todd@cloudera.com>
Sent: Saturday, July 2, 2016, 02:44
To: user <user@kudu.incubator.apache.org>
Subject: Re: Performance Question

On Thu, Jun 30, 2016 at 5:39 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
Hi Todd,

I changed the key to be what you suggested, and I can’t tell the difference since it was already fast. But, I did get more numbers.

Yea, you won't see a substantial difference until you're inserting billions of rows, etc., and the keys and/or bloom filters no longer fit in cache.
 

> 104M rows in Kudu table
- read: 8s
- count: 16s
- aggregate: 9s

Reads took much longer, going from 0.2s to 8s; counts stayed the same at 16s; and aggregate queries took longer, going from 6s to 9s.

I’m still impressed.

We aim to please ;-) If you have any interest in writing up these experiments as a blog post, it would be cool to post them for others to learn from.

-Todd
 
On Jun 15, 2016, at 12:47 AM, Todd Lipcon <todd@cloudera.com> wrote:

Hi Benjamin,

What workload are you using for benchmarks? Using Spark or something more custom? RDD or DataFrame or SQL, etc.? Maybe you can share the schema and some queries.

Todd


On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
Hi Todd,

Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. Compared to HBase, read and write performance are better. Write performance has the greatest improvement (> 4x), while read is > 1.5x. Admittedly, these are only preliminary tests. Do you know of a way to really do some conclusive tests? I want to see if I can match your results on my 50 node cluster.

Thanks,
Ben

On May 30, 2016, at 10:33 AM, Todd Lipcon <todd@cloudera.com> wrote:

On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
Todd,

It sounds like Kudu can possibly top or match those numbers put out by Aerospike. Do you have any performance statistics published, or any instructions on how to measure them myself as a good way to test? In addition, this will be a test using Spark, so should I wait for Kudu version 0.9.0, where support will be built in?

We don't have a lot of benchmarks published yet, especially on the write side. I've found that thorough cross-system benchmarks are very difficult to do fairly and accurately, and oftentimes users end up misguided if they pay too much attention to them :) So, given a finite number of developers working on Kudu, I think we've tended to spend more time on the project itself and less time focusing on "competition". I'm sure there are use cases where Kudu will beat out Aerospike, and probably use cases where Aerospike will beat Kudu as well.

From my perspective, it would be great if you can share some details of your workload, especially if there are some areas where you're finding Kudu lacking. Maybe we can spot some easy code changes we could make to improve performance, or suggest a tuning variable you could change.

-Todd


On May 27, 2016, at 9:19 PM, Todd Lipcon <todd@cloudera.com> wrote:
On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
Hi Mike,

First of all, thanks for the link. It looks like an interesting read. I checked that Aerospike is currently at version 3.8.2.3, and in the article, they are evaluating version 3.5.4. The main thing that impressed me was their claim that they can beat Cassandra and HBase by 8x for writing and 25x for reading. Their big claim to fame is that Aerospike can write 1M records per second with only 50 nodes. I wanted to see if this is real.

1M records per second on 50 nodes is pretty doable by Kudu as well, depending on the size of your records and the insertion order. I've been playing with a ~70 node cluster recently and seen 1M+ writes/second sustained, and bursting above 4M. These are 1KB rows with 11 columns, and with pretty old HDD-only nodes. I think newer flash-based nodes could do better.
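
To put the figures above in per-node terms, here is a rough back-of-envelope sketch. It uses only the numbers quoted in this thread (1M records/second on 50 nodes, and 1M+ sustained / 4M burst on ~70 nodes with 1KB rows). The function names are made up for illustration, "1KB" is assumed to mean 1024 bytes, and replication and other write overhead are ignored, so treat the results as order-of-magnitude estimates only.

```python
def per_node_rows(total_rows_per_sec: float, nodes: int) -> float:
    """Average rows/second each node must absorb (assumes an even spread)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_rows_per_sec / nodes


def per_node_bytes(total_rows_per_sec: float, nodes: int, row_bytes: int = 1024) -> float:
    """Average bytes/second per node; row_bytes=1024 is an assumption for '1KB rows'."""
    return per_node_rows(total_rows_per_sec, nodes) * row_bytes


if __name__ == "__main__":
    # (label, cluster-wide rows/second, node count), all taken from the thread.
    scenarios = [
        ("1M/s on 50 nodes", 1_000_000, 50),
        ("1M/s sustained on ~70 nodes", 1_000_000, 70),
        ("4M/s burst on ~70 nodes", 4_000_000, 70),
    ]
    for label, rate, nodes in scenarios:
        rows = per_node_rows(rate, nodes)
        mib = per_node_bytes(rate, nodes) / (1024 * 1024)
        print(f"{label}: {rows:,.0f} rows/s per node, about {mib:.1f} MiB/s per node")
```

Seen this way, the 50-node claim works out to about 20,000 rows/second per node.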
 

To answer your questions, we have a DMP with user profiles with many attributes. We create segmentation information off of these attributes to classify them. Then, we can target advertising appropriately for our sales department. Much of the data processing is for applying models on all, or at least most, of every profile’s attributes to find similarities (nearest neighbor/clustering), either over a large number of rows when batch processing or over a small subset of rows for quick online scoring. So, our use case is a typical advanced analytics scenario. We have tried HBase, but it doesn’t work well for these types of analytics.

I read in the Aerospike release notes that they made many improvements for batch and scan operations.

I wonder what your thoughts are for using Kudu for this.

Sounds like a good Kudu use case to me. I've heard great things about Aerospike for the low latency random access portion, but I've also heard that it's _very_ expensive, and not particularly suited to the columnar scan workload. Lastly, I think the Apache license of Kudu is much more appealing than the AGPL3 used by Aerospike. But, that's not really a direct answer to the performance question :)

Thanks,
Ben


On May 27, 2016, at 6:21 PM, Mike Percy <mpercy@cloudera.com> wrote:

Have you considered whether you have a scan-heavy or a random-access-heavy workload? Have you considered whether you always access / update a whole row vs. only a partial row? Kudu is a column store, so it has some awesome performance characteristics when you are doing a lot of scanning of just a couple of columns.
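
As a toy illustration of why scanning only a couple of columns favors a column store, the sketch below compares how many values get touched when summing one field in a row-oriented layout versus a column-oriented one. This is a deliberately simplified model and not how Kudu actually stores data on disk; the field names and helper functions are hypothetical.

```python
def make_rows(n):
    """Build n hypothetical user-profile records in a row-oriented layout."""
    return [
        {"user_id": i, "age": 20 + i % 50, "country": "US", "score": i * 0.5}
        for i in range(n)
    ]


def to_columns(rows):
    """Pivot the same records into a column-oriented layout (one list per field)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_from_rows(rows, field):
    """Row layout: every field of every row is read even though one is needed."""
    total, touched = 0.0, 0
    for row in rows:
        touched += len(row)  # the whole row is fetched
        total += row[field]
    return total, touched


def sum_from_columns(columns, field):
    """Column layout: only the requested column is read."""
    column = columns[field]
    return float(sum(column)), len(column)


if __name__ == "__main__":
    rows = make_rows(1000)
    columns = to_columns(rows)
    print("row layout   :", sum_from_rows(rows, "score"))
    print("column layout:", sum_from_columns(columns, "score"))
```

Both layouts return the same sum, but in this toy model the row layout touches four values per record while the column layout touches one. The gap widens as profiles gain more attributes, which is the situation described for the user-profile workload in this thread.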

I don't know the answer to your question, but if your concern is performance then I would be interested in seeing comparisons from a perf perspective on certain workloads.

Finally, a year ago Aerospike did quite poorly in a Jepsen test: https://aphyr.com/posts/324-jepsen-aerospike

I wonder if they have addressed any of those issues.
Mike

On Friday, May 27, 2016, Benjamin Kim <bbuild11@gmail.com> wrote:
I am just curious. How will Kudu compare with Aerospike (http://www.aerospike.com)? I went to a Spark Roadshow and found out about this piece of software. It appears to fit our use case perfectly since we are an ad-tech company trying to leverage our user profile data. Plus, it already has a Spark connector and a SQL-like client. The tables can be accessed using Spark SQL DataFrames and also made into SQL tables for direct use with the Spark SQL ODBC/JDBC Thriftserver. I see from the work done here, http://gerrit.cloudera.org:8080/#/c/2992/, that the Spark integration is well underway and, from the looks of it lately, almost complete. I would prefer to use Kudu since we are already a Cloudera shop, and Kudu is easy to deploy and configure using Cloudera Manager. I also hope that some of Aerospike’s speed optimization techniques can make it into Kudu in the future, if they have not already been thought of or included.

Just some thoughts…
Cheers,
Ben


--
Mike Percy
Software Engineer, Cloudera






-- 
Todd Lipcon
Software Engineer, Cloudera



-- 
Todd Lipcon
Software Engineer, Cloudera





--
Todd Lipcon
Software Engineer, Cloudera

------=ALIBOUNDARY_14886_50262940_577a3070_30086--