From: vbalaji@apache.org
To: dev@hudi.apache.org
Date: Sat, 22 Jun 2019 13:28:57 +0000 (UTC)
Subject: Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Here is the correct gist link: https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626

On Saturday, June 22, 2019, 6:08:48 AM PDT, vbalaji@apache.org wrote:

Hi,

I have given a sample command to set up and run DeltaStreamer in
continuous mode and ingest fake data in the following gist:
https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7

We will eventually get this onto the project wiki.

Balaji.V

On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan wrote:

@Vinoth, thanks, that would be great if Balaji could share it.

Kind regards,

On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar wrote:

> Hi,
>
> We usually test with our production workloads. However, Balaji recently
> merged a DistributedTestDataSource,
>
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>
> that can generate some random data for testing. Balaji, do you mind
> sharing a command that can be used to kick something off like that?
>
>
> On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan wrote:
>
> > Dear Vinoth,
> >
> > I want to check out the performance comparison of upsert and bulk
> > insert, but I couldn't find a clean data set larger than 10 GB.
> > Would it be possible to get a data set from the Hudi team? For
> > example, I was using the stocks data that you provided in your demo.
> > Could I get more GBs of that dataset for my experiment?
> >
> > Thanks for your consideration.
> >
> > Kind regards,
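
A minimal Scala sketch of how such an upsert-versus-bulk-insert comparison
could be driven through the datasource writer, assuming the com.uber.hoodie
option names used later in this thread and that inputDF, tableName, and
basePath are already defined (timedWrite is a hypothetical helper, not a
Hudi API):

    import com.uber.hoodie.DataSourceWriteOptions
    import com.uber.hoodie.config.HoodieWriteConfig
    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Sketch: one timed write per operation; all other options held constant.
    def timedWrite(df: DataFrame, operation: String): Long = {
      val start = System.nanoTime()
      df.write
        .format("com.uber.hoodie")
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY, operation)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
        .option(HoodieWriteConfig.TABLE_NAME, tableName)
        .mode(SaveMode.Append)
        .save(basePath)
      (System.nanoTime() - start) / 1000000L // elapsed millis
    }

    val bulkMs   = timedWrite(inputDF, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) // "bulk_insert"
    val upsertMs = timedWrite(inputDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)      // "upsert"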
> >
> > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar wrote:
> >
> > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > >
> > > Just circling back with the resolution on the mailing list as well.
> > >
> > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> > >
> > > > Dear Vinoth,
> > > >
> > > > Thanks for your fast response.
> > > > I have created a new issue called "Performance Comparison of
> > > > HoodieDeltaStreamer and DataSourceAPI #714" with the screenshots
> > > > of the Spark UI, which can be found at the following link:
> > > > https://github.com/apache/incubator-hudi/issues/714.
> > > > In the UI, it seems that the ingestion with the data source API is
> > > > spending much time in the countByKey of HoodieBloomIndex and the
> > > > workload profile. Looking forward to receiving insights from you.
> > > >
> > > > Kind regards,
> > > >
> > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Both the datasource and the deltastreamer use the same APIs
> > > > > underneath, so not sure. If you can grab screenshots of the Spark
> > > > > UI for both and open a ticket, glad to take a look.
> > > > >
> > > > > On 2, well, one of the goals of Hudi is to break this dichotomy
> > > > > and enable streaming-style processing (I call it incremental
> > > > > processing) even in a batch job. MOR is in production at Uber.
> > > > > At the moment, MOR is lacking just one feature (incremental pull
> > > > > using log files) that Nishith is planning to merge soon. PR #692
> > > > > enables the Hudi DeltaStreamer to ingest continuously while
> > > > > managing compaction etc. in the same job. I already knocked off
> > > > > some index performance problems and am working on indexing the
> > > > > log files, which should unlock near-real-time ingest.
> > > > >
> > > > > Putting all these together, within a month or so the near-real-time
> > > > > MOR vision should be very real. Of course, we need community help
> > > > > with dev and testing to speed things up. :)
> > > > >
> > > > > Hope that gives you a clearer picture.
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> > > > >
> > > > > > Thanks, Vinoth
> > > > > >
> > > > > > It's working now. But I have 2 questions:
> > > > > > 1. The ingestion latency of using the DataSource API with the
> > > > > > HoodieSparkSQLWriter is high compared to using the delta
> > > > > > streamer. Why is it slow? Are there specific options we could
> > > > > > set to minimize the ingestion latency?
> > > > > > For example: when I run the delta streamer it takes about 1
> > > > > > minute to insert some data. If I use the DataSource API with
> > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize
> > > > > > this?
> > > > > > 2. Where do we categorize Hudi in general (is it batch
> > > > > > processing or streaming)? I am asking this because currently
> > > > > > copy-on-write is the one that is fully working, and since
> > > > > > merge-on-read, which would enable near-real-time analytics, is
> > > > > > not fully done, can we consider Hudi a batch job?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vinoth@apache.org> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Short answer: by default, any parameter you pass in using
> > > > > > > option(k, v) or options() beginning with "_" would be saved
> > > > > > > to the commit metadata. You can change the "_" prefix to
> > > > > > > something else by using
> > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > The reason you are not seeing the checkpointstr inside the
> > > > > > > commit metadata is that it is just supposed to be a prefix
> > > > > > > for all such commit metadata.
> > > > > > >
> > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
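
In other words, a minimal sketch of the usage described above, assuming
the default "_" prefix is kept and using "_checkpoint" as a purely
illustrative key name: the checkpoint goes in as its own prefixed option,
not as the value of the prefix key.

    // Sketch: the prefix option stays at its default ("_"); any option
    // whose key starts with that prefix is picked up by the filter shown
    // above and saved into the commit metadata.
    inputDF.write
      .format("com.uber.hoodie")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
      .option(HoodieWriteConfig.TABLE_NAME, tableName)
      .option("_checkpoint", checkpointstr) // illustrative "_"-prefixed key
      .mode(SaveMode.Append)
      .save(basePath)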
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> > > > > > >
> > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> > > > > > > > from any dataframe into a Hudi-modeled table. It creates
> > > > > > > > everything correctly, but I also want to save the
> > > > > > > > checkpoint, and I couldn't, even though I am passing it as
> > > > > > > > an argument.
> > > > > > > >
> > > > > > > > inputDF.write()
> > > > > > > >   .format("com.uber.hoodie")
> > > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > > > > > > >   .mode(SaveMode.Append)
> > > > > > > >   .save(basePath);
> > > > > > > >
> > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting
> > > > > > > > the checkpoint while using the dataframe writer, but I
> > > > > > > > couldn't add the checkpoint metadata into the .hoodie
> > > > > > > > metadata. Is there a way I can add the checkpoint metadata
> > > > > > > > while using the dataframe writer API?
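
For completeness, a sketch of one way to verify that the prefixed options
were persisted, assuming a spark-shell where spark and basePath are in
scope, and assuming the 0.4.x-era layout in which each completed commit is
a pretty-printed JSON file under .hoodie whose saved options appear in an
"extraMetadata" field:

    // Sketch: commit files are multi-line JSON, hence multiLine = true;
    // the "extraMetadata" field name follows the HoodieCommitMetadata
    // model of that era (an assumption).
    spark.read
      .option("multiLine", "true")
      .json(s"$basePath/.hoodie/*.commit")
      .select("extraMetadata")
      .show(truncate = false)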