From: Netsanet Gebretsadkan
Date: Sun, 23 Jun 2019 17:42:20 +0200
Subject: Re: Add checkpoint metadata while using HoodieSparkSQLWriter
To: dev@hudi.apache.org

Thanks Vbalaji. I will check it out.

Kind regards,

On Sat, Jun 22, 2019 at 3:29 PM vbalaji@apache.org wrote:

> Here is the correct gist link:
> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>
> On Saturday, June 22, 2019, 6:08:48 AM PDT, vbalaji@apache.org <vbalaji@apache.org> wrote:
>
> Hi,
> I have given a sample command to set up and run deltastreamer in
> continuous mode and ingest fake data in the following gist:
> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
> We will eventually get this onto the project wiki.
> Balaji.V
>
> On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
>
> @Vinoth, thanks, that would be great if Balaji could share it.
>
> Kind regards,
>
> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar wrote:
>
> > Hi,
> >
> > We usually test with our production workloads. However, Balaji recently
> > merged a DistributedTestDataSource,
> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> > that can generate some random data for testing. Balaji, do you mind
> > sharing a command that can be used to kick something off like that?
> >
> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> >
> > > Dear Vinoth,
> > >
> > > I want to check out the performance comparison of upsert and bulk
> > > insert, but I couldn't find a clean dataset larger than 10 GB.
> > > Would it be possible to get a dataset from the Hudi team? For example,
> > > I was using the stocks data that you provided in your demo. Could I
> > > get more GBs of that dataset for my experiment?
> > >
> > > Thanks for your consideration.
> > >
> > > Kind regards,
> > >
> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar wrote:
> > >
> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > > >
> > > > Just circling back with the resolution on the mailing list as well.
> > > >
> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> > > >
> > > > > Dear Vinoth,
> > > > >
> > > > > Thanks for your fast response.
> > > > > I have created a new issue, "Performance Comparison of
> > > > > HoodieDeltaStreamer and DataSourceAPI" (#714), with the screenshots
> > > > > of the Spark UI, which can be found at the following link:
> > > > > https://github.com/apache/incubator-hudi/issues/714
> > > > > In the UI, it seems that ingestion with the data source API is
> > > > > spending much time in the countByKey of HoodieBloomIndex and the
> > > > > workload profile. Looking forward to receiving insights from you.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Both the datasource and the deltastreamer use the same APIs
> > > > > > underneath, so not sure. If you can grab screenshots of the Spark
> > > > > > UI for both and open a ticket, glad to take a look.
> > > > > >
> > > > > > On 2, well, one of the goals of Hudi is to break this dichotomy
> > > > > > and enable streaming-style processing (I call it incremental
> > > > > > processing) even in a batch job. MOR is in production at Uber. At
> > > > > > the moment MOR is lacking just one feature (incremental pull using
> > > > > > log files) that Nishith is planning to merge soon. PR #692 enables
> > > > > > the Hudi DeltaStreamer to ingest continuously while managing
> > > > > > compaction etc. in the same job. I already knocked off some index
> > > > > > performance problems and am working on indexing the log files,
> > > > > > which should unlock near-real-time ingest.
> > > > > >
> > > > > > Putting all these together, within a month or so the near-real-time
> > > > > > MOR vision should be very real. Of course, we need community help
> > > > > > with dev and testing to speed things up. :)
> > > > > >
> > > > > > Hope that gives you a clearer picture.
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks, Vinoth.
> > > > > > >
> > > > > > > It's working now. But I have 2 questions:
> > > > > > > 1. The ingestion latency of using the DataSource API with
> > > > > > > HoodieSparkSQLWriter is high compared to using the delta
> > > > > > > streamer. Why is it slow? Are there specific options we could
> > > > > > > set to minimize the ingestion latency?
> > > > > > > For example: when I run the delta streamer it takes about 1
> > > > > > > minute to insert some data. If I use the DataSource API with
> > > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize
> > > > > > > this?
> > > > > > > 2. Where do we categorize Hudi in general (is it batch
> > > > > > > processing or streaming)? I am asking because currently copy on
> > > > > > > write is the one that is fully working, and since the merge on
> > > > > > > read functionality that would enable near-real-time analytics is
> > > > > > > not fully done, can we consider Hudi a batch job?
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vinoth@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Short answer: by default, any parameter you pass in using
> > > > > > > > option(k,v) or options() beginning with "_" will be saved to
> > > > > > > > the commit metadata. You can change the "_" prefix to something
> > > > > > > > else by using
> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > > The reason you are not seeing the checkpointstr inside the
> > > > > > > > commit metadata is that it is just supposed to be a prefix for
> > > > > > > > all such commit metadata.
> > > > > > > >
> > > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > > > >
> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> > > > > > > > > from any dataframe into a hoodie-modeled table. It creates
> > > > > > > > > everything correctly, but I also want to save the checkpoint,
> > > > > > > > > and I couldn't, even though I am passing it as an argument.
> > > > > > > > >
> > > > > > > > > inputDF.write()
> > > > > > > > >   .format("com.uber.hoodie")
> > > > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > > > > > > > >   .mode(SaveMode.Append)
> > > > > > > > >   .save(basePath);
> > > > > > > > >
> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > > > > > > > > checkpoint while using the dataframe writer, but I couldn't
> > > > > > > > > add the checkpoint metadata into the .hoodie metadata. Is
> > > > > > > > > there a way I can add the checkpoint metadata while using the
> > > > > > > > > dataframe writer API?
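
A minimal, untested sketch of the writer call implied by Vinoth's explanation above: the checkpoint value goes in under an option key that starts with the configured prefix ("_" by default), while COMMIT_METADATA_KEYPREFIX_OPT_KEY only renames that prefix and is not meant to carry the value itself. The key name "_checkpointstr" here is purely illustrative.

    inputDF.write()
        .format("com.uber.hoodie")
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
        .option(HoodieWriteConfig.TABLE_NAME, tableName)
        // "_"-prefixed key, so the writer should pick it up as extra commit metadata
        .option("_checkpointstr", checkpointstr)
        .mode(SaveMode.Append)
        .save(basePath);

If this works as described, the value should then appear as part of the extra metadata in the completed commit file under .hoodie.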