hudi-dev mailing list archives

From Netsanet Gebretsadkan <net22...@gmail.com>
Subject Re: Add checkpoint metadata while using HoodieSparkSQLWriter
Date Thu, 20 Jun 2019 08:53:56 GMT
Dear Vinoth,

I would like to run a performance comparison of upsert and bulk insert, but I
could not find a clean dataset larger than 10 GB.
Would it be possible to get a dataset from the Hudi team? For example, I was
using the stocks data that you provided in your demo. Could I get more GBs of
that dataset for my experiment?
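
For the comparison itself, I am planning to drive both runs through the same
DataSource writer and just switch the write operation, roughly like the sketch
below (based on the writer snippet further down this thread; I am assuming the
"hoodie.datasource.write.operation" option with "bulk_insert" / "upsert" values
is the right switch):

inputDF.write()
    .format("com.uber.hoodie")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    // run once with "bulk_insert" and once with "upsert" to compare timings
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode(SaveMode.Append)
    .save(basePath);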

Thanks for your consideration.

Kind regards,

On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vinoth@apache.org> wrote:

> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
>
> Just circling back with the resolution on the mailing list as well.
>
> On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <net22geb@gmail.com>
> wrote:
>
> > Dear Vinoth,
> >
> > Thanks for your fast response.
> > I have created a new issue, "Performance Comparison of
> > HoodieDeltaStreamer and DataSourceAPI" (#714), with screenshots of the
> > Spark UI, which can be found at the following link:
> > https://github.com/apache/incubator-hudi/issues/714.
> > In the UI, it seems that ingestion with the DataSource API is spending
> > much of its time in the countByKey of HoodieBloomIndex and the workload
> > profile. Looking forward to receiving insights from you.
> >
> > Kind regards,
> >
> >
> > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vinoth@apache.org> wrote:
> >
> > > Hi,
> > >
> > > Both the datasource and the deltastreamer use the same APIs underneath, so
> > > I'm not sure. If you can grab screenshots of the Spark UI for both and
> > > open a ticket, I'd be glad to take a look.
> > >
> > > On 2, one of the goals of Hudi is to break this dichotomy and enable a
> > > streaming style of processing (I call it incremental processing) even in
> > > a batch job. MOR is in production at Uber. At the moment MOR is lacking
> > > just one feature (incremental pull using log files) that Nishith is
> > > planning to merge soon. PR #692 enables the Hudi DeltaStreamer to ingest
> > > continuously while managing compaction etc. in the same job. I have
> > > already knocked off some index performance problems and am working on
> > > indexing the log files, which should unlock near-real-time ingest.
> > >
> > > Putting all these together, within a month or so the near-real-time MOR
> > > vision should be very real. Of course, we need community help with dev
> > > and testing to speed things up. :)
> > >
> > > Hope that gives you a clearer picture.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <net22geb@gmail.com>
> > > wrote:
> > >
> > > > Thanks, Vinoth
> > > >
> > > > It's working now, but I have 2 questions:
> > > > 1. The ingestion latency of the DataSource API with the
> > > > HoodieSparkSQLWriter is high compared to the delta streamer. Why is it
> > > > slow? Are there specific options we could set to minimize the ingestion
> > > > latency?
> > > >    For example, when I run the delta streamer it takes about 1 minute
> > > > to insert some data. If I use the DataSource API with
> > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize this?
> > > > 2. Where do we categorize Hudi in general (is it batch processing or
> > > > streaming)? I am asking because currently copy-on-write is the only
> > > > mode that is fully working, and since the merge-on-read functionality
> > > > that would enable near-real-time analytics is not fully done, can we
> > > > consider Hudi a batch job?
> > > >
> > > > Kind regards,
> > > >
> > > >
> > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vinoth@apache.org>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Short answer: by default, any parameter you pass in using option(k, v)
> > > > > or options() whose key begins with "_" will be saved to the commit
> > > > > metadata. You can change the "_" prefix to something else by using
> > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > The reason you are not seeing the checkpointstr inside the commit
> > > > > metadata is that the value of that option is only the prefix used to
> > > > > pick out such commit-metadata keys, not a metadata entry itself:
> > > > >
> > > > > val metaMap = parameters.filter(kv =>
> > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > >
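> > > > > Concretely, something along these lines should do it (a rough sketch
> > > > > on top of your snippet below; the "_checkpointstr" key name is just an
> > > > > example, any key starting with the configured prefix gets picked up):
> > > > >
> > > > > inputDF.write()
> > > > >     .format("com.uber.hoodie")
> > > > >     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > >     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > > >     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > >     // keep the default "_" prefix, and pass the checkpoint under a key
> > > > >     // that starts with it, so it is copied into the commit metadata
> > > > >     .option("_checkpointstr", checkpointstr)
> > > > >     .mode(SaveMode.Append)
> > > > >     .save(basePath);
> > > > >
> > > > > If you do override COMMIT_METADATA_KEYPREFIX_OPT_KEY() with your own
> > > > > prefix, the checkpoint key needs to start with that prefix instead.
> > > > >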
> > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <net22geb@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> > > > > > dataframe into a hoodie-modeled table. It creates everything
> > > > > > correctly, but I also want to save the checkpoint, and I couldn't,
> > > > > > even though I am passing it as an argument.
> > > > > >
> > > > > > inputDF.write()
> > > > > >     .format("com.uber.hoodie")
> > > > > >     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > > >     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > > > >     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > >     .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > >             checkpointstr)
> > > > > >     .mode(SaveMode.Append)
> > > > > >     .save(basePath);
> > > > > >
> > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > > > > > checkpoint while using the dataframe writer, but I couldn't get the
> > > > > > checkpoint metadata into the .hoodie metadata. Is there a way I can
> > > > > > add the checkpoint metadata while using the dataframe writer API?
> > > > > >
> > > > >
> > > >
> > >
> >
>
