hudi-dev mailing list archives

From Netsanet Gebretsadkan <net22...@gmail.com>
Subject Re: Add checkpoint metadata while using HoodieSparkSQLWriter
Date Mon, 03 Jun 2019 08:00:40 GMT
Thanks, Vinoth

It's working now. But I have two questions:
1. The ingestion latency when using the DataSource API with
HoodieSparkSQLWriter is high compared to using the delta streamer. Why is
it slow? Are there specific options we could set to minimize the
ingestion latency?
   For example: when I run the delta streamer, it takes about 1 minute to
insert some data. If I use the DataSource API with HoodieSparkSQLWriter, it
takes 5 minutes. How can we optimize this?
2. How do we categorize Hudi in general (is it batch processing or
streaming)? I am asking this because currently copy-on-write is the mode
that is fully working, and since merge-on-read, which would enable
near-real-time analytics, is not fully done, can we consider Hudi a
batch job?

Kind regards,


On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vinoth@apache.org> wrote:

> Hi,
>
> Short answer: by default, any parameter you pass in using option(k, v) or
> options() whose key begins with "_" will be saved to the commit metadata.
> You can change the "_" prefix to something else via
> DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> The reason you are not seeing checkpointstr inside the commit metadata is
> that this option is just supposed to hold the prefix for all such commit
> metadata keys.
>
> val metaMap = parameters.filter(kv =>
> kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
>
> On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <net22geb@gmail.com>
> wrote:
>
> > I am trying to use HoodieSparkSQLWriter to upsert data from a
> > dataframe into a Hudi-modeled table. It creates everything correctly,
> > but I also want to save the checkpoint, and I couldn't, even though I
> > am passing it as an argument.
> >
> > inputDF.write()
> >   .format("com.uber.hoodie")
> >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> >   .mode(SaveMode.Append)
> >   .save(basePath);
> >
> > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > checkpoint while using the dataframe writer, but I couldn't add the
> > checkpoint metadata into the .hoodie metadata. Is there a way I can
> > add the checkpoint metadata while using the dataframe writer API?
> >
>
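For reference, the prefix-filtering behavior Vinoth describes can be sketched in plain Java. This is only an illustration of the idea (the class name, the key constant's value, and the "_checkpoint" option are my own stand-ins, not Hudi's actual internals): the writer collects every option whose key starts with the configured prefix into the commit metadata, while the prefix option itself only names the prefix.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CommitMetadataFilter {
    // Hypothetical stand-in for DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY()
    static final String COMMIT_METADATA_KEYPREFIX_OPT_KEY =
            "hoodie.datasource.write.commitmeta.key.prefix";

    // Keep every writer option whose key starts with the configured prefix,
    // mirroring the Scala filter quoted above.
    static Map<String, String> extractCommitMetadata(Map<String, String> parameters) {
        String prefix = parameters.get(COMMIT_METADATA_KEYPREFIX_OPT_KEY);
        return parameters.entrySet().stream()
                .filter(e -> e.getKey().startsWith(prefix))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        // The prefix option holds "_" -- the prefix, not the checkpoint itself.
        options.put(COMMIT_METADATA_KEYPREFIX_OPT_KEY, "_");
        // The checkpoint rides along as a separate "_"-prefixed option
        // ("_checkpoint" and its value are illustrative).
        options.put("_checkpoint", "topic,0:1234");
        // Not prefixed with "_", so it is excluded from commit metadata.
        options.put("hoodie.table.name", "my_table");

        Map<String, String> meta = extractCommitMetadata(options);
        System.out.println(meta.containsKey("_checkpoint"));       // true
        System.out.println(meta.containsKey("hoodie.table.name")); // false
    }
}
```

In other words, the fix for the original question is to set COMMIT_METADATA_KEYPREFIX_OPT_KEY to a prefix such as "_" and pass the checkpoint string under its own prefixed key, rather than passing checkpointstr as the value of the prefix option.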
