hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] bwu2 edited a comment on issue #1328: Hudi upsert hangs
Date Thu, 13 Feb 2020 02:05:55 GMT
bwu2 edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585512613
 
 
   @vinothchandar Thanks for taking the time to reply!
   
   Let me describe the simplest example of this problem on a tiny COW data set: create a data
frame with 4m rows and one column holding the values 1, 2, 3, ..., 4m. Bulk insert that
into Hudi (using the one column as the `recordkey`). This takes ~1 minute to run and the data
size is about 30MB. Now upsert the same data frame into the table a second time. This takes
more than 2 hours to run.
   
   Alternatively, if we upsert a new data frame with the values 4000001 to 8m (still 4m rows
upserted, but none of the keys already exist in the table), this takes ~1 minute to run.
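
   A minimal PySpark sketch of the reproduction described above might look like the following.
It is an illustration, not the code that was actually run. The table name, the base path, the
`id` column name (the default for `spark.range`), the reuse of the key as the precombine field,
and the helper names are all assumptions. Partition and key generator settings are not stated
in this comment and may need to be added depending on the Hudi version.

```python
import time


def hudi_write_options(table_name, operation, record_key="id"):
    """Build the Hudi datasource options for a single write (illustrative only)."""
    return {
        "hoodie.table.name": table_name,
        # "bulk_insert" for the initial load, "upsert" for the second write.
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        # Assumption: with only one column, the key doubles as the precombine field.
        "hoodie.datasource.write.precombine.field": record_key,
    }


def key_bounds(scenario, n_rows=4_000_000):
    """Return the inclusive (first_key, last_key) of the frame that gets upserted."""
    if scenario == "same_keys":   # slow case: every key already exists in the table
        return 1, n_rows
    if scenario == "new_keys":    # fast case: every key is new
        return n_rows + 1, 2 * n_rows
    raise ValueError(f"unknown scenario: {scenario}")


def timed_write(df, base_path, options, mode):
    """Write df to the Hudi table at base_path and return the elapsed seconds."""
    start = time.perf_counter()
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(base_path)
    return time.perf_counter() - start


def run_repro(spark, base_path, scenario, table_name="upsert_repro", n_rows=4_000_000):
    """Bulk insert keys 1..n_rows, then upsert the frame chosen by `scenario`."""
    # spark.range excludes the upper bound, hence the +1.
    initial = spark.range(1, n_rows + 1)
    t_insert = timed_write(
        initial, base_path, hudi_write_options(table_name, "bulk_insert"), "overwrite"
    )
    first, last = key_bounds(scenario, n_rows)
    t_upsert = timed_write(
        spark.range(first, last + 1),
        base_path,
        hudi_write_options(table_name, "upsert"),
        "append",
    )
    return t_insert, t_upsert
```

   With an existing Spark session that has the Hudi bundle on its classpath, calling
`run_repro(spark, "/tmp/hudi_upsert_repro", "same_keys")` and then the same call with
`"new_keys"` would give the two timings to compare.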
   
   To answer your other queries: 
   * almost all of the time is spent in the `HoodieSparkSqlWriter` job and, within that
job, in the `count at HoodieSparkSqlWriter.scala` stage (the BloomIndex parts run quickly).
   * it seems highly unlikely to be a resource constraint issue with such a small example.
   
   Shall I raise a Jira for this? Or is this the expected behavior for such a workload?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
