hudi-commits mailing list archives

From GitBox <>
Subject [GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs
Date Thu, 13 Feb 2020 01:27:16 GMT
vinothchandar commented on issue #1328: Hudi upsert hangs
   Reposting my response here.
   There seem to be a lot of common concerns here.
is a useful resource that can hopefully help here.
   A few high-level thoughts:
   It would be good to lay out whether most of the time is spent in the indexing stages (the ones tagged
with HoodieBloomIndex) or in the actual writing.
   Hudi does keep the input in memory to compute the stats it needs to size files. So if you
don't provide sufficient executor/RDD storage memory, it will spill, which can cause slowdowns
(this is covered in the tuning guide, and I have seen it happen with users often).
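As a rough sketch of where executor and storage memory are set, the snippet below renders Spark memory settings as spark-submit arguments. The property names are standard Spark settings; the 8g value and the to_submit_args helper are hypothetical, and the two fractions shown are simply Spark's documented defaults, not values recommended in this thread.

```python
# Hypothetical starting point; tune the values for your own cluster.
spark_conf = {
    "spark.executor.memory": "8g",          # assumed value, not from the thread
    "spark.memory.fraction": "0.6",         # Spark's documented default
    "spark.memory.storageFraction": "0.5",  # Spark's documented default
}


def to_submit_args(conf):
    """Render a conf dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(spark_conf)))
```

If the Spark UI shows heavy spill during the stages that cache the input, raising executor memory is usually the first thing to try.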
   On the workload pattern itself, BloomIndex range pruning can be turned off
if the key ranges are random anyway. Generally speaking, until we have RFC-8 (record level
indexing), cases of random writes or upserting a majority of the rows in a table may incur bloom
index overhead, since the bloom filters/ranges are not at all useful in pruning out files.
We have an interim solution coming out in the next release: falling back to a plain old join
to implement the indexing.
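A minimal sketch of turning range pruning off through writer options follows. The comment above does not name the setting; the key hoodie.bloom.index.prune.by.ranges is assumed here from the Hudi configuration reference, and the helper name is hypothetical, so check both against your Hudi version.

```python
def with_range_pruning_disabled(options):
    """Return a copy of the Hudi writer options with bloom range pruning off.

    Assumption: 'hoodie.bloom.index.prune.by.ranges' is the relevant key;
    it is not spelled out in the original comment.
    """
    updated = dict(options)
    updated["hoodie.bloom.index.prune.by.ranges"] = "false"
    return updated


# Hypothetical base options for an upsert job.
base_options = {"hoodie.datasource.write.operation": "upsert"}
hudi_options = with_range_pruning_disabled(base_options)
```

The helper copies the dict so the base options can be reused for jobs whose keys are ordered and still benefit from range pruning.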
   In terms of MOR and COW, MOR will help only if you have lots of updates and the bottleneck
is on the writing.
   If listing is an issue, please turn on the following so the table is listed once and we re-use
the filesystem metadata: hoodie.embed.timeline.server=true
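For illustration, the setting above can be passed along with the other writer options. This is only a sketch: the setting name and value come from the comment, while the helper name and the commented-out write call (including the df variable, base_path, and the format string) are assumptions.

```python
def with_embedded_timeline_server(options):
    """Return a copy of the writer options with the embedded timeline server on."""
    updated = dict(options)
    updated["hoodie.embed.timeline.server"] = "true"
    return updated


writer_options = with_embedded_timeline_server({})

# Hypothetical usage with an existing Spark DataFrame `df`:
# df.write.format("org.apache.hudi").options(**writer_options).mode("append").save(base_path)
```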
   I would appreciate a JIRA, so that I can break each item into a sub-task and tackle/resolve them independently.
   I am personally focusing on performance now and want to make it a lot faster in the 0.6.0 release.
So all this help would be deeply appreciated.
