hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1328: Hudi upsert hangs
Date Thu, 13 Feb 2020 01:28:15 GMT
vinothchandar edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585503340
 
 
   Reposting my response here.
   
   There seem to be a lot of common concerns here. The tuning guide at https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
is a useful resource that should help.
   
   A few high-level thoughts:
   - It would be good to lay out whether most of the time is spent in the indexing stages (the ones tagged
with HoodieBloomIndex) or in the actual writing.
   - Hudi keeps the input in memory to compute the stats it needs to size files. So if
you don't provide sufficient executor/RDD storage memory, it will spill and can cause slowdowns
(this is covered in the tuning guide, and we have seen it happen with users often).
   - On the workload pattern itself, BloomIndex range pruning can be turned off (https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges)
if the key ranges are random anyway. Generally speaking, until we have RFC-8 (record level
indexing), workloads that randomly write/upsert the majority of the rows in a table may incur bloom
index overhead, since the bloom filters/ranges are not at all useful for pruning out files.
We have an interim solution coming out in the next release: falling back to a plain old join
to implement the indexing.
   - In terms of MOR vs. COW, MOR will help only if you have lots of updates and the bottleneck
is in the writing.
   - If listing is an issue, please turn on the following so the table is listed once and we
re-use the filesystem metadata: hoodie.embed.timeline.server=true
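
   To make the two config switches above concrete, here is a minimal sketch that collects them
into a dict of writer options. The helper name and its two flags are hypothetical, made up
for illustration. The key hoodie.embed.timeline.server is quoted from the point above; the key
hoodie.bloom.index.prune.by.ranges is my assumption for the property behind the
bloomIndexPruneByRanges docs anchor, so please check it against the configurations page for your release.

```python
def hudi_upsert_tuning_options(random_keys: bool, slow_listing: bool) -> dict:
    """Hypothetical helper: build extra Hudi writer options for the switches discussed above."""
    options = {}
    if random_keys:
        # Key ranges are random, so min/max range pruning cannot rule out files.
        # Assumed property name for the bloomIndexPruneByRanges setting.
        options["hoodie.bloom.index.prune.by.ranges"] = "false"
    if slow_listing:
        # List the table once and re-use the filesystem metadata.
        options["hoodie.embed.timeline.server"] = "true"
    return options


opts = hudi_upsert_tuning_options(random_keys=True, slow_listing=True)
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

   With PySpark, a dict like this could be merged into the existing write call via
.options(**opts) alongside the usual Hudi datasource options; that part is left out here since
it depends on your table setup.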
   
   I would appreciate a JIRA, so that I can break each of these into a sub-task and tackle/resolve them independently.
   
   
   I am personally focussing on performance now and want to make it a lot faster in the 0.6.0 release.
So all this help would be deeply appreciated.

