hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] umehrot2 commented on issue #1371: [SUPPORT] Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing
Date Thu, 05 Mar 2020 00:12:34 GMT
umehrot2 commented on issue #1371: [SUPPORT] Upsert for S3 Hudi dataset with large partitions
takes a lot of time in writing
URL: https://github.com/apache/incubator-hudi/issues/1371#issuecomment-594955784
 
 
   @vinothchandar this is exactly what I was talking about. It easily becomes a bottleneck:
the driver spends time filtering the files it gets from `InMemoryFileIndex`, and that
filtering is not distributed. My suggestion is that, at ingestion time, we just return
an `EmptyRelation` once **HoodieSparkSqlWriter** has done its job. Right now we end up
creating a relation through the parquet data source even at write time, which is not
necessary for our use case. I have been testing this internally for the past week.
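
A minimal sketch of the suggestion above, assuming the write path goes through Spark's `CreatableRelationProvider.createRelation`. The names `EmptyRelation` (as written here), `SketchSource`, and `writeDataset` are hypothetical placeholders for illustration and are not Hudi's actual classes; only the Spark interfaces (`BaseRelation`, `TableScan`, `CreatableRelationProvider`) are real. It requires Spark SQL on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation: it carries a schema but never lists or reads files,
// so the driver does no file listing or filtering when it is constructed.
class EmptyRelation(override val sqlContext: SQLContext,
                    override val schema: StructType)
  extends BaseRelation with TableScan {

  // Scanning this relation always yields zero rows.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}

// Hypothetical data source showing where the empty relation would be returned.
class SketchSource extends CreatableRelationProvider {

  override def createRelation(sqlContext: SQLContext,
                              mode: SaveMode,
                              parameters: Map[String, String],
                              data: DataFrame): BaseRelation = {
    // Placeholder for the real write (in Hudi, the work done by HoodieSparkSqlWriter).
    writeDataset(sqlContext, mode, parameters, data)

    // Return a relation that does not touch the file system, instead of
    // building a parquet-backed relation over the files just written.
    new EmptyRelation(sqlContext, data.schema)
  }

  // Assumption: stands in for the actual ingestion logic; does nothing here.
  private def writeDataset(sqlContext: SQLContext,
                           mode: SaveMode,
                           parameters: Map[String, String],
                           data: DataFrame): Unit = ()
}
```

Spark's `DataFrameWriter.save()` does not hand the relation returned from `createRelation` back to the caller, so for a pure write an empty relation should be enough; whether this holds for every Hudi code path is an assumption of this sketch.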

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
