hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] satishkotha edited a comment on issue #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction
Date Thu, 12 Mar 2020 20:44:19 GMT
satishkotha edited a comment on issue #1396: [HUDI-687] Stop incremental reader on RO table
before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#issuecomment-598409536
 
 
   > > if compaction at t2 takes a long time, incremental reads using HoodieParquetInputFormat
may make progress to read commits at t3
   > 
   > IIUC this is because you are incremental pulling from the parquet only table? I thought
we can already incremental pull via logs. no? cc @n3nash .. is this really needed since it
will add complexity to the system..
   > 
   > Eventually, I would like incremental query/pull on MOR to be just based on logs..
   
   Based on the view type, Hudi decides which input format to use (see https://github.com/apache/incubator-hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java#L91
and line 143). For RO views, we use HoodieParquetInputFormat, which does not read log files.
For RT views, we use HoodieParquetRealtimeInputFormat, which reads the file slice, including log files.
In my limited testing, incremental reads on RT views also do not work well (we see duplicates
after compaction under some conditions). @bvaradar is working on fixing any broken windows
in the support for incremental reads on RT views.
   
   We wanted to include this change to support RO views (which are the majority of use cases
for us). I agree with you that this adds complexity. I added more tests than usual
because of that.
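
To make the intended behavior concrete, here is a minimal sketch of the cutoff, not the actual change in this PR. It assumes a simplified timeline model: `Instant` and `visible_instants` are hypothetical names, and commit times are assumed to sort lexicographically. The idea is that an incremental reader on the RO view only sees completed instants that come before the earliest pending compaction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instant:
    """Hypothetical, simplified stand-in for a timeline instant."""
    timestamp: str   # commit time; assumed to sort lexicographically
    action: str      # e.g. "commit", "deltacommit", "compaction"
    completed: bool  # False means the instant is still requested/inflight


def visible_instants(timeline: List[Instant], begin_time: str) -> List[str]:
    """Return completed commit times after begin_time, stopping before the
    earliest pending compaction (if there is one)."""
    pending = [i.timestamp for i in timeline
               if i.action == "compaction" and not i.completed]
    cutoff = min(pending) if pending else None

    visible = []
    for instant in sorted(timeline, key=lambda i: i.timestamp):
        if not instant.completed or instant.timestamp <= begin_time:
            continue
        if cutoff is not None and instant.timestamp >= cutoff:
            break  # do not read past a compaction that has not finished
        visible.append(instant.timestamp)
    return visible


# Compaction at t2 is still pending, so the reader must not advance to t3.
timeline = [
    Instant("t1", "commit", True),
    Instant("t2", "compaction", False),
    Instant("t3", "deltacommit", True),
]
print(visible_instants(timeline, "t0"))
```

With the compaction at t2 pending, the reader only gets t1. Once t2 completes, the same call returns t1, t2 and t3.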
   
   Other alternatives I can think of:
   1) Support incremental reads only for RT views. Incremental reads on RO can fail or use
RT (is this your proposal in the above comment?)
   2) Instead of doing incremental reads based on hoodie commit time, use parquet file creation
times. This approach requires substantial changes and would likely break some fundamental
assumptions.
   
   Also, at a high level, I want to discuss adding an additional mode for incremental reads.
Today, it is the responsibility of Hudi users to save commit times and use them for the next incremental
read. Can we add a 'kafka consumer' model, where a consumer only specifies its unique id and Hudi
tracks read progress (perhaps as part of consolidated metadata?). This would simplify usage
and make debugging a lot easier.
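
As a rough illustration of the 'kafka consumer' idea, here is a minimal in-memory sketch. Everything in it is hypothetical: `IncrementalReadTracker`, `poll` and `ack` are not existing Hudi APIs, and a real version would need to persist progress (for example in table metadata). The consumer only supplies an id, and the tracker remembers the last commit time that consumer acknowledged.

```python
from typing import Dict, List


class IncrementalReadTracker:
    """Hypothetical tracker that stores read progress per consumer id."""

    def __init__(self) -> None:
        # consumer id -> last acknowledged commit time
        self._last_read: Dict[str, str] = {}

    def poll(self, consumer_id: str, completed_commit_times: List[str]) -> List[str]:
        """Return commit times this consumer has not acknowledged yet."""
        last = self._last_read.get(consumer_id, "")
        return sorted(t for t in completed_commit_times if t > last)

    def ack(self, consumer_id: str, commit_time: str) -> None:
        """Record progress; never move a consumer backwards."""
        if commit_time > self._last_read.get(consumer_id, ""):
            self._last_read[consumer_id] = commit_time


tracker = IncrementalReadTracker()
batch = tracker.poll("job-a", ["001", "002"])
tracker.ack("job-a", batch[-1])
print(tracker.poll("job-a", ["001", "002", "003"]))
```

After job-a acknowledges 002, its next poll only returns 003, while a new consumer id would still start from the beginning.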
    
   FYI, @n3nash is out of office for the next 10 days. @bvaradar can likely share more context.
Let me know if you have other suggestions.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
