hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
Date Wed, 12 Feb 2020 20:06:30 GMT
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378484755
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##########
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
           //read the avro blocks
           while (reader.hasNext()) {
             HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-            // TODO If we can store additional metadata in datablock, we can skip parsing records
-            // (such as startTime, endTime of records in the block)
+            if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   No. In the current implementation, the first block tracks the range for the entire file. In some
cases there are a lot of archived files, and it's much faster to skip an entire file when looking
at older ranges.
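
   As a rough illustration only (this is not the code in the PR), a file-level skip based on the
first block's header could look like the sketch below. The class name, the header keys, and the
`isFileOutOfRange` helper are hypothetical, and the sketch assumes instant times are fixed-width
timestamp strings so that string comparison matches time order.

```java
import java.util.Map;

// Hypothetical stand-in for the range check; not the actual Hudi classes or header keys.
final class ArchiveRangeSketch {
  static final String MIN_KEY = "MIN_INSTANT_TIME"; // assumed header name
  static final String MAX_KEY = "MAX_INSTANT_TIME"; // assumed header name

  /**
   * Returns true when the file's recorded [min, max] range cannot overlap
   * the requested [start, end] range, so the whole file can be skipped.
   */
  static boolean isFileOutOfRange(Map<String, String> firstBlockHeader, String start, String end) {
    String min = firstBlockHeader.get(MIN_KEY);
    String max = firstBlockHeader.get(MAX_KEY);
    if (min == null || max == null) {
      return false; // no range recorded (e.g. an older file): the file must be read
    }
    // Assumes fixed-width timestamp strings, so lexicographic order equals time order.
    return max.compareTo(start) < 0 || min.compareTo(end) > 0;
  }
}
```

   Returning false when the header is missing keeps files written without a range readable, at
the cost of scanning them in full.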
   
   The overhead of storing metadata on every block seemed high. By default, we group
10 records into one block, which translates to about 10KB in size. A header with min/max on every
block adds 40 bytes, so roughly 0.4% overhead, which seemed high to me. Let me know if you think
we can ignore the overhead; I can move this to per block.
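
   The 0.4% figure follows directly from the numbers above. A minimal sketch of the arithmetic,
assuming "10KB" is taken as 10,000 bytes; the class and method names are hypothetical:

```java
// Hypothetical helper that reproduces the overhead estimate from the comment above.
final class HeaderOverheadSketch {
  /** Header size as a percentage of the data it accompanies. */
  static double overheadPercent(long headerBytes, long dataBytes) {
    return 100.0 * headerBytes / dataBytes;
  }

  public static void main(String[] args) {
    long headerBytes = 40;     // min/max header size quoted in the comment
    long blockBytes = 10_000;  // about 10 records per block, assumed 10,000 bytes
    System.out.printf("per-block header overhead: %.2f%%%n",
        overheadPercent(headerBytes, blockBytes));
  }
}
```

   With a single header per file, the same 40 bytes are spread over every block in the file, so
the relative overhead shrinks as the number of blocks grows.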

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
