hudi-commits mailing list archives

From "Vinoth Chandar (Jira)" <>
Subject [jira] [Created] (HUDI-574) CLI counts small file inserts as updates
Date Thu, 23 Jan 2020 18:52:00 GMT
Vinoth Chandar created HUDI-574:

             Summary: CLI counts small file inserts as updates
                 Key: HUDI-574
             Project: Apache Hudi (incubating)
          Issue Type: Bug
          Components: CLI
            Reporter: Vinoth Chandar
             Fix For: 0.6.0

User report:
I'm trying to understand the {{.commit}} output and how it relates to the output from the
{{hudi-cli}} tool, and I'm finding it difficult to reconcile my findings. Specifically, I want
to know the number of updates/inserts/deletes across all partitions for a given commit (an
upsert). From the {{cli}}:
hudi:exec_unit_ver->commit showpartitions --commit 20200108153617 
║ Partition Path │ Total Files Added │ Total Files Updated │ Total Records Inserted │ Total Records Updated │ Total Bytes Written │ Total Errors ║
║ 0              │ 0                 │ 9                   │ 0                      │ 2091                  │ 983.7 MB            │ 0            ║
But in the {{20200108153617.commit}} file for that commit, one of the files in partition
"0" has
      "numInserts" : 44448,
so I'm not sure why {{Total Records Inserted}} is reported as zero. I checked that the sum of
{{numUpdateWrites}} across all files in the partition matches 2091. Generally, I think it
would be helpful to have {{totalRecordsInserted}}, {{totalRecordsUpdated}}, and {{totalRecordsDeleted}}
in the commit metadata (although it's not a big issue to sum the individual numbers from each
file in each partition).
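The manual summation the reporter describes can be sketched roughly as follows. This is only an illustration, not Hudi code: {{FileStat}} and {{CommitTotals}} are hypothetical stand-ins for the per-file entries of a parsed {{.commit}} file, the {{numDeletes}} field is assumed to sit alongside {{numInserts}} and {{numUpdateWrites}}, and apart from 44448 and the 2091 total, the per-file values are made up.

```java
import java.util.List;
import java.util.Map;

public class CommitTotals {
    // Hypothetical stand-in for one per-file entry in the commit metadata.
    public static final class FileStat {
        final long numInserts;
        final long numUpdateWrites;
        final long numDeletes; // assumed field name

        public FileStat(long numInserts, long numUpdateWrites, long numDeletes) {
            this.numInserts = numInserts;
            this.numUpdateWrites = numUpdateWrites;
            this.numDeletes = numDeletes;
        }
    }

    // Sums the per-file counters over every file of every partition.
    // Returns {inserted, updated, deleted}.
    public static long[] sum(Map<String, List<FileStat>> partitionToStats) {
        long inserted = 0, updated = 0, deleted = 0;
        for (List<FileStat> files : partitionToStats.values()) {
            for (FileStat f : files) {
                inserted += f.numInserts;
                updated += f.numUpdateWrites;
                deleted += f.numDeletes;
            }
        }
        return new long[] {inserted, updated, deleted};
    }

    public static void main(String[] args) {
        // Illustrative values: one file carries the 44448 inserts from the
        // report; the split of the 2091 updates across files is invented.
        Map<String, List<FileStat>> stats = Map.of(
            "0", List.of(new FileStat(44448, 100, 0), new FileStat(0, 1991, 0)));
        long[] totals = sum(stats);
        System.out.println("inserted=" + totals[0]
            + " updated=" + totals[1] + " deleted=" + totals[2]);
    }
}
```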
On the counts: when I checked the code, it's counting the inserts as updates, since Hudi packed
them onto existing files to honor the target file size:
for (HoodieWriteStat stat : stats) {
  if (stat.getPrevCommit().equals(HoodieWriteStat.NULL_COMMIT)) {
    // File has no previous commit: counted as a newly added file
    totalFilesAdded += 1;
    totalRecordsInserted += stat.getNumWrites();
  } else {
    // File existed before: only update writes are counted, inserts are not
    totalFilesUpdated += 1;
    totalRecordsUpdated += stat.getNumUpdateWrites();
  }
  totalBytesWritten += stat.getTotalWriteBytes();
  totalWriteErrors += stat.getTotalWriteErrors();
}
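To make the discrepancy concrete, here is a minimal, self-contained sketch that contrasts the per-file classification quoted above with one possible alternative that reads per-record counters directly. It is not the Hudi implementation and not necessarily the fix adopted for this issue: {{WriteStat}} is a hypothetical stand-in for {{HoodieWriteStat}}, the {{NULL_COMMIT}} value, the previous commit time, and the {{numWrites}} figure are assumptions, and the single file carrying all 2091 updates is a simplification of the reported partition.

```java
import java.util.List;

public class InsertUpdateCounts {
    // Stand-in for HoodieWriteStat.NULL_COMMIT (value assumed here).
    static final String NULL_COMMIT = "null";

    // Hypothetical stand-in for the fields of HoodieWriteStat used below.
    public static final class WriteStat {
        final String prevCommit;
        final long numWrites;
        final long numInserts;
        final long numUpdateWrites;

        public WriteStat(String prevCommit, long numWrites,
                         long numInserts, long numUpdateWrites) {
            this.prevCommit = prevCommit;
            this.numWrites = numWrites;
            this.numInserts = numInserts;
            this.numUpdateWrites = numUpdateWrites;
        }
    }

    // Mirrors the quoted logic: a file is either "added" or "updated",
    // and inserts packed into an existing file are never counted.
    // Returns {recordsInserted, recordsUpdated}.
    public static long[] countByFile(List<WriteStat> stats) {
        long inserted = 0, updated = 0;
        for (WriteStat s : stats) {
            if (s.prevCommit.equals(NULL_COMMIT)) {
                inserted += s.numWrites;
            } else {
                updated += s.numUpdateWrites;
            }
        }
        return new long[] {inserted, updated};
    }

    // Alternative sketch: trust the per-record counters on every file,
    // regardless of whether the file is new.
    public static long[] countByRecord(List<WriteStat> stats) {
        long inserted = 0, updated = 0;
        for (WriteStat s : stats) {
            inserted += s.numInserts;
            updated += s.numUpdateWrites;
        }
        return new long[] {inserted, updated};
    }

    public static void main(String[] args) {
        // An existing file that received small-file inserts plus updates.
        List<WriteStat> stats = List.of(
            new WriteStat("20200101000000", 50000, 44448, 2091));
        long[] byFile = countByFile(stats);
        long[] byRecord = countByRecord(stats);
        System.out.println("per-file:   inserted=" + byFile[0] + " updated=" + byFile[1]);
        System.out.println("per-record: inserted=" + byRecord[0] + " updated=" + byRecord[1]);
    }
}
```

With these inputs, the per-file variant reports zero inserts, matching the CLI output in the report, while the per-record variant surfaces the 44448 inserts.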

This message was sent by Atlassian Jira
