hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] utk-spartan opened a new issue #1384: [SUPPORT] Hudi datastore missing updates for many records
Date Sat, 07 Mar 2020 19:42:26 GMT
utk-spartan opened a new issue #1384: [SUPPORT] Hudi datastore missing updates for many records
URL: https://github.com/apache/incubator-hudi/issues/1384
 
 
   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues)
directly.
   
   **Describe the problem you faced**
   
   Overview of the flow:
   MySQL (Maxwell) -> Kafka -> Spark preprocessing (sorting, dedup, etc.) -> Hudi upsert
via the Spark datasource writer (with Hive sync)
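   
   For concreteness, a minimal sketch of the final upsert stage, assuming placeholder
record-key/precombine fields ("id", "ts") and table name ("my_table") rather than our
real schema:
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Upsert one preprocessed micro-batch into the Hudi table on S3.
   // "id" (record key), "ts" (precombine field) and "my_table" are
   // placeholders, not our actual schema.
   def upsertBatch(df: DataFrame, basePath: String): Unit = {
     df.write
       .format("org.apache.hudi")
       .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert")
       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
       .option(HoodieWriteConfig.TABLE_NAME, "my_table")
       .mode(SaveMode.Append)
       .save(basePath)
   }
   ```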
   
   Hudi tables in S3 are missing updates for some records.
   
   
   To pinpoint the issue in our flow, we are writing the dataframe to S3 after each stage.
We observed that all of the updates are present in the dataframe on which the Hudi datasource
writer is called, but only some of those updates are applied to the data in the Hudi table.
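   
   A hedged sketch of that per-stage check (`inputDf`, `basePath`, and the "id"/"ts"
columns are stand-ins for our actual batch, table path, and schema):
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}

   // List input (key, ts) pairs that never made it into the Hudi table.
   // Hudi 0.5.x reads need a glob over the partition paths.
   def missedUpdates(spark: SparkSession, inputDf: DataFrame, basePath: String): DataFrame = {
     val written = spark.read
       .format("org.apache.hudi")
       .load(basePath + "/*/*")
     inputDf.select("id", "ts").except(written.select("id", "ts"))
   }
   ```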
   
   We were initially using 0.4.7; we have since upgraded to Hudi 0.5.1 and recreated the
entire Hudi table, but the issue still persists.
   
   The record count matches exactly, but we are not sure whether inserts are also being
dropped: since everything is treated as an upsert, any one of the captured update events
for a record will create its entry. We are currently analyzing our data for this scenario.
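   
   For context on why dedup matters here (an unconfirmed hypothesis on our side): within
one batch, Hudi's precombine keeps a single record per key based on the precombine field,
so our Spark-side dedup must agree with it. A sketch of such a dedup, with placeholder
columns "id" and "ts":
   
   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.{col, row_number}

   // Keep only the newest event per key, mirroring a precombine-style dedup.
   // If two events share the same "ts", which one survives is arbitrary,
   // and the other update is silently lost.
   def latestPerKey(events: DataFrame): DataFrame =
     events
       .withColumn("rn", row_number().over(
         Window.partitionBy("id").orderBy(col("ts").desc)))
       .filter(col("rn") === 1)
       .drop("rn")
   ```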
   
   The records with inconsistent updates don't seem to follow any pattern in terms of
table size or batch size.
   Upon replaying the batch, some of these missed updates are applied, i.e. only some
arbitrary percentage of the updates is applied each time the batch is processed.
   
   We will dig further into the Hudi code and try to find a way to replicate this in a
non-S3 environment.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   We are currently not able to reproduce this behaviour reliably in our dev environment;
we will update here.
   
   **Expected behavior**
   
   Both updates and inserts should be 100% consistent with the source database.
   
   **Environment Description**
   
   * Hudi version : 0.5.1
   
   * Spark version : 2.4.0
   
   * Hive version : 2.3.0
   
   * Hadoop version : 2.6.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Logs generated by Hudi and the AWS SDK for S3 contain no WARN- or ERROR-level statements,
and nothing out of the ordinary at INFO level.
   
   **Config params for datasource writer**
   ```
   DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert"
   "hoodie.bulkinsert.shuffle.parallelism", "100"
   "hoodie.upsert.shuffle.parallelism", "100"
   "hoodie.insert.shuffle.parallelism", "100"
   HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 256 * 1024 * 1024
   HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 64 * 1024 * 1024
   HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, 2
   HIVE_SYNC_ENABLED_OPT_KEY, true
   <hive sync related opts>
   PARQUET_COMPRESSION_CODEC, "uncompressed"
   ```
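   
   For clarity, the same options assembled into a single map (a sketch; the
`<hive sync related opts>` are omitted as above, and numeric values are stringified):
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig}

   // The options listed above, gathered into one map for df.write.options(...).
   // Hive-sync options are elided, as in the list above.
   val writeOpts: Map[String, String] = Map(
     OPERATION_OPT_KEY -> "upsert",
     "hoodie.bulkinsert.shuffle.parallelism" -> "100",
     "hoodie.upsert.shuffle.parallelism" -> "100",
     "hoodie.insert.shuffle.parallelism" -> "100",
     HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES -> (256L * 1024 * 1024).toString,
     HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES -> (64L * 1024 * 1024).toString,
     HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP -> "2",
     HIVE_SYNC_ENABLED_OPT_KEY -> "true",
     HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "uncompressed"
   )
   ```
   
   This map would then be applied via `df.write.format("org.apache.hudi").options(writeOpts)`,
as in the upsert sketch above.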
   
   **Stacktrace**
   
   ```N/A: no exceptions or error-level log statements were observed (see Additional context).```
   
   

