From: GitBox
To: commits@hudi.apache.org
Subject: [GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema
Date: Wed, 25 Mar 2020 09:06:38 -0000

umehrot2 commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r397699488

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestHoodieAvroUtils.java
 ##########
 @@ -57,4 +60,16 @@ public void testPropsPresent() {
      }
      Assert.assertTrue("column pii_col doesn't show up", piiPresent);
    }
 +
 +  @Test
 +  public void testDefaultValue() {
 +    GenericRecord rec = new GenericData.Record(new Schema.Parser().parse(EXAMPLE_SCHEMA));
 +    rec.put("_row_key", "key1");
 +    rec.put("non_pii_col", "val1");
 +    rec.put("pii_col", "val2");
 +    rec.put("timestamp", 3.5);

 Review comment:
   My bad, I was thinking only from the point of view of the DataSource writer (`HoodieSparkSqlWriter`), where the schema is determined automatically from the `DataFrame` and converted to an Avro schema. I missed that `DeltaStreamer` uses the `schema provider`, which users can pass directly to the `HoodieWriteClient`. Thanks for the details!

   I have a question about the schema evolution example you provided. The `rewriteRecord()` you are testing here uses the schema from the old record, and rewrites by setting only the fields found in the old schema. So if you rewrite records R1 and R2, their schema will not have the new `col1` field, right? Hence, your code for populating default values will not get executed, because `col1` is not present in the old schema's fields.

   It seems this test case works because you are not evolving the schema here: your old and new records both have the same schema. But if your old record's schema is different, I think you will run into the same issue. Am I missing something here?
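
   To make the concern concrete, here is a toy sketch in plain Java. It does not use the Avro or Hudi APIs: maps stand in for schemas and records, and every name in it (`RewriteSketch`, `rewriteFromOldFields`, `rewriteFromNewFields`, the `default_col1` value) is hypothetical. It only illustrates the difference I am assuming between iterating over the old schema's fields and iterating over the new schema's fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: plain maps stand in for Avro schemas and records.
// Every name below is hypothetical; none of this is the Hudi or Avro API.
public class RewriteSketch {

    // A "schema" maps each field name to its default value (null = no default).
    static Map<String, Object> schema(Object... nameDefaultPairs) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameDefaultPairs.length; i += 2) {
            fields.put((String) nameDefaultPairs[i], nameDefaultPairs[i + 1]);
        }
        return fields;
    }

    // Walks the OLD schema's fields, so a field that exists only in the
    // new schema is never visited and its default is never applied.
    static Map<String, Object> rewriteFromOldFields(Map<String, Object> record,
                                                    Map<String, Object> oldSchema,
                                                    Map<String, Object> newSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : oldSchema.keySet()) {
            Object value = record.get(name);
            out.put(name, value != null ? value : newSchema.get(name));
        }
        return out;
    }

    // Walks the NEW schema's fields, so missing values fall back to defaults.
    static Map<String, Object> rewriteFromNewFields(Map<String, Object> record,
                                                    Map<String, Object> newSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : newSchema.entrySet()) {
            Object value = record.get(field.getKey());
            out.put(field.getKey(), value != null ? value : field.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Old schema has no col1; the evolved schema adds it with a default.
        Map<String, Object> oldSchema = schema("_row_key", null);
        Map<String, Object> newSchema = schema("_row_key", null, "col1", "default_col1");

        // R1 was written with the old schema, so it carries no col1 value.
        Map<String, Object> r1 = new LinkedHashMap<>();
        r1.put("_row_key", "key1");

        System.out.println(rewriteFromOldFields(r1, oldSchema, newSchema)); // {_row_key=key1}
        System.out.println(rewriteFromNewFields(r1, newSchema));            // {_row_key=key1, col1=default_col1}
    }
}
```

   In this model the first rewrite never sees `col1`, which is the behavior I suspect for records whose schema predates the new field. A test that uses the same schema for the old and new record would not exercise that path.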