From dev-return-106118-archive-asf-public=cust-asf.ponee.io@kafka.apache.org Mon Jul 29 16:30:21 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 3D99A18063F for ; Mon, 29 Jul 2019 18:30:21 +0200 (CEST) Received: (qmail 98348 invoked by uid 500); 29 Jul 2019 16:30:16 -0000 Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@kafka.apache.org Delivered-To: mailing list dev@kafka.apache.org Received: (qmail 98329 invoked by uid 99); 29 Jul 2019 16:30:15 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 29 Jul 2019 16:30:15 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 06EB1C04F3 for ; Mon, 29 Jul 2019 16:30:15 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.801 X-Spam-Level: * X-Spam-Status: No, score=1.801 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=confluent.io Received: from mx1-he-de.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id kbGSRwt1R41j for ; Mon, 29 Jul 2019 16:30:12 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2607:f8b0:4864:20::b44; helo=mail-yb1-xb44.google.com; envelope-from=almog@confluent.io; receiver= Received: from mail-yb1-xb44.google.com (mail-yb1-xb44.google.com [IPv6:2607:f8b0:4864:20::b44]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id E14707DC04 for ; Mon, 29 Jul 2019 16:30:11 +0000 (UTC) Received: by mail-yb1-xb44.google.com with SMTP id q5so10028734ybp.1 for ; Mon, 29 Jul 2019 09:30:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=confluent.io; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=G2IaJ6ut6ildhoVtBXjBEV0x3RMcEf8I6DZJPL+wQwo=; b=al9BwExZZOU/r7F20pQOe2TDD80M/XPvVTqySVW0ZwjkbcHOuSqkrBiRDo2yxKe5u/ F+JK2Sdv/S6K2E3h2W4eP4dQF2i9Vu9VJ4eUuEzv1ImnuxpGqpnBRYI2nJzRenh6Y6SQ XtclDK68x40c+AwHTM5SMUpC+jBJfpwgnPdmg4ohZ/nGc2CI0ylBZ82ni5jgrNVflUsu KUcVkCrRCj74EEhFgfNy5p2G0zj2mnXWLn0uTW30GyEJGqQU5x+WDQEYp0uYdCByQdYi 54TlmaYVJ9uh7EZellYSdA4gMVsbpeYkKhriL0Gqq0Ppw+H4cmUizT00JOePaUXxXq+p V2DA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=G2IaJ6ut6ildhoVtBXjBEV0x3RMcEf8I6DZJPL+wQwo=; b=N/irTF8vLsaDW60GxKjNapZQBv9FSgTJJHHoL4vk3+fO0/JnDsJD2DuL/6Lujk+m35 bqEL/lb+X0XIa7637lRKODazJ4/GKS4gxlGcS/DzrZwT26G2/kULUgakP3TapDKkLTPF hDx9SsIty8/iShCfSM5uxYmTgXd5qxTnklGTmrwoCsVb181SNl54dJQd7eFr8Mq6VqL2 oDJCFc8WPLr6lYDtJnBY4N3ev0BAvkfuftseIpj3o7zJusCIOMsBQ0t1sSOQXbORrByB o2LgtLqkrx+G5Idq9PuUe4g4KwI1GZUEV3CYY+78gQy7xQGRnZI7FgwF/7RoaUR/YxjJ XNeA== X-Gm-Message-State: APjAAAX96ZbI7/9+VPSo3a6kdnjh+T0yR+bpjxubZQXHqMo2fPES1hlY pTz8vOht1hO+rqPWwsyxMZuUTRYNYcgx/SrxEDRvv9YYUA== X-Google-Smtp-Source: APXvYqy6TpWyUEotyzmG0Y1u61wBmbTjTMZX5dwJFamt6e7wi3PgA5vGiotOke8Ddnsgkr6LmqVgFsc4ImujEtUio/o= X-Received: by 2002:a05:6902:508:: with SMTP id x8mr72970498ybs.406.1564417810457; Mon, 29 Jul 2019 09:30:10 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Almog Gavra Date: Mon, 29 Jul 2019 09:29:59 -0700 Message-ID: Subject: Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON To: dev@kafka.apache.org Content-Type: multipart/alternative; boundary="000000000000df9679058ed467aa" --000000000000df9679058ed467aa Content-Type: text/plain; charset="UTF-8" I'm mostly happy with your current suggestion (two configs, one for serialization and one for deserialization) and your implementation suggestion. One thing to note: > We should _always_ be able to deserialize a standard JSON > number as a decimal I was doing some research into decimals and JSON, and I can imagine a compelling reason to require string representations to avoid losing precision and to be certain that whomever is sending the data isn't losing precision (e.g. https://stackoverflow.com/a/38357877/2258040). I'm okay with always allowing numerics, but thought it's worth raising the thought. On Mon, Jul 29, 2019 at 4:57 AM Andy Coates wrote: > The way I see it, we need to control two seperate things: > > 1. How do we _deserialize_ a decimal type if we encounter a text node in > the JSON? (We should _always_ be able to deserialize a standard JSON > number as a decimal). > 2. How do we chose how we want decimals to be _serialized_. > > This looks to fits well with your second suggestion of slightly different > configs names for serialization vs deserialization. > a, For deserialization we only care about how to handle text nodes: ` > deserialization.decimal.*text*.format`, which should only have two valid > values BINARY | TEXT. > b. For serialization we need all three: `serialization.decimal.format`, > which should support all three options: BINARY | TEXT | NUMERIC. > > Implementation wise, I think these should be two separate enums, rather > than one shared enum and throwing an error if the deserializer is set to > NUMERIC. Mainly as this means the enums reflect the options available, > rather than this being hidden in config checking code. But that's a minor > implementation detail. > > Personally, I'd be tempted to have the BINARY value named something like > `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move away from > it. > > It's a real shame that both of these settings require a default of BINARY > for backwards compatibility, but I agree that discussions / plans around > switching the defaults should not block this KIP. > > Andy > > > On Thu, 25 Jul 2019 at 18:26, Almog Gavra wrote: > > > Thanks for the replies Andy and Andrew (2x Andy?)! > > > > > Is the text decimal a base16 encoded number, or is it base16 encoded > > binary > > > form of the number? > > > > The conversion happens as decimal.unscaledValue().toByteArray() and then > > the byte array is converted to a hex string, so it's definitely the > binary > > form of the number converted to base16. Whether or not that's the same as > > the base16 encoded number is a good question (toByteArray returns a byte > > array containing a signed, big-endian, two's complement representation of > > the big integer). > > > > > One suggestion I have is to change the proposed new config to only > affect > > > decimals stored as text, i.e. to switch between the current base16 and > > the > > > more common base10. Then add another config to the serializer only > that > > > controls if decimals should be serialized as text or numeric. > > > > I think we need to be able to handle all mappings from serialization > format > > to deserialization format (e.g. read in BINARY and output TEXT), which I > > think would be impossible with the alternative suggestion. I agree that > > automatically deserializing numerics is valuable. I see two other ways to > > get this, both keeping the serialization.format config the same: > > > > - have json.decimal.deserialization.format accept all three formats. if > set > > to BINARY/TEXT, numerics would be automatically supported. If set to > > NUMERIC, then any string coming in would result in deserialization error > > (defaults to BINARY for backwards compatibility) > > - change json.decimal.deserialization.format to > > json.decimal.deserialization.string.format which accepts only BINARY/TEXT > > (defaults to BINARY for backwards compatibility) > > > > > would be a breaking change in that things that previously failed would > > > suddenly start deserializing. This is a price I'm willing to pay. > > > > I agree. I'm willing to pay this price too. > > > > > IMHO, we should then plan to switch the default of decimal > serialization > > to > > > numeric, and text serialization to base 10 in the next major release. > > > > I think that can be a separate discussion, I don't want to block this KIP > > on it. > > > > Thoughts? > > > > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto wrote: > > > > > This is a bit orthogonal, but in JsonSchemaConverter I use JSONSchemas > to > > > indicate whether a JSON number should be deserialized as an integer or > a > > > decimal > > > < > > > > > > https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261 > > > >. > > > Not everyone is going to have JSONSchemas available when converting, > but > > if > > > you do, it is an easy way to support JSON numbers as decimals. > > > > > > Carry on! :) > > > > > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates wrote: > > > > > > > Hi Almog, > > > > > > > > Like the KIP - I think being able to support decimals in JSON in the > > same > > > > way most other systems do is a great improvement. > > > > > > > > It's not 100% clear to me from the KIP what the current format is. > Is > > > the > > > > text decimal a base16 encoded number, or is it base16 encoded binary > > form > > > > of the number? (I've not tried to get my head around if these two are > > > even > > > > different!) > > > > > > > > One suggestion I have is to change the proposed new config to only > > affect > > > > decimals stored as text, i.e. to switch between the current base16 > and > > > the > > > > more common base10. Then add another config to the serialzier only > > that > > > > controls if decimals should be serialized as text or numeric. The > > > benefit > > > > of this approach is it allows us to enhance the deserializer to > > > > automatically handle numeric decimals even without any config having > to > > > be > > > > set, i.e. default config in the deserializer would be able to handle > > > > numeric decimals. Of course, this is a two edged sword: this would > > make > > > > the deserializer work out of the box with numeric decimals, (yay!), > but > > > > would be a breaking change in that things that previously failed > would > > > > suddenly start deserializing. This is a price I'm willing to pay. > > > > > > > > IMHO, we should then plan to switch the default of decimal > > serialization > > > to > > > > numeric, and text serialization to base 10 in the next major release. > > > > (With upgrade notes to match). Though I know this is more > contentious, > > I > > > > think it moves us forward in a much more standard way that the > current > > > > encoding of decimals. > > > > > > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra > wrote: > > > > > > > > > Hi Everyone! > > > > > > > > > > Kicking off discussion for a new KIP: > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON > > > > > > > > > > For those who are interested, I have a prototype implementation > that > > > > helped > > > > > guide my design: https://github.com/agavra/kafka/pull/1 > > > > > > > > > > Cheers, > > > > > Almog > > > > > > > > > > > > > > > --000000000000df9679058ed467aa--