From: Tsuyoshi Ozawa
Date: Mon, 27 Mar 2017 21:16:21 +0900
Subject: Can we update protobuf's version on trunk?
To: common-dev@hadoop.apache.org, yarn-dev@hadoop.apache.org, hdfs-dev@hadoop.apache.org, mapreduce-dev@hadoop.apache.org

Dear Hadoop developers,

Now that the shaded client introduced by HADOOP-11804 has been merged, we can more easily update dependencies on trunk while minimizing the impact on backward compatibility. (Thanks Sean and Sangjin for taking the issue!)

So, is it time to update protobuf to the latest version on trunk? Could you share your opinions here?

There have been several discussions in parallel so far, so I would like to summarize the current opinions of developers as I understand them.

Stack mentioned on HADOOP-13363:

* Would this be a problem? Old clients can talk to the new servers because of wire compatibility. Is anyone other than Hadoop consuming the Hadoop protos directly? Are the Hadoop proto files considered InterfaceAudience.Private or InterfaceAudience.Public? If the former, I could work on a patch for 3.0.0 (it'd be big but boring). Does Hadoop have protobuf in its API anywhere? (I can take a look, but I'm being lazy and asking here first.)

gohadoop[1] uses the proto files directly, treating them as a stable interface.

[1] https://github.com/hortonworks/gohadoop/search?utf8=%E2%9C%93&q=*proto&type=

Fortunately, no additional work is needed to compile the Hadoop code base. The only change I made was to update getOndiskTrunkSize's argument to take a protobuf v3 object[2]. Please point it out if I'm missing something.

[2] https://issues.apache.org/jira/secure/attachment/12860647/HADOOP-13363.004.patch

There are some concerns against updating protobuf, raised on HDFS-11010:

* I'm really hesitant to bump PB considering the pain it brought last time. (by Andrew)

This concern is about *binary* compatibility, not wire compatibility. If I understand correctly, last time the problem was caused by v2.4.0 and v2.5.0 classes being mixed between Hadoop and HBase. (I learned this from Steve's comment on HADOOP-13363[3].) As I mentioned at the beginning, protobuf is now shaded on trunk, so we no longer need to worry about binary (source-code-level) compatibility.

[3] https://issues.apache.org/jira/browse/HADOOP-13363?focusedCommentId=15372724&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15372724

* Have we checked if it's wire compatible with our current version of PB? (by Andrew)

As far as I know, protobuf v2 and v3 are wire compatible, and the Google team has been testing this. Of course, we can also validate it ourselves using the following compatibility test suite:

https://chromium.googlesource.com/external/github.com/google/protobuf/+/master/java/compatibility_tests/README.md

* Let me ask the question in a different way: what about PB 3 is concerning to you? (by Anu)

* Some of its incompatibilities with 2.x, such as dropping unknown fields from records. Any component that proxies records must have an updated version of the schema, or it will silently drop data and convert unknown values to defaults. Unknown enum value handling has changed. There's no mention of the convenient "Embedded messages are compatible with bytes if the bytes contain an encoded version of the message" semantics in proto3. (by Chris)

This is what we need to discuss. Quoting the documentation from Google's developer manual,
https://developers.google.com/protocol-buffers/docs/proto3#unknowns

> For most Google protocol buffers implementations, unknown fields are not accessible in proto3 via the corresponding proto runtimes, and are dropped and forgotten at deserialization time. This is different behaviour to proto2, where unknown fields are always preserved and serialized along with the message.
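To make the proxy scenario Chris describes concrete, here is a minimal sketch. The message and field names are hypothetical, invented only for illustration; they are not taken from the Hadoop .proto files:

    // Version 1 of a hypothetical schema, as compiled into an
    // intermediate daemon that proxies records between processes.
    syntax = "proto3";

    message BlockReport {
      string block_pool_id = 1;
      // Field number 2 does not exist yet in this version.
    }

    // Version 2 of the same hypothetical schema, as used by an
    // updated client and server on either side of that proxy.
    syntax = "proto3";

    message BlockReport {
      string block_pool_id = 1;
      uint64 total_blocks = 2;  // Unknown to a proxy built against v1.
    }

Under proto2 semantics, a proxy built against version 1 that parses and re-serializes a message carries field 2 through as an unknown field. Under the proto3 semantics quoted above, the same proxy drops field 2 at deserialization time, so the message reaches the server with total_blocks silently reset to its default value (0).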
Is this incompatibility acceptable for us, or not? If we need to check some test cases before updating protobuf, it would be good to clarify them here and run them now.

Best regards,
- Tsuyoshi

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org