hadoop-common-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14876) Create downstream developer docs from the compatibility guidelines
Date Tue, 31 Oct 2017 06:00:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226300#comment-16226300 ]

Anu Engineer commented on HADOOP-14876:
---------------------------------------

[~templedf] Thanks for putting in the effort to get this done. I really appreciate all the
thought that you have put into this document. I have some minor suggestions.

* Use case matrix: We have nine states; it would be nice to have a matrix that defines what
can change in which kind of release.
For example (based on InterfaceClassification.html) -- not suggesting that these are the
definitions, but something of this shape would make sense (see also the annotation sketch
after this list):
1. Public-Stable - Changes only in a major release.
2. Public-Evolving - Changes possible in major and minor releases.
3. Public-Unstable - Only for the Web UI.
4. Limited-Stable - Changes possible in major and minor releases.
5. Limited-Evolving - Changes possible in major and minor releases.
6. Limited-Unstable - Changes possible in major, minor, and maintenance releases.
7. Private-* - Changes possible in major, minor, and maintenance releases.
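For reference, these nine states are exactly the combinations of the audience and stability
annotations Hadoop already ships in org.apache.hadoop.classification. A minimal sketch of how
an API would be tagged (the class FooClient is hypothetical, used only to illustrate):

{code:java}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Hypothetical example class. Public-Stable in the matrix above would mean
// this API may change only in a major release.
@InterfaceAudience.Public
@InterfaceStability.Stable
public class FooClient {
  // A Limited-Evolving API would instead be tagged
  // @InterfaceAudience.LimitedPrivate({"HDFS"}) and @InterfaceStability.Evolving.
}
{code}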
* It would be good to define which kinds of releases are possible -- major, minor, and maintenance.
* Semantic compatibility:
The semantics of a cluster are also defined by its config files. The default values of settings,
and some new settings, can change those semantics. We should not break compatibility in
maintenance releases.
Currently, I am assuming that all configs are Public, but there are many that do not have
definitions in the default XML files. We should mandate that these values are not modified in
maintenance releases.
Perhaps we should add a clause that states:
"No new configuration shall be added which can change the behavior of an existing cluster.
For any new settings that are defined, care should be taken to ensure that they do not change
the behavior of existing clusters."
(A short sketch of how a changed default can silently alter behavior follows this item.)
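To make that concern concrete, here is a minimal sketch against the real
org.apache.hadoop.conf.Configuration API; the property name and default value are only
illustrative of the pattern, not a claim about any particular release:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class DefaultDrift {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A cluster that never sets this key inherits the default (the second
    // argument here, or the value shipped in the default XML). If a
    // maintenance release changed that default, every such cluster would
    // silently change behavior on upgrade, with no config edit by the admin.
    int sortMb = conf.getInt("mapreduce.task.io.sort.mb", 100);
    System.out.println("effective sort buffer = " + sortMb + " MB");
  }
}
{code}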
* "The list of client artifacts is as follows:" -- may I suggest that we add the word "current"
-- since someone could add new jar without breaking compact. IMHO, The guarantee should be
that we will not break existing code, if we wanted to add a new JAR, it should be possible.

* Hadoop Env Vars: "that are meaningful to Hadoop" -- this is a very loose definition. We should
list out what will not change; otherwise, all Hadoop variables are fair game. If that is the
intention, I suggest that we state it explicitly.
* Native Dependencies: As a non-native English speaker, I wonder if this statement is ambiguous:
"Changes to the minimum required versions SHOULD NOT increase between minor releases within
a major version, though updates because of security issues, license issues, or other reasons
may occur."
Could we rewrite it as:
"Hadoop will strive to keep the minimum required versions of external dependencies stable
during the lifetime of a major version. It is possible that, due to reasons like security,
licensing, or end-of-life of a component, we may be forced to upgrade."
* Protocol Dependencies: "The components of Apache Hadoop may have dependencies that include
their own protocols, such as Zookeeper, S3, Kerberos, etc. These protocol dependencies SHALL
be treated as internal protocols and governed by the same policy."
I don't think that we can treat S3 or Kerberos as internal protocols. I suggest that we rewrite
this as: "To the extent possible, we will strive to maintain the same policies for external
protocols (S3, Kerberos, etc.) that are used by Hadoop."
* Transports: "Fixed service port numbers MUST be kept consistent to prevent breaking clients."
Did you mean to write "default service ports" instead of "fixed"?
* "New transport mechanisms MUST only be introduced with minor or major version changes."
I am not sure why this constraint is placed; I am trying to understand how introducing a new
transport (assuming that the older transports remain stable) affects compatibility.
* Log output: "Log messages are intended for human consumption, though automation use cases
are also supported." Not sure if this is intended, but "automation use cases are also supported"
seems to imply that the log will be parsable and stable. I am sure that is not what we want to
offer. Should we just remove the automation phrase?
* "All log output SHALL be considered Public and Evolving."
I worry this is not sustainable. Let me provide an example: say I search for a word, such as
"block", and use that in a script which greps the logs to identify an event. Someone adds a
statement which contains the same word, and my parser stops working, even in a maintenance
release. (See the scraper sketch after this item.)
So in my mind, we should tag all log output as Private and Unstable, used only for human
consumption. If the intent is to specify that the log format will not change, then we should
say that it is the log format that is not changing.
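For illustration, a minimal sketch of the kind of brittle scraper I mean; the log lines and
the pattern are hypothetical:

{code:java}
import java.util.List;
import java.util.regex.Pattern;

public class BrittleLogScraper {
  public static void main(String[] args) {
    // Scraper written against today's log output: any line mentioning
    // "block" is assumed to be a block-related event.
    Pattern event = Pattern.compile(".*block.*");

    List<String> logLines = List.of(
        "Received block blk_1073741825 from /10.0.0.1",  // the intended match
        "Skipping blocked queue during shutdown");       // unrelated line added later

    // The second line also matches, so a log statement added even in a
    // maintenance release silently corrupts this scraper's event stream.
    logLines.stream()
        .filter(line -> event.matcher(line).matches())
        .forEach(line -> System.out.println("event: " + line));
  }
}
{code}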
* HDFS Metadata: "HDFS data nodes store data in a private directory structure. The schema of
that directory structure must remain stable to retain compatibility."
If we have an upgrade path, I submit that changing it should be possible. In fact, I think we
should simply say that upgrade and rollback of data stored in a data node should be possible.
* Command Line Interface: More of a question -- are we sure that the 3.0 release is entirely
compliant with this spec? For example, is the slaves.txt change covered by this, and if so,
is that change fully compatible?
* Hadoop Configuration Files: Please see my comment in the semantics section.
* Directory Structure: "Changing the directory structure of these user-accessible files can
break compatibility, even in cases where the original path is preserved via symbolic links."
Do you have a case where this has happened? If not, we should allow this change. Also,
"user-accessible" is a broad term. Does it mean all users, including admins? If it includes
admins, then every file that we ship with Hadoop falls into the scope of this statement. So
perhaps we should define what this means, or say that files accessed via the protocols offered
by HDFS (RPC and HTTP) will remain stable.
* Operating Systems: We should have a full list of supported versions documented somewhere.
Is there such a link? If so, can you please add a pointer to it in this document?

> Create downstream developer docs from the compatibility guidelines
> ------------------------------------------------------------------
>
>                 Key: HADOOP-14876
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14876
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 3.0.0-beta1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: Compatibility.pdf, DownstreamDev.pdf, HADOOP-14876.001.patch,
> HADOOP-14876.002.patch, HADOOP-14876.003.patch, HADOOP-14876.004.patch
>
>




