hadoop-common-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Google Cloud Storage connector into Hadoop
Date Tue, 08 Dec 2015 17:50:52 GMT

1. do what chris says: go for the abstract contract tests. They'll find the troublespots in
your code, like the way seek(-1) appears to have entertaining results, what happens on operations
to closed files, etc, and help identify where the semantics of your FS vary from HDFS.

2. You will need to stay with the versions of artifacts in the Hadoop codebase. Troublespots
there are protobuf (frozen @ 2.5) and guava (shipping with 11.0.2, code must run against 18.x+
if someone upgrades). If this is problematic you may want to discuss the versioning issues
there with your colleagues; see https://issues.apache.org/jira/browse/HADOOP-10101 for the

3. the object stores get undertested: jenkins doesn't touch them for patch review or nightly
runs, because you can't give jenkins the right credentials. Setting up your own jenkins server to
build the Hadoop versions and flag problems would be a great contribution here. Also: help
with the release testing; if someone has a patch for the hadoop-gcs module, reviewing and testing
that too would be great; it stops these patches being neglected.

4. We could do with some more scale tests of the object stores, to test creating many thousands
of small files, etc. Contributions welcome

5. We could do with a lot more downstream testing of things like hive & spark IO on object
stores, especially via ORC and Parquet. Helping to write those tests would stop regressions
in the stack, and help tune Hadoop for your FS.

6. Finally: don't be afraid to get involved with the rest of the codebase. It can only get
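
To make point 1 concrete, here is a minimal, self-contained sketch of the kind of checks a contract test makes. It is a toy model in Python, not Hadoop's Java contract test API: the class ToySeekableStream, the check functions, and the choice of exception types are all hypothetical and exist only for illustration. The assumed contract is that a negative seek fails loudly without moving the position, and that any I/O on a closed stream raises an error.

```python
class ToySeekableStream:
    """Hypothetical in-memory stream, used only to illustrate contract-style checks."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0
        self._closed = False

    def _check_open(self):
        if self._closed:
            raise IOError("stream is closed")

    def seek(self, pos: int):
        self._check_open()
        if pos < 0:
            # Assumed contract: reject negative offsets, leave position unchanged.
            raise EOFError(f"cannot seek to negative offset {pos}")
        self._pos = pos

    def tell(self) -> int:
        self._check_open()
        return self._pos

    def read(self, n: int) -> bytes:
        self._check_open()
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

    def close(self):
        # Closing an already-closed stream is a no-op.
        self._closed = True


def check_negative_seek_rejected(stream) -> bool:
    """True if seek(-1) raises EOFError and the position is unchanged."""
    before = stream.tell()
    try:
        stream.seek(-1)
    except EOFError:
        return stream.tell() == before
    return False


def check_closed_stream_rejects_io(stream) -> bool:
    """True if read, seek and tell all raise IOError after close()."""
    stream.close()
    stream.close()  # a second close must not raise
    for op in (lambda: stream.read(1), lambda: stream.seek(0), stream.tell):
        try:
            op()
        except IOError:
            continue
        return False
    return True


def run_contract_checks(factory) -> dict:
    """Run each check against a fresh stream produced by factory()."""
    return {
        "negative_seek_rejected": check_negative_seek_rejected(factory()),
        "closed_stream_rejects_io": check_closed_stream_rejects_io(factory()),
    }
```

Calling run_contract_checks(lambda: ToySeekableStream(b"hello")) returns a dict of check names to pass/fail flags. Because the checks are written once against a factory, the same suite can be pointed at each store implementation, which is how differences in semantics show up.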

> On 8 Dec 2015, at 00:20, James Malone <jamesmalone@google.com.INVALID> wrote:
> Haohui & Chris,
> Sounds great, thank you very much! We'll cut a JIRA once we get everything
> lined up.
> Best,
> James
> On Mon, Dec 7, 2015 at 3:54 PM, Chris Nauroth <cnauroth@hortonworks.com>
> wrote:
>> Hi James,
>> This sounds great!  Thank you for considering contributing the code.
>> Just seconding what Haohui said, there is existing precedent for
>> alternative implementations of the Hadoop FileSystem in our codebase.  We
>> currently have similar plugins for S3 [1], Azure [2] and OpenStack Swift
>> [3].  Additionally, we have a suite of FileSystem contract tests [4].
>> These tests are designed to help developers of alternative file systems
>> assess how closely they match the semantics expected by Hadoop ecosystem
>> components.
>> Many Hadoop users are accustomed to using HDFS instead of these
>> alternative file systems, so none of the alternatives are on the default
>> Hadoop classpath immediately after deployment.  Instead, the code for each
>> one is in a separate module under the "hadoop-tools" directory in the
>> source tree.  Users who need to use the alternative file systems take
>> extra steps post-deployment to add them to the classpath where necessary.
>> This achieves the dependency isolation needed.  For example, users who
>> never use the Azure plugin won't accidentally pick up a transitive
>> dependency on the Azure SDK jar.
>> I recommend taking a quick glance through the existing modules for S3,
>> Azure and OpenStack.  We'll likely ask that a new FileSystem
>> implementation follow the same patterns if feasible for consistency.  This
>> would include things like using the contract tests, having a provision to
>> execute tests both offline/mocked and live/integrated with the real
>> service and providing a documentation page that explains configuration for
>> end users.
>> For now, please feel free to file a HADOOP JIRA with your proposal.  We
>> can work out the details of all of this in discussion on that JIRA.
>> Something else to follow up on will be licensing concerns.  I see the
>> project already uses the Apache license, but it appears to be an existing
>> body of code initially developed at Google.  That might require a Software
>> Grant Agreement [5].  Again, this is something that can be hashed out in
>> discussion on the JIRA after you create it.
>> [1]
>> http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
>> [2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
>> [3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
>> [4]
>> http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
>> [5] http://www.apache.org/licenses/
>> --Chris Nauroth
>> On 12/7/15, 3:10 PM, "Haohui Mai" <ricetons@gmail.com> wrote:
>>> Hi,
>>> Thanks for reaching out. It would be great to see this in the Hadoop
>>> ecosystem.
>>> In Hadoop we have AWS S3 support. IMO they address similar use cases
>>> thus I think that it should be relatively straightforward to adopt the
>>> code.
>>> The only catch in my head right now is to properly isolate dependency.
>>> Not only the code needs to be put into a separate module, but many
>>> Hadoop applications also depend on different versions of Guava. I
>>> think it might be a problem that needs some attention at the very
>>> beginning.
>>> Please feel free to reach out if you have any other questions.
>>> Regards,
>>> Haohui
>>> On Mon, Dec 7, 2015 at 2:35 PM, James Malone
>>> <jamesmalone@google.com.invalid> wrote:
>>>> Hello,
>>>> We're from a team within Google Cloud Platform focused on OSS and data
>>>> technologies, especially Hadoop (and Spark). Before we cut a JIRA for
>>>> something we'd like to do, we wanted to reach out to this list to ask two
>>>> quick questions, describe our proposed action, and check for any major
>>>> objections.
>>>> Proposed action:
>>>> We have a Hadoop connector[1] (more info[2]) for Google Cloud Storage (GCS)
>>>> which we have been building and maintaining for some time. After we clean
>>>> up our code and tests to conform (to these[3] and other requirements) we
>>>> would like to contribute it to Hadoop. We have many customers using the
>>>> connector in high-throughput production Hadoop clusters; we'd like to make
>>>> it easier and faster to use Hadoop and GCS.
>>>> Timeline:
>>>> Presently, we are working on the beta of Google Cloud Dataproc[4] which
>>>> limits our time a bit, so we're targeting late Q1 2016 for creating a JIRA
>>>> issue and adapting our connector code as needed.
>>>> Our (quick) questions:
>>>> * Do we need to take any (non-coding) action for this beyond submitting a
>>>> JIRA when we are ready?
>>>> * Are there any up-front concerns or questions which we can (or will need
>>>> to) address?
>>>> Thank you!
>>>> James Malone
>>>> On behalf of the Google Big Data OSS Engineering Team
>>>> Links:
>>>> [1] -
>>>> https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
>>>> [2] - https://cloud.google.com/hadoop/google-cloud-storage-connector
>>>> [3] -
>>>> https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
>>>> [4] - https://cloud.google.com/dataproc
