hadoop-common-dev mailing list archives

From "Zheng, Kai" <kai.zh...@intel.com>
Subject RE: A top container module like hadoop-cloud for cloud integration modules
Date Sun, 19 Jun 2016 10:50:13 GMT
Thanks Steve for the feedback and thoughts. 

It looks like people don't want to move the related modules around, as it may not add much real
value. That's fine. I may offer better thoughts later, once I've learned this area in more depth.


-----Original Message-----
From: Steve Loughran [mailto:stevel@hortonworks.com] 
Sent: Wednesday, June 15, 2016 6:16 PM
To: Zheng, Kai <kai.zheng@intel.com>
Cc: common-dev@hadoop.apache.org
Subject: Re: A top container module like hadoop-cloud for cloud integration modules

> On 13 Jun 2016, at 14:02, Zheng, Kai <kai.zheng@intel.com> wrote:
> Hi,
> I've noticed an obvious trend: Hadoop is supporting more and more cloud platforms. I suggest
> we have a top container module to hold such integration modules, like the ones for aws, openstack,
> azure and the upcoming one, aliyun. The rationale is simple, besides the trend:

I'm kind of =0 right now

> 1. Existing modules are mixed into hadoop-tools, which is getting a little big at
> 18 modules now. Cloud-specific ones can be grouped together and separated out, making more

The reason for having separate hadoop-aws and hadoop-openstack modules was always to permit the
modules to use APIs exclusive to cloud infrastructures, structure the downstream dependencies,
*and* allow people like the EMR team to swap in their own closed-source version. I don't think
anyone does that, though.

It also lets us completely isolate testing: each module's tests only run if you have the credentials.

> 2. Future abstraction and common specs & code sharing could be easier or thereafter

Right now hadoop-common is where cross-FS work and tests go. (Hint: reviewers for HADOOP-12807
needed.) I think we could start there with an org.apache.hadoop.cloud package and only split
it out if compilation ordering merits it, or if it adds any dependencies to hadoop-common.

> 3. A common testing approach could be defined together, for example, some mechanisms
> as discussed by Chris, Steve and Allen in HADOOP-12756;

In SPARK-7481 I've added downstream tests for S3a and Azure in Spark; these show that S3a
in Hadoop 2.6 gets its blocksize wrong (0) in listings, so the splits are all 1 byte wrong and
work dies. I think downstream tests in Spark, Hive, etc. would really round out cloud infra
testing, but we can't put those into Hadoop as the build DAG prevents it. (Reviews for SPARK-7481
needed too, BTW.) System tests of Aliyun and perhaps GFS connectors would need to go in there
or in Bigtop, which is the other place I've discussed having cloud integration tests.
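
As a rough illustration of why a reported block size of 0 can be so damaging, here is a minimal sketch. It assumes the split size is derived as max(minSize, min(maxSize, blockSize)), the formula used by FileInputFormat in the org.apache.hadoop.mapreduce API, with a minimum split size of 1 byte. The class and helper names are hypothetical, the slop factor applied by the real code is ignored, and the code path exercised by the Spark tests may differ.

```java
// Hypothetical, self-contained sketch; not code from Hadoop or Spark.
public class SplitSizeSketch {

    // Assumed formula: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits needed to cover a file (ceiling division, no slop factor).
    static long splitCount(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long minSize = 1L;               // assumed default minimum split size
        long maxSize = Long.MAX_VALUE;   // assumed default maximum split size
        long fileLength = 64L * 1024 * 1024;

        // A filesystem that reports a sensible 32 MB block size.
        long healthy = computeSplitSize(32L * 1024 * 1024, minSize, maxSize);
        // A filesystem that reports a block size of 0 in its listings.
        long broken = computeSplitSize(0L, minSize, maxSize);

        System.out.println("healthy split size: " + healthy
                + ", splits: " + splitCount(fileLength, healthy));
        System.out.println("broken split size: " + broken
                + ", splits: " + splitCount(fileLength, broken));
    }
}
```

Under these assumptions, a block size of 0 collapses the split size to the 1-byte minimum, so a 64 MB file would be carved into one split per byte.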

> 4. Documentation for "Hadoop on Cloud"? Not sure it's needed, as we already have
> a section for "Hadoop compatible File Systems".

Again, we can stick this in common.

> If this sounds good, the change would be a good fit for Hadoop 3.0, even though it
> should not have a big impact, as it can avoid affecting the artifacts. It may cause some
> inconvenience for current development efforts, though.

I think it would make sense if other features went in. A good committer against object stores
would be an example here: it depends on the MR libraries, so can't go into common. Today it'd
have to go into hadoop-mapreduce. This isn't too bad, as long as the APIs it uses are all
in hadoop-common. It's only as things get more complex that it matters.
