hadoop-common-dev mailing list archives

From "Elek, Marton (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-14898) Create official Docker images for development and testing features
Date Fri, 22 Sep 2017 09:17:00 GMT
Elek, Marton created HADOOP-14898:
-------------------------------------

             Summary: Create official Docker images for development and testing features 
                 Key: HADOOP-14898
                 URL: https://issues.apache.org/jira/browse/HADOOP-14898
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: Elek, Marton
            Assignee: Elek, Marton


This is the original mail from the mailing list:

{code}
TL;DR: I propose to create official hadoop images and upload them to the dockerhub.

GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use docker based
recipes to start hadoop clusters with various configurations.

The images could also be used to test experimental features. For example ozone could be tested
easily with this compose file and configuration:

https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6

Or even the configuration could be included in the compose file:

https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml

I would like to create separate example compose files for federation, HA, metrics usage,
etc. to make it easier to try out and understand the features.

CONTEXT: There is an existing Jira, https://issues.apache.org/jira/browse/HADOOP-13397,
but it's about a tool to generate production quality docker images (multiple types, in a
flexible way). If there are no objections, I will create a separate issue to create simplified
docker images for rapid prototyping and investigating new features, and register the branch
on the dockerhub to create the images automatically.

MY BACKGROUND: I have been working with docker based hadoop/spark clusters for quite a while
and have run them successfully in different environments (kubernetes, docker-swarm, nomad-based
scheduling, etc.). My work is available here: https://github.com/flokkr but those images handle
more complex use cases (eg. instrumenting java processes with btrace, or reading/reloading
configuration from consul).
And IMHO it's better for the official hadoop documentation to suggest the official apache
docker images and not external ones (which could change).
{code}

The following list enumerates the key decision points regarding docker image creation:

A. Automated dockerhub build / jenkins build

Docker images could be built on the dockerhub (a branch pattern and the location of the Dockerfiles
have to be defined for a github repository) or could be built on a CI server and pushed.

The second option is more flexible (it's easier to create matrix builds, for example).
The first one has the advantage that we get an additional flag on the dockerhub indicating that
the build is automated (and built from the source by the dockerhub).

The decision is easy, as the ASF supports the first approach (see https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)

B. Source: binary distribution or source build

The second question is how to create the docker image. One option is to build the software
on the fly during the creation of the docker image; the other is to use the binary releases.

I suggest using the second approach, as:

1. In that case hadoop:2.7.3 could contain exactly the same hadoop distribution as the
downloadable one

2. We don't need to add development tools to the image, so the image can be smaller (which
is important, as the goal of this image is to let users get started as fast as possible)

3. The docker definition will be simpler (and easier to maintain)

This approach is also commonly used in other projects (I checked Apache Zeppelin and Apache Nutch).
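
As an illustration only, a binary-release based Dockerfile could look roughly like the following
sketch (base image, download URL and install paths are assumptions, not a final proposal):

{code}
# Sketch only: base image, download URL and paths are assumptions
FROM openjdk:8-jdk
ENV HADOOP_VERSION=2.7.3
# Download and unpack the official binary release instead of building from source
RUN curl -fsSL https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
      | tar -xz -C /opt \
    && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
ENV HADOOP_HOME=/opt/hadoop PATH=$PATH:/opt/hadoop/bin
{code}

The point of the sketch is only that the image repackages an already released tarball, so its
content matches the downloadable distribution.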

C. Branch usage

The next question is the location of the Dockerfile. It could live on the official source-code
branches (branch-2, trunk, etc.) or we can create separate branches for the dockerhub (eg.
docker/2.7, docker/2.8, docker/3.0).

With the first approach it's easier to find the Dockerfiles, but it's less flexible. For
example, if the Dockerfile lives on the source-code branches, it has to be used for every release
(for example the Dockerfile from the tag release-3.0.0 would be used for the 3.0 hadoop
docker image). In that case the release process becomes much harder: in case of a Dockerfile
error (which can be tested on the dockerhub only after the tagging), a new release would be
needed just to fix the Dockerfile.

Another problem is that with tags it's not possible to improve the Dockerfiles over time. I can
imagine that we would like to improve, for example, the hadoop:2.7 images (for example by adding
smarter startup scripts) while using exactly the same hadoop 2.7 distribution.

Finally, with the tag-based approach we can't create images for older releases (2.8.1, for example).

So I suggest creating separate branches for the Dockerfiles.

D. Versions

We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or just for the minor
version (2.8/2.7). As these docker images are not for production but for prototyping, I
suggest using (at least as a first step) just 2.7/2.8 and updating the images with each
bugfix release.

E. Number of images

There are two options here, too: create a separate image for every component (namenode, datanode,
etc.) or just one image, where the command has to be defined manually everywhere. The second seems
more complex to use, but I think the maintenance is easier, and it's more visible
what should be started (see the compose sketch below).
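
A minimal sketch of the single-image option, assuming a hypothetical apache/hadoop image whose
entrypoint takes the daemon to start as its command (image name, commands and ports are assumptions):

{code}
# Sketch only: image name, commands and ports are assumptions
version: "3"
services:
  namenode:
    image: apache/hadoop:2.7.3
    command: ["hdfs", "namenode"]   # same image, only the command differs per component
    ports:
      - "50070:50070"
  datanode:
    image: apache/hadoop:2.7.3
    command: ["hdfs", "datanode"]
{code}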

F. Snapshots

According to the spirit of the Release policy:

https://www.apache.org/dev/release-distribution.html#unreleased

We should distribute only final releases to the dockerhub and not snapshots. But we can create
an empty hadoop-runner image as well, which contains the starter scripts but not hadoop itself.
It could be used for local development, where the newly built distribution is mapped
into the image with docker volumes (a sketch follows below).
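
A sketch of how such a hadoop-runner image could be used for local development, assuming the
freshly built distribution ends up under hadoop-dist/target (image name and host path are assumptions):

{code}
# Sketch only: image name and host path are assumptions
version: "3"
services:
  namenode:
    image: apache/hadoop-runner
    command: ["hdfs", "namenode"]
    volumes:
      # map the locally built, unreleased distribution into the container
      - ../hadoop-dist/target/hadoop-3.1.0-SNAPSHOT:/opt/hadoop
{code}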




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


