hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "GitAndHadoop" by SteveLoughran
Date Sat, 26 Nov 2011 20:11:40 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "GitAndHadoop" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/GitAndHadoop?action=diff&rev1=15&rev2=16

Comment:
add a lot more on managing branches between apache and github repositories

   * GitHub provides some good lessons on git at [[http://learn.github.com]]
   * Apache serves up read-only Git versions of their source at [[http://git.apache.org/]]. You cannot commit changes over Git; for that, patches need to be applied to the SVN repositories.
  
- == Before you begin ==
  
-  1. You need a copy of git on your system. Some IDEs ship with Git support; this page assumes
you are using the command line.
-  1. You need a copy of Ant 1.7+ on your system for the builds themselves.
-  1. You need to be online for your first checkout and build, and any subsequent build which
needs to download new artifacts from the central JAR repositories.
-  1. You need to set Ant up so that it works with any proxy you have. This is documented
by [[http://ant.apache.org/manual/proxy.html |the ant team]].
+ 
+ 
+ This page tells you how to work with Git. See HowToContribute for instructions on building
and testing Hadoop.
+ 
+ == Key Git Concepts ==
+ The key concepts of Git:
+ 
+  * Git doesn't store changes; it snapshots the entire source tree. This makes switching and rolling back fast, but is bad for binaries. (As an optimisation, a file that hasn't changed is not stored again.)
+  * Git stores all "events" as SHA-1 checksummed objects; you have deltas, tags and commits, where a commit describes the status of items in the tree (see the sketch after this list).
+  * Git is very branch-centric; you work in your own branch off local or central repositories.
+  * You had better enjoy merging.
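+ 
+ For example, you can inspect those objects directly. A minimal sketch (the SHA-1s printed will be whatever your repository contains):
+ 
+ {{{
+ # print the SHA-1 of the current commit
+ git rev-parse HEAD
+ # show the type of the object behind HEAD (a commit)
+ git cat-file -t HEAD
+ # show its content: the tree SHA-1, parent commit, author and message
+ git cat-file -p HEAD
+ }}}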
  
  
  == Checking out the source ==
  
+ You need a copy of git on your system. Some IDEs ship with Git support; this page assumes
you are using the command line.
+ 
- The first step is to create your own Git repository from the Apache repository. The hadoop
subprojects (common, HDFS, and MapReduce) live inside a combined repo called `hadoop-common.git`.
+ Clone a local Git repository from the Apache repository. The Hadoop subprojects (common,
HDFS, and MapReduce) live inside a combined repository called `hadoop-common.git`.
  
  {{{
  git clone git://git.apache.org/hadoop-common.git
  }}}
- The total download is well over 100MB, so the initial checkout process works best when the
network is fast. Once downloaded, Git works offline.
+ 
+ The total download is well over 100MB, so the initial checkout process works best when the
network is fast. Once downloaded, Git works offline -though you will need to perform your
initial builds online so that the build tools (Maven, Ivy &c) can download dependencies.
  
  == Grafts for complete project history ==
  
- The Hadoop project has undergone some movement in where its component parts have been versioned.
Because of that, commands like `git log --follow` need to have a little help. To graft the
history back together into a coherent whole, insert the following contents into `hadoop-common/.git/info/grafts`:
+ The Hadoop project has undergone some movement in where its component parts have been versioned. Because of that, commands like `git log --follow` need a little help. To graft the history back together into a coherent whole, insert the following contents into `hadoop-common/.git/info/grafts`:
  
  {{{
  5128a9a453d64bfe1ed978cf9ffed27985eeef36 6c16dc8cf2b28818c852e95302920a278d07ad0c
@@ -42, +51 @@

  
   1. Create a GitHub login at http://github.com/ ; Add your public SSH keys
   1. Go to http://github.com/apache and search for the Hadoop and other Apache projects you
want (avro is handy alongside the others)
-  1. For each project, fork. This gives you your own repository URL which you can then clone
locally with {{{git clone}}}
+  1. For each project, fork in the GitHub UI. This gives you your own repository URL, which you can then clone locally with {{{git clone}}} (as sketched after this list)
   1. For each patch, branch.
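+ 
+ A minimal sketch of the fork-then-clone step, assuming your GitHub user name is MYUSERNAMEHERE:
+ 
+ {{{
+ # clone your fork over SSH (uses the public key you registered)
+ git clone git@github.com:MYUSERNAMEHERE/hadoop-common.git
+ cd hadoop-common
+ # one branch per JIRA issue, e.g. a hypothetical HADOOP-XYZ
+ git branch HADOOP-XYZ
+ }}}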
  
+ At the time of writing (December 2009), GitHub was updating its copy of the Apache repositories
every hour. As the Apache repositories were updating every 15 minutes, provided these frequencies
are retained, a GitHub-fork derived version will be at worst 1 hour and 15 minutes behind
the ASF's SVN repository. If you are actively developing on Hadoop, especially committing
code into the SVN repository, that is too long -work off the Apache repositories instead.

  
- == Building the source ==
+  1. Clone the read-only repository from GitHub (their recommendation) or from Apache (the ASF's recommendation)
+  1. In that clone, rename the "origin" remote to "apache": {{{git remote rename origin apache}}}
+  1. Log in to [[http://github.com]]
+  1. Create a new repository (e.g. hadoop-fork)
+  1. In the existing clone, add the new repository:
+  {{{git remote add -f github git@github.com:MYUSERNAMEHERE/hadoop-common.git}}}
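+ 
+ You can check that both remotes are wired up; a sketch (the URLs listed will be your own):
+ 
+ {{{
+ git remote -v
+ # apache  git://git.apache.org/hadoop-common.git (fetch)
+ # github  git@github.com:MYUSERNAMEHERE/hadoop-common.git (fetch)
+ }}}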
  
- You need to tell all the Hadoop modules to get a local JAR of the bits of Hadoop they depend
on. You do this by making sure your Hadoop version does not match anything public, and to
use the "internal" repository of locally published artifacts.
+ This gives you a local repository with two remote repositories: "apache" and "github". Apache
has the trunk branch, which you can update whenever you want to get the latest ASF version:
  
- === Create a build.properties file ===
- 
- Create a {{{build.properties}}} file. Do not do this in the git directories, do it one up.
This is going to be a shared file. This article assumes you are using Linux or a different
Unix, incidentally.
- 
- Make the file something like this:
  {{{
+  git checkout trunk
+  git pull apache
- #this is essential
- resolvers=internal
- #you can increment this number as you see fit
- version=0.22.0-alpha-1
- project.version=${version}
- hadoop.version=${version}
- hadoop-core.version=${version}
- hadoop-hdfs.version=${version}
- hadoop-mapred.version=${version}
  }}}
  
- The {{{resolvers}}} property tells Ivy to look in the local maven artifact repository for
versions of the Hadoop artifacts; if you don't set this then only published JARs from the
central repostiory will get picked up.
+ Your own branches can be merged with trunk and pushed out to GitHub. To generate patches for submitting as JIRA patches, check everything in to your specific branch, merge that with (a recently pulled) trunk, then diff the two:
+ {{{ git diff --no-prefix trunk > ../hadoop-patches/HADOOP-XYZ.patch }}}
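+ 
+ Spelled out, assuming your issue branch is a hypothetical HADOOP-XYZ:
+ 
+ {{{
+ git checkout HADOOP-XYZ
+ # fold in the freshly pulled trunk
+ git merge trunk
+ # diff against trunk to produce the patch
+ git diff --no-prefix trunk > ../hadoop-patches/HADOOP-XYZ.patch
+ }}}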
  
- The version property, and descendents, tells Hadoop which version of artifacts to create
and use. Set this to something different (ideally ahead of) what is being published, to ensure
that your own artifacts are picked up.
+ If you are working deep in the code, it is convenient not only to have a directory full of patches for the JIRA issues, but to make that directory a git repository that is pushed to a remote server, such as [[https://github.com/steveloughran/hadoop-patches|this example]]. Why? It helps you move patches from machine to machine without having to do all the updating and merging. From a pure-git perspective this is wrong: it loses history, but for a mixed git/svn workflow it doesn't matter so much.
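+ 
+ A minimal sketch of such a setup, assuming you have already created an (empty, hypothetical) hadoop-patches repository under your GitHub account:
+ 
+ {{{
+ cd ../hadoop-patches
+ git init
+ git add *.patch
+ git commit -m "patches in progress"
+ git remote add origin git@github.com:MYUSERNAMEHERE/hadoop-patches.git
+ git push origin master
+ }}}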
  
- Next, symlink this file to every Hadoop module. Now a change in the file gets picked up
by all three.
- {{{
- pushd common; ln -s ../build.properties build.properties; popd
- pushd hdfs; ln -s ../build.properties build.properties; popd
- pushd mapreduce; ln -s ../build.properties build.properties; popd
- }}}
- 
- You are now all set up to build.
- 
- === Build Hadoop ===
- 
-  1. In {{{common/}}} run {{{ant mvn-install}}}
-  1. In {{{hdfs/}}} run {{{ant mvn-install}}}
-  1. In {{{mapreduce/}}} run {{{ant mvn-install}}}
- 
- This Ant target not only builds the JAR files, it copies it to the local {{{${user.home}/.m2}}}
directory, where it will be picked up by the "internal" resolver. You can check that this
is taking place by running {{{ant ivy-report}}} on a project and seeing where it gets its
dependencies.
- 
- '''Warning:''' it's easy for old JAR versions to get cached and picked up. You will notice
this early if something in hadoop-hdfs or hadoop-mapreduce doesn't compile, but if you are
unlucky things do compile, just not work as your updates are not picked up. Run {{{ant clean-cache}}}
to fix this. 
- 
- By default, the trunk of the HDFS and mapreduce projects are set to grab the snapshot versions
that get built and published into the Apache snapshot repository nightly. While this saves
developers in these projects the complexity of having to build and publish the upstream artifacts
themselves, it doesn't work if you do want to make changes to things like hadoop-common. You
need to make sure the local projects are picking up what's being built locally. 
- 
- To check this in the hadoop-hdfs project, generate the Ivy dependency reports using the
internal resolver:
- {{{
- ant ivy-report -Dresolvers=internal
- }}}
- 
- Then browse to the report page listed at the bottom of the process, switch to the "common"
tab, and look for hadoop-common JAR. It should have a publication timestamp which contains
the date and time of your local build. For example, the string "	20110211174419"> means
the date 2011-02-11 and the time of 17:44:19. If an older version is listed, you probably
have it cached in the ivy cache -you can fix this by removing everything from the org.apache
corner of this cache.
- 
- {{{
- rm -rf ~/.ivy2/cache/org.apache.hadoop
- }}}
- 
- Rerun the {{{ivy-report}}} target and check that the publication date is current to verify
that the version is now up to date.
- 
- 
- === Testing ===
- 
- Each project comes with lots of tests; run {{{ant test}}} to run the all, {{{ant test-core}}}
for the core tests. If you have made changes to the build and tests fail, it may be that the
tests never worked on your machine. Build and test the unmodified source first. Then keep
an eye on both the main source and any branch you make. A good way to do this is to give a
Continuous Integration server such as Hudson this job: checking out, building and testing
both branches.
- 
- Remember, the way Git works, your machine's own repository is something that other machines
can fetch from. So in theory, you could set up a Hudson server on another machine (or VM)
and have it pull and test against your local code. You will need to run it on a separate machine
to avoid your own builds and tests from interfering with the Hudson runs.
  
  == Branching ==
  
- Git makes it easy to branch. The recommended process for working with Apache projects is:
one branch per JIRA issue. That makes it easy to isolate development and track the development
of each change. It does mean if you have your own branch that you release, one that merges
in more than one issue, you have to invest some effort in merging everything in. Try not to
make changes in different branches that are hard to merge, and learn your way round the git
rebase command to handle changes across branches.
+ Git makes it easy to branch. The recommended process for working with Apache projects is one branch per JIRA issue. That makes it easy to isolate development and track the development of each change. It does mean that if you have your own branch that you release, one that merges in more than one issue, you have to invest some effort in merging everything in. Try not to make changes in different branches that are hard to merge, and learn your way round the git rebase command to handle changes across branches. Better yet: do not use rebase once you have created a chain of branches that each depend on each other.
- 
- One thing you need to look out for is making sure that you are building the different Hadoop
projects together; that you have not published on one branch and built on another. This is
because both Ivy and Maven publish artifacts to shared repository cache directories.
- 
-  1. Don't be afraid to {{{rm -rf ~/.m2/repository/org/apache/hadoop}}}  and {{{rm -rf ~/.ivy2/cache/org.apache.hadoop}}}
to remove local copies of artifacts.
-  1. Use different version properties in different branches to ensure that different versions
are not accidentally picked up
-  1. Avoid using {{{latest.version}}} as the version marker in Ivy, as that gives you the
last built.
-  1. Don't build/test different branches simultaneously, such as by running Hudson on your
local machine while developing on the console. The trick here is bring up Hudson in a virtual
machine, running against the Git repository on your desktop. Git lets you do this, which lets
you run Hudson against your private branch.
  
  === Creating the branch ===
  
  Creating a branch is quick and easy
  {{{
- #start off in your trunk
+ #start off in the apache trunk
  git checkout trunk
  #create a new branch from trunk
  git branch HDFS-775
@@ -146, +102 @@

  Assuming your trunk repository is in sync with the Apache projects, you can use {{{git diff}}}
to create a patch file.
  First, have a directory for your patches:
  {{{
- mkdir ../outgoing
+ mkdir ../hadoop-patches
  }}}
  Then generate a patch file listing the differences between your trunk and your branch
  {{{
- git diff --no-prefix trunk > ../outgoing/HDFS-775-1.patch
+ git diff --no-prefix trunk > ../hadoop-patches/HDFS-775-1.patch
  }}}
  The patch file is an extended version of the unified patch format used by other tools; type
{{{git help diff}}} to get more details on it. Here is what the patch file in this example
looks like
  {{{
@@ -183, +139 @@

  }}}
  It is essential that patches for JIRA issues are generated with the {{{--no-prefix}}} option. Without that an extra directory path is listed, and the patches can only be applied with a {{{patch -p1}}} call, ''which Hudson does not know to do''. If you want your patches to take, this is what you have to do. You can of course test this yourself by using a command like {{{patch -p0 < ../hadoop-patches/HDFS-775-1.patch}}} in a copy of the SVN source tree to verify that your patch takes.
  
+ === Updating your patch ===
- If you have checked in your patch, then you need to refer to the patch by name (SHA1 checksum),
and that of the preceeding patch. If it was the last patch and nothing else changed, this
is easy.
- {{{
- git diff --no-prefix HEAD~1 HEAD
- }}}
  
+ If your patch is not immediately accepted, do not be offended: it happens to us all. It does introduce a problem: your branches become out of date. You need to check out the latest Apache version, merge your branches with it, and then push the changes back to GitHub:
+ 
+ {{{
+  git checkout trunk
+  git pull apache
+  git checkout mybranch
+  git merge trunk
+  git push github mybranch
+ }}}
+ 
+ Your branch is now up to date, and new diffs can be created and attached to the JIRA issue.
+ 
+ === Deriving Branches from Branches ===
+ 
+ If you have one patch that depends upon another, you should have a separate branch for each one. Simply merge the changes from the first branch into the second, so that the second is always kept up to date with the first. To create a patch file for submission as a JIRA patch, do a diff between the two branches, not against trunk; for example:
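+ 
+ A sketch, with hypothetical branch names HDFS-775 and a dependent HDFS-775-2:
+ 
+ {{{
+ git checkout HDFS-775-2
+ # keep the second branch current with the first
+ git merge HDFS-775
+ # the JIRA patch is the difference between the two branches
+ git diff --no-prefix HDFS-775 > ../hadoop-patches/HDFS-775-2.patch
+ }}}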
+ 
+ '''Do not play with rebasing once you start doing this, as you will make merging a nightmare.'''
+ 
+ === What to do when your patch is committed ===
+ 
+ Once your patch is committed into SVN, you do not need the branch any more. You can delete it straight away, but it is safer to verify that the patch is completely merged in.
+ 
+ Pull down the latest trunk and verify that the patch branch is synchronized with it:
+ 
+ {{{
+  git checkout trunk
+  git pull apache
+  git checkout mybranch
+  git merge trunk
+  git diff trunk
+ }}}
+ 
+ The output of the last command should be empty: the two branches should be identical. You can then prove this to git by switching back to the trunk branch and merging in the branch, an operation which will not change the source tree but will update Git's branch graph.
+ 
+ {{{
+  git checkout trunk
+  git merge mybranch
+ }}}
+ 
+ Now you can delete the branch without git warning you that it is unmerged:
+ {{{
+  git branch -d mybranch
+ }}}
+ 
+ Finally, propagate that deletion to your private github repository
+ {{{
+  git push github :mybranch
+ }}}
+ 
+ This odd syntax says "push nothing to mybranch on github", which deletes the remote branch.
+ 
+ 
+ == Building with a Git repository ==
+ 
+ ''The information below this line is relevant for versions of Hadoop before 0.23.x, and should be considered obsolete for later versions. It is probably out of date for Hadoop 0.22 as well.''
+ 
+ == Building the source ==
+ 
+ You need to tell all the Hadoop modules to get a local JAR of the bits of Hadoop they depend on. You do this by making sure your Hadoop version does not match anything public, and by using the "internal" repository of locally published artifacts.
+ 
+ === Create a build.properties file ===
+ 
+ Create a {{{build.properties}}} file. Do not do this in the git directories; do it one level up, as this is going to be a shared file. Incidentally, this article assumes you are using Linux or another Unix.
+ 
+ Make the file something like this:
+ {{{
+ #this is essential
+ resolvers=internal
+ #you can increment this number as you see fit
+ version=0.22.0-alpha-1
+ project.version=${version}
+ hadoop.version=${version}
+ hadoop-core.version=${version}
+ hadoop-hdfs.version=${version}
+ hadoop-mapred.version=${version}
+ }}}
+ 
+ The {{{resolvers}}} property tells Ivy to look in the local Maven artifact repository for versions of the Hadoop artifacts; if you don't set this, then only published JARs from the central repository will get picked up.
+ 
+ The version property, and its descendants, tell Hadoop which version of artifacts to create and use. Set this to something different from (ideally ahead of) what is being published, to ensure that your own artifacts are picked up.
+ 
+ Next, symlink this file to every Hadoop module. Now a change in the file gets picked up
by all three.
+ {{{
+ pushd common; ln -s ../build.properties build.properties; popd
+ pushd hdfs; ln -s ../build.properties build.properties; popd
+ pushd mapreduce; ln -s ../build.properties build.properties; popd
+ }}}
+ 
+ You are now all set up to build.
+ 
+ === Build Hadoop ===
+ 
+  1. In {{{common/}}} run {{{ant mvn-install}}}
+  1. In {{{hdfs/}}} run {{{ant mvn-install}}}
+  1. In {{{mapreduce/}}} run {{{ant mvn-install}}}
+ 
+ This Ant target not only builds the JAR files, it copies them to the local {{{${user.home}/.m2}}} directory, where they will be picked up by the "internal" resolver. You can check that this is taking place by running {{{ant ivy-report}}} on a project and seeing where it gets its dependencies.
+ 
+ '''Warning:''' it's easy for old JAR versions to get cached and picked up. You will notice this early if something in hadoop-hdfs or hadoop-mapreduce doesn't compile, but if you are unlucky things do compile and simply don't work, as your updates are not picked up. Run {{{ant clean-cache}}} to fix this.
+ 
+ By default, the trunks of the HDFS and MapReduce projects are set to grab the snapshot versions that get built and published into the Apache snapshot repository nightly. While this saves developers in these projects the complexity of having to build and publish the upstream artifacts themselves, it doesn't work if you want to make changes to things like hadoop-common. You need to make sure the local projects are picking up what's being built locally.
+ 
+ To check this in the hadoop-hdfs project, generate the Ivy dependency reports using the
internal resolver:
+ {{{
+ ant ivy-report -Dresolvers=internal
+ }}}
+ 
+ Then browse to the report page listed at the bottom of the process output, switch to the "common" tab, and look for the hadoop-common JAR. It should have a publication timestamp containing the date and time of your local build. For example, the string "20110211174419" means the date 2011-02-11 and the time 17:44:19. If an older version is listed, you probably have it cached in the ivy cache -you can fix this by removing everything from the org.apache corner of this cache.
+ 
+ {{{
+ rm -rf ~/.ivy2/cache/org.apache.hadoop
+ }}}
+ 
+ Rerun the {{{ivy-report}}} target and check that the publication date is current to verify
that the version is now up to date.
+ 
+ 
+ === Testing ===
+ 
+ Each project comes with lots of tests; run {{{ant test}}} to run them all, {{{ant test-core}}} for the core tests. If you have made changes to the build and tests fail, it may be that the tests never worked on your machine. Build and test the unmodified source first, then keep an eye on both the main source and any branch you make. A good way to do this is to give a Continuous Integration server such as Hudson this job: checking out, building and testing both branches.
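+ 
+ If the module's build file supports it (an assumption -check the {{{test}}} target in build.xml), you can also run a single test case, for example:
+ 
+ {{{
+ ant test -Dtestcase=TestDFSShell
+ }}}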
+ 
+ Remember, the way Git works, your machine's own repository is something that other machines can fetch from. So, in theory, you could set up a Hudson server on another machine (or a VM) and have it pull and test against your local code. You will need to run it on a separate machine to keep your own builds and tests from interfering with the Hudson runs.
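+ 
+ One way to make your desktop repository fetchable is git's own daemon; a sketch, with hypothetical paths and host names:
+ 
+ {{{
+ # on your desktop: serve every repository under ~/src read-only
+ git daemon --base-path=$HOME/src --export-all
+ # on the Hudson machine: clone over the git protocol
+ git clone git://mydesktop/hadoop-common
+ }}}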
+ 
