From common-commits-return-16119-apmail-hadoop-common-commits-archive=hadoop.apache.org@hadoop.apache.org Sat Nov 26 20:12:06 2011
Mailing-List: contact common-commits-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-dev@hadoop.apache.org
Delivered-To: mailing list common-commits@hadoop.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki
To: Apache Wiki
Date: Sat, 26 Nov 2011 20:11:40 -0000
Message-ID: <20111126201140.96210.36408@eos.apache.org>
Subject: [Hadoop Wiki] Update of "GitAndHadoop" by SteveLoughran
Auto-Submitted: auto-generated
X-Virus-Checked: Checked by ClamAV
on apache.org

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "GitAndHadoop" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/GitAndHadoop?action=diff&rev1=15&rev2=16

Comment:
add a lot more on managing branches between apache and github repositories

   * GitHub provide some good lessons on git at [[http://learn.github.com]]
   * Apache serves up read-only Git versions of their source at [[http://git.apache.org/]]. People cannot commit changes with Git; for that the patches need to be applied to the SVN repositories

- == Before you begin ==
-  1. You need a copy of git on your system. Some IDEs ship with Git support; this page assumes you are using the command line.
-  1. You need a copy of Ant 1.7+ on your system for the builds themselves.
-  1. You need to be online for your first checkout and build, and any subsequent build which needs to download new artifacts from the central JAR repositories.
-  1. You need to set Ant up so that it works with any proxy you have. This is documented by [[http://ant.apache.org/manual/proxy.html |the ant team]].
+
+
+ This page tells you how to work with Git. See HowToContribute for instructions on building and testing Hadoop.
+
+ == Key Git Concepts ==
+ The key concepts of Git.
+
+  * Git doesn't store changes, it snapshots the entire source tree. Good for fast switch and rollback, bad for binaries. (as an enhancement, if a file hasn't changed, it doesn't re-replicate it).
+  * Git stores all "events" as SHA1 checksummed objects; you have deltas, tags and commits, where a commit describes the status of items in the tree.
+  * Git is very branch centric; you work in your own branch off local or central repositories
+  * You had better enjoy merging.

  == Checking out the source ==

+ You need a copy of git on your system.
Some IDEs ship with Git support; this page assumes you are using the command line.
+
- The first step is to create your own Git repository from the Apache repository. The hadoop subprojects (common, HDFS, and MapReduce) live inside a combined repo called `hadoop-common.git`.
+ Clone a local Git repository from the Apache repository. The Hadoop subprojects (common, HDFS, and MapReduce) live inside a combined repository called `hadoop-common.git`.

  {{{
  git clone git://git.apache.org/hadoop-common.git
  }}}
- The total download is well over 100MB, so the initial checkout process works best when the network is fast. Once downloaded, Git works offline.
+
+ The total download is well over 100MB, so the initial checkout process works best when the network is fast. Once downloaded, Git works offline -though you will need to perform your initial builds online so that the build tools (Maven, Ivy &c) can download dependencies.

  == Grafts for complete project history ==

- The Hadoop project has undergone some movement in where its component parts have been versioned. Because of that, commands like `git log --follow` need to have a little help. To graft the history back together into a coherent whole, insert the following contents into `hadoop-common/.git/info/grafts`:
+ The Hadoop project has undergone some movement in where its component parts have been versioned. Because of that, commands like `git log --follow` needs to have a little help. To graft the history back together into a coherent whole, insert the following contents into `hadoop-common/.git/info/grafts`:

  {{{
  5128a9a453d64bfe1ed978cf9ffed27985eeef36 6c16dc8cf2b28818c852e95302920a278d07ad0c
@@ -42, +51 @@

   1. Create a GitHub login at http://github.com/ ; Add your public SSH keys
   1. Go to http://github.com/apache and search for the Hadoop and other Apache projects you want (avro is handy alongside the others)
-  1. For each project, fork.
This gives you your own repository URL which you can then clone locally with {{{git clone}}}
+  1. For each project, fork in the GitHub UI. This gives you your own repository URL which you can then clone locally with {{{git clone}}}
   1. For each patch, branch.

- At the time of writing (December 2009), GitHub was updating its copy of the Apache repositories every hour. As the Apache repositories were updating every 15 minutes, provided these frequencies are retained, a GitHub-fork derived version will be at worst 1 hour and 15 minutes behind the ASF's SVN repository. If you are actively developing on Hadoop, especially committing code into the SVN repository, that is too long -work off the Apache repositories instead.
+ At the time of writing (December 2009), GitHub was updating its copy of the Apache repositories every hour. As the Apache repositories were updating every 15 minutes, provided these frequencies are retained, a GitHub-fork derived version will be at worst 1 hour and 15 minutes behind the ASF's SVN repository. If you are actively developing on Hadoop, especially committing code into the SVN repository, that is too long -work off the Apache repositories instead.

- == Building the source ==
+  1. Clone the read-only repository from GitHub (their recommendation) or from Apache (the ASF's recommendation)
+  1. In that clone, rename that repository "apache": {{{git remote rename origin apache}}}
+  1. Log in to [http://github.com]
+  1. Create a new repository (e.g. hadoop-fork)
+  1. In the existing clone, add the new repository:
+  {{{git remote add -f github git@github.com:MYUSERNAMEHERE/hadoop-common.git}}}

- You need to tell all the Hadoop modules to get a local JAR of the bits of Hadoop they depend on. You do this by making sure your Hadoop version does not match anything public, and to use the "internal" repository of locally published artifacts.
+ This gives you a local repository with two remote repositories: "apache" and "github". Apache has the trunk branch, which you can update whenever you want to get the latest ASF version:

- === Create a build.properties file ===
-
- Create a {{{build.properties}}} file. Do not do this in the git directories, do it one up. This is going to be a shared file. This article assumes you are using Linux or a different Unix, incidentally.
-
- Make the file something like this:
  {{{
+ git checkout trunk
+ git pull apache
- #this is essential
- resolvers=internal
- #you can increment this number as you see fit
- version=0.22.0-alpha-1
- project.version=${version}
- hadoop.version=${version}
- hadoop-core.version=${version}
- hadoop-hdfs.version=${version}
- hadoop-mapred.version=${version}
  }}}

- The {{{resolvers}}} property tells Ivy to look in the local maven artifact repository for versions of the Hadoop artifacts; if you don't set this then only published JARs from the central repository will get picked up.
+ Your own branches can be merged with trunk, and pushed out to GitHub. To generate patches for submitting as JIRA patches, check everything in to your specific branch, merge that with (a recently pulled) trunk, then diff the two:
+ {{{ git diff --no-prefix trunk > ../hadoop-patches/HADOOP-XYX.patch }}}

- The version property, and descendants, tells Hadoop which version of artifacts to create and use. Set this to something different (ideally ahead of) what is being published, to ensure that your own artifacts are picked up.
+ If you are working deep in the code it's not only convenient to have a directory full of patches to the JIRA issues, it's convenient to have that directory a git repository that is pushed to a remote server, such as [[https://github.com/steveloughran/hadoop-patches|this example]]. Why? It helps you move patches from machine to machine without having to do all the updating and merging.
From a pure-git perspective this is wrong: it loses history, but for a mixed git/svn workflow it doesn't matter so much.

- Next, symlink this file to every Hadoop module. Now a change in the file gets picked up by all three.
- {{{
- pushd common; ln -s ../build.properties build.properties; popd
- pushd hdfs; ln -s ../build.properties build.properties; popd
- pushd mapreduce; ln -s ../build.properties build.properties; popd
- }}}
-
- You are now all set up to build.
-
- === Build Hadoop ===
-
-  1. In {{{common/}}} run {{{ant mvn-install}}}
-  1. In {{{hdfs/}}} run {{{ant mvn-install}}}
-  1. In {{{mapreduce/}}} run {{{ant mvn-install}}}
-
- This Ant target not only builds the JAR files, it copies it to the local {{{${user.home}/.m2}}} directory, where it will be picked up by the "internal" resolver. You can check that this is taking place by running {{{ant ivy-report}}} on a project and seeing where it gets its dependencies.
-
- '''Warning:''' it's easy for old JAR versions to get cached and picked up. You will notice this early if something in hadoop-hdfs or hadoop-mapreduce doesn't compile, but if you are unlucky things do compile, just not work as your updates are not picked up. Run {{{ant clean-cache}}} to fix this.
-
- By default, the trunk of the HDFS and mapreduce projects are set to grab the snapshot versions that get built and published into the Apache snapshot repository nightly. While this saves developers in these projects the complexity of having to build and publish the upstream artifacts themselves, it doesn't work if you do want to make changes to things like hadoop-common. You need to make sure the local projects are picking up what's being built locally.
-
- To check this in the hadoop-hdfs project, generate the Ivy dependency reports using the internal resolver:
- {{{
- ant ivy-report -Dresolvers=internal
- }}}
-
- Then browse to the report page listed at the bottom of the process, switch to the "common" tab, and look for hadoop-common JAR. It should have a publication timestamp which contains the date and time of your local build. For example, the string "20110211174419" means the date 2011-02-11 and the time of 17:44:19. If an older version is listed, you probably have it cached in the ivy cache -you can fix this by removing everything from the org.apache corner of this cache.
-
- {{{
- rm -rf ~/.ivy2/cache/org.apache.hadoop
- }}}
-
- Rerun the {{{ivy-report}}} target and check that the publication date is current to verify that the version is now up to date.
-
-
- === Testing ===
-
- Each project comes with lots of tests; run {{{ant test}}} to run them all, {{{ant test-core}}} for the core tests. If you have made changes to the build and tests fail, it may be that the tests never worked on your machine. Build and test the unmodified source first. Then keep an eye on both the main source and any branch you make. A good way to do this is to give a Continuous Integration server such as Hudson this job: checking out, building and testing both branches.
-
- Remember, the way Git works, your machine's own repository is something that other machines can fetch from. So in theory, you could set up a Hudson server on another machine (or VM) and have it pull and test against your local code. You will need to run it on a separate machine to avoid your own builds and tests from interfering with the Hudson runs.

  == Branching ==

- Git makes it easy to branch. The recommended process for working with Apache projects is: one branch per JIRA issue. That makes it easy to isolate development and track the development of each change.
It does mean if you have your own branch that you release, one that merges in more than one issue, you have to invest some effort in merging everything in. Try not to make changes in different branches that are hard to merge, and learn your way round the git rebase command to handle changes across branches.
+ Git makes it easy to branch. The recommended process for working with Apache projects is: one branch per JIRA issue. That makes it easy to isolate development and track the development of each change. It does mean if you have your own branch that you release, one that merges in more than one issue, you have to invest some effort in merging everything in. Try not to make changes in different branches that are hard to merge, and learn your way round the git rebase command to handle changes across branches. Better yet: do not use rebase once you have created a chain of branches that each depend on each other.
-
- One thing you need to look out for is making sure that you are building the different Hadoop projects together; that you have not published on one branch and built on another. This is because both Ivy and Maven publish artifacts to shared repository cache directories.
-
-  1. Don't be afraid to {{{rm -rf ~/.m2/repository/org/apache/hadoop}}} and {{{rm -rf ~/.ivy2/cache/org.apache.hadoop}}} to remove local copies of artifacts.
-  1. Use different version properties in different branches to ensure that different versions are not accidentally picked up
-  1. Avoid using {{{latest.version}}} as the version marker in Ivy, as that gives you the last built.
-  1. Don't build/test different branches simultaneously, such as by running Hudson on your local machine while developing on the console. The trick here is to bring up Hudson in a virtual machine, running against the Git repository on your desktop. Git lets you do this, which lets you run Hudson against your private branch.

  === Creating the branch ===

  Creating a branch is quick and easy
  {{{
- #start off in your trunk
+ #start off in the apache trunk
  git checkout trunk
  #create a new branch from trunk
  git branch HDFS-775
@@ -146, +102 @@
  Assuming your trunk repository is in sync with the Apache projects, you can use {{{git diff}}} to create a patch file.
  First, have a directory for your patches:
  {{{
- mkdir ../outgoing
+ mkdir ../hadoop-patches
  }}}
  Then generate a patch file listing the differences between your trunk and your branch
  {{{
- git diff --no-prefix trunk > ../outgoing/HDFS-775-1.patch
+ git diff --no-prefix trunk > ../hadoop-patches/HDFS-775-1.patch
  }}}
  The patch file is an extended version of the unified patch format used by other tools; type {{{git help diff}}} to get more details on it. Here is what the patch file in this example looks like
  {{{
@@ -183, +139 @@
  }}}
  It is essential that patches for JIRA issues are generated with the {{{--no-prefix}}} option. Without that an extra directory path is listed, and the patches can only be applied with a {{{patch -p1}}} call, ''which Hudson does not know to do''. If you want your patches to take, this is what you have to do. You can of course test this yourself by using a command like {{{patch -p0 < ../outgoing/HDFS-775.1}}} in a copy of the SVN source tree to test that your patch takes.

+ === Updating your patch ===
- If you have checked in your patch, then you need to refer to the patch by name (SHA1 checksum), and that of the preceding patch. If it was the last patch and nothing else changed, this is easy.
- {{{
- git diff --no-prefix HEAD~1 HEAD
- }}}

+ If your patch is not immediately accepted, do not be offended: it happens to us all. It introduces a problem: your branches become out of date.
You need to check out the latest apache version, merge your branches with it, and then push the changes back to github
+
+ {{{
+ git checkout trunk
+ git pull apache
+ git checkout mybranch
+ git merge trunk
+ git push github mybranch
+ }}}
+
+ Your branch is up to date, and new diffs can be created and attached to patches.
+
+ === Deriving Branches from Branches ===
+
+ If you have one patch that depends upon another, you should have a separate branch for each one. Simply merge the changes from the first branch into the second, so that it is always kept up to date with the first changes. To create a patch file for submission as a JIRA patch, do a diff between the two branches, not against trunk.
+
+ '''do not play with rebasing once you start doing this as you will make merging a nightmare'''
+
+ === What to do when your patch is committed ===
+
+ Once your patch is committed into SVN, you do not need the branch any more. You can delete it straight away, but it is safer to verify the patch is completely merged in.
+
+ Pull down the latest release and verify that the patch branch is synchronized:
+
+ {{{
+ git checkout trunk
+ git pull apache
+ git checkout mybranch
+ git merge trunk
+ git diff trunk
+ }}}
+
+ The output of the last command should be nothing: the two branches should be identical. You can then prove to git that this is true by switching back to the trunk branch and merging in the branch, an operation which will not change the source tree, but update Git's branch graph.
+
+ {{{
+ git checkout trunk
+ git merge mybranch
+ }}}
+
+ Now you can delete the branch without being warned by git
+ {{{
+ git branch -d mybranch
+ }}}
+
+ Finally, propagate that deletion to your private github repository
+ {{{
+ git push github :mybranch
+ }}}
+
+ This odd syntax says "push nothing to github/mybranch".
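The push syntax above follows git's general `<src>:<dst>` refspec form: with an empty source, nothing is pushed to the destination branch, which deletes it. The following is a minimal shell sketch of that idea, not part of the wiki page. The helper names `delete_refspec` and `cleanup_commands` are hypothetical; the remote name `github` and branch name `mybranch` are taken from the text. It only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: build the refspec that deletes a remote branch.
# A refspec has the form <src>:<dst>; an empty <src> means "push nothing".
delete_refspec() {
    printf ':%s\n' "$1"
}

# Hypothetical helper: print (not run) the cleanup commands described in
# this section for a given remote name and branch name.
cleanup_commands() {
    remote=$1
    branch=$2
    printf 'git checkout trunk\n'
    printf 'git merge %s\n' "$branch"
    printf 'git branch -d %s\n' "$branch"
    printf 'git push %s %s\n' "$remote" "$(delete_refspec "$branch")"
}

# Review the commands for the example branch used in the text.
cleanup_commands github mybranch
```

Git also accepts `git push github --delete mybranch` as an equivalent and more readable spelling of the same deletion.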
+
+
+ == Building with a Git repository ==
+
+ ''The information below this line is relevant for versions of Hadoop before 0.23.x, and should be considered obsolete for later versions. It is probably out of date for Hadoop 0.22 as well.''
+
+ == Building the source ==
+
+ You need to tell all the Hadoop modules to get a local JAR of the bits of Hadoop they depend on. You do this by making sure your Hadoop version does not match anything public, and to use the "internal" repository of locally published artifacts.
+
+ === Create a build.properties file ===
+
+ Create a {{{build.properties}}} file. Do not do this in the git directories, do it one up. This is going to be a shared file. This article assumes you are using Linux or a different Unix, incidentally.
+
+ Make the file something like this:
+ {{{
+ #this is essential
+ resolvers=internal
+ #you can increment this number as you see fit
+ version=0.22.0-alpha-1
+ project.version=${version}
+ hadoop.version=${version}
+ hadoop-core.version=${version}
+ hadoop-hdfs.version=${version}
+ hadoop-mapred.version=${version}
+ }}}
+
+ The {{{resolvers}}} property tells Ivy to look in the local maven artifact repository for versions of the Hadoop artifacts; if you don't set this then only published JARs from the central repository will get picked up.
+
+ The version property, and descendants, tells Hadoop which version of artifacts to create and use. Set this to something different (ideally ahead of) what is being published, to ensure that your own artifacts are picked up.
+
+ Next, symlink this file to every Hadoop module. Now a change in the file gets picked up by all three.
+ {{{
+ pushd common; ln -s ../build.properties build.properties; popd
+ pushd hdfs; ln -s ../build.properties build.properties; popd
+ pushd mapreduce; ln -s ../build.properties build.properties; popd
+ }}}
+
+ You are now all set up to build.
+
+ === Build Hadoop ===
+
+  1. In {{{common/}}} run {{{ant mvn-install}}}
+  1. In {{{hdfs/}}} run {{{ant mvn-install}}}
+  1. In {{{mapreduce/}}} run {{{ant mvn-install}}}
+
+ This Ant target not only builds the JAR files, it copies it to the local {{{${user.home}/.m2}}} directory, where it will be picked up by the "internal" resolver. You can check that this is taking place by running {{{ant ivy-report}}} on a project and seeing where it gets its dependencies.
+
+ '''Warning:''' it's easy for old JAR versions to get cached and picked up. You will notice this early if something in hadoop-hdfs or hadoop-mapreduce doesn't compile, but if you are unlucky things do compile, just not work as your updates are not picked up. Run {{{ant clean-cache}}} to fix this.
+
+ By default, the trunk of the HDFS and mapreduce projects are set to grab the snapshot versions that get built and published into the Apache snapshot repository nightly. While this saves developers in these projects the complexity of having to build and publish the upstream artifacts themselves, it doesn't work if you do want to make changes to things like hadoop-common. You need to make sure the local projects are picking up what's being built locally.
+
+ To check this in the hadoop-hdfs project, generate the Ivy dependency reports using the internal resolver:
+ {{{
+ ant ivy-report -Dresolvers=internal
+ }}}
+
+ Then browse to the report page listed at the bottom of the process, switch to the "common" tab, and look for hadoop-common JAR. It should have a publication timestamp which contains the date and time of your local build. For example, the string "20110211174419" means the date 2011-02-11 and the time of 17:44:19. If an older version is listed, you probably have it cached in the ivy cache -you can fix this by removing everything from the org.apache corner of this cache.
+
+ {{{
+ rm -rf ~/.ivy2/cache/org.apache.hadoop
+ }}}
+
+ Rerun the {{{ivy-report}}} target and check that the publication date is current to verify that the version is now up to date.
+
+
+ === Testing ===
+
+ Each project comes with lots of tests; run {{{ant test}}} to run them all, {{{ant test-core}}} for the core tests. If you have made changes to the build and tests fail, it may be that the tests never worked on your machine. Build and test the unmodified source first. Then keep an eye on both the main source and any branch you make. A good way to do this is to give a Continuous Integration server such as Hudson this job: checking out, building and testing both branches.
+
+ Remember, the way Git works, your machine's own repository is something that other machines can fetch from. So in theory, you could set up a Hudson server on another machine (or VM) and have it pull and test against your local code. You will need to run it on a separate machine to avoid your own builds and tests from interfering with the Hudson runs.
+
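The publication timestamp described in the build section packs a date and time into 14 digits (`20110211174419` for 2011-02-11 17:44:19). As a minimal sketch, and assuming the value always has that yyyyMMddHHmmss layout, a hypothetical shell helper `format_ivy_timestamp` (not part of Ant or Ivy) can make such a value easier to compare against the time of a local build:

```shell
#!/bin/sh
# Hypothetical helper (not part of Ant or Ivy): turn a 14-digit
# yyyyMMddHHmmss publication timestamp into a readable date and time.
format_ivy_timestamp() {
    ts=$1
    # Reject empty input and anything containing a non-digit.
    case $ts in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Reject input that is not exactly 14 digits long.
    [ "${#ts}" -eq 14 ] || return 1
    # Split into year, month, day, hour, minute, second.
    printf '%s\n' "$ts" |
        sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/'
}

format_ivy_timestamp 20110211174419   # prints 2011-02-11 17:44:19
```

The helper only reformats the string; it does not check that the digits form a valid calendar date.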