From: Adam Faris
Subject: Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Date: Mon, 7 May 2012 14:37:00 +0000
Reply-To: common-user@hadoop.apache.org

Hi Austin,

I don't know about using CDH3, but we use distcp for moving data between different versions of Apache grids, and several things come to mind.

1) You should use the -i flag to ignore checksum differences on the blocks. I'm not 100% sure, but I believe hftp doesn't support checksums on the blocks as they go across the wire.

2) You should read from hftp but write to hdfs. Also make sure to check your port numbers. For example, I can read from hftp on port 50070 and write to hdfs on port 9000. You'll find the hftp port in hdfs-site.xml and the hdfs port in core-site.xml on Apache releases.

3) Do you have security (Kerberos) enabled on 0.20.205? Does CDH3 support security? If security is enabled on 0.20.205 and CDH3 does not support security, you will need to disable security on 0.20.205, because you cannot write from a secure grid to an unsecured one.

4) Use the -m flag to limit your mappers so you don't DDoS your network backbone.

5) Why isn't your vendor helping you with the data migration? :)

Otherwise something like this should get you going:

hadoop distcp -i -ppgu -log /tmp/mylog -m 20 hftp://mynamenode.grid.one:50070/path/to/my/src/data hdfs://mynamenode.grid.two:9000/path/to/my/dst

-- Adam
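For point 2, the ports can be confirmed straight from the configuration files. A minimal sketch, assuming a conventional conf directory; adjust the paths to wherever your Hadoop configuration actually lives:

  # hftp reads go through the namenode's HTTP port (dfs.http.address) on the source cluster
  grep -A 1 'dfs.http.address' /etc/hadoop/conf/hdfs-site.xml
  # hdfs:// writes go to the namenode's RPC port (fs.default.name) on the destination cluster
  grep -A 1 'fs.default.name' /etc/hadoop/conf/core-site.xml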
On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:

> things to check
>
> 1) when you launch distcp jobs, all the datanodes of the older hdfs are live and connected
> 2) when you launch distcp, no data is being written/moved/deleted in hdfs
> 3) you can use the -log option to log errors into a directory and -i to ignore errors
>
> also you can try using distcp with the hdfs protocol instead of hftp ... for more you can refer to
> https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
>
> if it failed there should be some error
>
> On Mon, May 7, 2012 at 4:44 PM, Austin Chungath wrote:
>
>> ok, that was a lame mistake.
>> $ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
>> I had spelled hdfs instead of "hftp"
>>
>> $ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
>> 12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
>> 12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
>> With failures, global counters are inaccurate; consider running with -i
>> Copy failed: java.io.IOException: Not supported
>> at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
>> at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
>>
>> Any idea why this error is coming?
>> I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3 (/user/hadoop).
>>
>> Thanks & Regards,
>> Austin
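The "Not supported" failure above is what you get when the destination is an hftp:// URI, since hftp is read-only; per Adam's point 2, the destination should be the new cluster's hdfs RPC address, and it is generally safest to run the job from the destination (CDH3u3) side. A minimal sketch, with illustrative hostnames, ports and paths:

  # read over hftp from the old namenode's HTTP port, write to the new namenode's RPC port
  hadoop distcp -i hftp://old-nn.example.com:50070/docs/index.html hdfs://new-nn.example.com:8020/user/hadoop/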
>> On Mon, May 7, 2012 at 3:57 PM, Austin Chungath wrote:
>>
>>> Thanks,
>>>
>>> So I decided to try and move using distcp.
>>>
>>> $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
>>> 12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
>>> 12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
>>> With failures, global counters are inaccurate; consider running with -i
>>> Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)
>>>
>>> I found that we can do distcp like above only if both are of the same hadoop version,
>>> so I tried:
>>>
>>> $ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
>>> 12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
>>> 12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy
>>>
>>> But this process seemed to hang at this stage. What might I be doing wrong?
>>>
>>> hftp://localhost:50070 is dfs.http.address of 0.20.205
>>> hdfs://localhost:60070 is dfs.http.address of cdh3u3
>>>
>>> Thanks and regards,
>>> Austin
>>>
>>> On Fri, May 4, 2012 at 4:30 AM, Michel Segel wrote:
>>>
>>>> Ok... So riddle me this...
>>>> I currently have a replication factor of 3.
>>>> I reset it to two.
>>>>
>>>> What do you have to do to get the replication factor of 3 down to 2?
>>>> Do I just try to rebalance the nodes?
>>>>
>>>> The point is that you are looking at a very small cluster.
>>>> You may want to start the new cluster with a replication factor of 2 and then, when the data is moved over, increase it to a factor of 3. Or maybe not.
>>>>
>>>> I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance and then repeat.
>>>>
>>>> Even though this is a dev cluster, the OP wants to retain the data.
>>>>
>>>> There are other options depending on the amount and size of new hardware. I mean, make one machine a RAID 5 machine and copy data to it, clearing off the cluster.
>>>>
>>>> If 8 TB was the amount of disk used, that would be about 2.67 TB of actual data at a replication factor of 3. Let's say 3 TB. Going RAID 5, how much disk is that? So you could fit it on one machine, depending on hardware, or maybe 2 machines... Now you can rebuild the initial cluster and then move the data back. Then rebuild those machines. Lots of options... ;-)
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On May 3, 2012, at 11:26 AM, Suresh Srinivas wrote:
>>>>
>>>>> This is probably a more relevant question for the CDH mailing lists. That said, what Edward is suggesting seems reasonable. Reduce the replication factor, decommission some of the nodes, create a new cluster with those nodes and do distcp.
>>>>>
>>>>> Could you share with us the reasons you want to migrate from Apache 205?
>>>>>
>>>>> Regards,
>>>>> Suresh
>>>>>
>>>>> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>>>>
>>>>>> Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said, two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to RF=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving stuff.
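Lowering the replication factor as Edward suggests is a one-liner, and the namenode then removes the excess replicas in the background. A minimal sketch, assuming you want to apply it recursively from the root; the path is illustrative, and you may prefer to target only the large data directories:

  # drop the replication factor to 2 for everything under /
  hadoop dfs -setrep -R 2 /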
>>>>>>
>>>>>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <michael_segel@hotmail.com> wrote:
>>>>>>
>>>>>>> Ok... When you get your new hardware...
>>>>>>>
>>>>>>> Set up one server as your new NN, JT, SN.
>>>>>>> Set up the others as DNs.
>>>>>>> (Cloudera CDH3u3)
>>>>>>>
>>>>>>> On your existing cluster...
>>>>>>> Remove your old log files, temp files on HDFS, anything you don't need.
>>>>>>> This should give you some more space.
>>>>>>> Start copying some of the directories/files to the new cluster.
>>>>>>> As you gain space, decommission a node, rebalance, add the node to the new cluster...
>>>>>>>
>>>>>>> It's a slow process.
>>>>>>>
>>>>>>> Should I remind you to make sure you up your bandwidth setting, and to clean up the hdfs directories when you repurpose the nodes?
>>>>>>>
>>>>>>> Does this make sense?
>>>>>>>
>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>
>>>>>>> Mike Segel
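The decommission/rebalance loop Mike describes maps onto a handful of commands. A minimal sketch, assuming the 0.20-era property names; the exclude-file path and hostname are illustrative, and the bandwidth cap (dfs.balance.bandwidthPerSec, in bytes per second) is presumably the "bandwidth setting" referred to above:

  # 1) list the node to retire in the file pointed to by dfs.hosts.exclude in hdfs-site.xml
  echo datanode07.example.com >> /etc/hadoop/conf/dfs.exclude
  # 2) tell the namenode to re-read the include/exclude lists and start decommissioning
  hadoop dfsadmin -refreshNodes
  # 3) once the node is decommissioned and repurposed, rebalance the remaining datanodes
  hadoop balancer -threshold 10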
>>>>>>>
>>>>>>> On May 3, 2012, at 5:46 AM, Austin Chungath wrote:
>>>>>>>
>>>>>>>> Yeah I know :-)
>>>>>>>> and this is not a production cluster ;-) and yes there is more hardware coming :-)
>>>>>>>>
>>>>>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <michael_segel@hotmail.com> wrote:
>>>>>>>>
>>>>>>>>> Well, you've kind of painted yourself into a corner...
>>>>>>>>> Not sure why you didn't get a response from the Cloudera lists, but it's a generic question...
>>>>>>>>>
>>>>>>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
>>>>>>>>> And please tell me you've already ordered more hardware... Right?
>>>>>>>>>
>>>>>>>>> And please tell me this isn't your production cluster...
>>>>>>>>>
>>>>>>>>> (Strong hint to Strata and Cloudera... You really want to accept my upcoming proposal talk... ;-)
>>>>>>>>>
>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>
>>>>>>>>> Mike Segel
>>>>>>>>>
>>>>>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath wrote:
>>>>>>>>>
>>>>>>>>>> Yes. This was first posted on the Cloudera mailing list. There were no responses.
>>>>>>>>>>
>>>>>>>>>> But this is not related to Cloudera as such.
>>>>>>>>>>
>>>>>>>>>> cdh3 is based on apache hadoop 0.20. My data is in apache hadoop 0.20.205.
>>>>>>>>>>
>>>>>>>>>> There is an upgrade namenode option when we are migrating to a higher version, say from 0.20 to 0.20.205,
>>>>>>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3).
>>>>>>>>>> Is this possible?
>>>>>>>>>>
>>>>>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <prash1784@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so would not know much, but you might find some help moving this to the Cloudera mailing list.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <austincv@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There is only one cluster. I am not copying between clusters.
>>>>>>>>>>>>
>>>>>>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage capacity and about 8 TB of data.
>>>>>>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data?
>>>>>>>>>>>>
>>>>>>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of free space.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> you can actually look at distcp
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> but this means that you have two different sets of clusters available to do the migration
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <austincv@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the suggestions.
>>>>>>>>>>>>>> My concern is that I can't actually copyToLocal from the dfs because the data is huge.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to 0.20.205, I can do a namenode upgrade. I don't have to copy data out of dfs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But here I have Apache hadoop 0.20.205 and I want to use CDH3 now, which is based on 0.20.
>>>>>>>>>>>>>> Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any idea how I can achieve what I am trying to do?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can think of the following options:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) write simple get and put code which gets the data out of the old DFS and loads it into the new one
>>>>>>>>>>>>>>> 2) see if distcp between both versions is compatible
>>>>>>>>>>>>>>> 3) this is what I had done (and my data was hardly a few hundred GB): a dfs -copyToLocal, and then in the new grid a copyFromLocal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <austincv@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
>>>>>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205.
>>>>>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on 0.20.205?
>>>>>>>>>>>>>>>> What is the best practice/technique to do this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>> Austin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>
> --
> Nitin Pawar
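For completeness, option 3 in Nitin's list (copy out and back in) looks roughly like the sketch below. It is only practical when the data fits on local or staging disk, which is exactly the constraint Austin runs into here; the paths are illustrative:

  # on the 0.20.205 cluster: stage the data onto local disk
  hadoop fs -copyToLocal /user/hadoop/mydata /staging/mydata
  # on the CDH3u3 cluster: load it back into HDFS
  hadoop fs -copyFromLocal /staging/mydata /user/hadoop/mydata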