hadoop-mapreduce-user mailing list archives

From Drake민영근 <drake....@nexr.com>
Subject Re: rolling upgrade(2.4.1 to 2.6.0) problem
Date Tue, 28 Apr 2015 04:31:45 GMT
Hi,

IMHO, upgrading *with downtime* once 2.7.1 is out is the best option left.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Mon, Apr 27, 2015 at 5:46 PM, Nitin Pawar <nitinpawar432@gmail.com>
wrote:

> I had read somewhere that 2.7 has a lot of issues, so you should wait for 2.7.1,
> where most of them are being addressed.
>
> On Mon, Apr 27, 2015 at 2:14 PM, 조주일 <tjstory@kgrid.co.kr> wrote:
>
>>
>>
>> I think the heartbeat failures are caused by the nodes hanging.
>>
>> I found bug reports associated with this problem:
>>
>>
>>
>> https://issues.apache.org/jira/browse/HDFS-7489
>>
>> https://issues.apache.org/jira/browse/HDFS-7496
>>
>> https://issues.apache.org/jira/browse/HDFS-7531
>>
>> https://issues.apache.org/jira/browse/HDFS-8051
>>
>>
>>
>> These have been fixed in 2.7.
>>
>>
>>
>> I have no experience applying patches.
>>
>> And because the stability of 2.7 has not been confirmed yet, I cannot upgrade
>> to 2.7.
>>
>>
>>
>> What do you recommend in this situation?
>>
>>
>>
>> If I do apply the patches, how should I go about it?
>>
>> Can I patch without service downtime?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> *From:* "Drake민영근"<drake.min@nexr.com>
>> *To:* "user"<user@hadoop.apache.org>; "조주일"<tjstory@kgrid.co.kr>;
>> *Cc:*
>> *Sent:* 2015-04-24 (Fri) 17:41:59
>> *Subject:* Re: rolling upgrade(2.4.1 to 2.6.0) problem
>>
>>
>> Hi,
>>
>> I think you are limited by "max user processes". See this:
>> https://plumbr.eu/outofmemoryerror/unable-to-create-new-native-thread In
>> your case, the user cannot create more than 10240 processes/threads. In our
>> environment, the limit is more like 65000.
>>
>> I think it's worth a try. And if the HDFS datanode daemon's user is not
>> root, set the limit in a file under /etc/security/limits.d.
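>>
>> A minimal sketch of such a limits file, assuming the datanode runs as the
>> user "hdfs" (the file name and the values are only illustrative and should
>> be tuned for your cluster):
>>
>>     # /etc/security/limits.d/hdfs.conf  (hypothetical file name)
>>     # raise the process/thread and open-file limits for the datanode user
>>     hdfs    soft    nproc     65536
>>     hdfs    hard    nproc     65536
>>     hdfs    soft    nofile    102400
>>     hdfs    hard    nofile    102400
>>
>> Note that limits.d is applied via PAM at login, so the datanode has to be
>> restarted from a fresh session for the new limits to take effect.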
>>
>> Thanks.
>>
>> Drake 민영근 Ph.D
>> kt NexR
>>
>> On Fri, Apr 24, 2015 at 5:15 PM, 조주일 <tjstory@kgrid.co.kr> wrote:
>>
>> ulimit -a
>>
>> core file size          (blocks, -c) 0
>>
>> data seg size           (kbytes, -d) unlimited
>>
>> scheduling priority             (-e) 0
>>
>> file size               (blocks, -f) unlimited
>>
>> pending signals                 (-i) 62580
>>
>> max locked memory       (kbytes, -l) 64
>>
>> max memory size         (kbytes, -m) unlimited
>>
>> open files                      (-n) 102400
>>
>> pipe size            (512 bytes, -p) 8
>>
>> POSIX message queues     (bytes, -q) 819200
>>
>> real-time priority              (-r) 0
>>
>> stack size              (kbytes, -s) 10240
>>
>> cpu time               (seconds, -t) unlimited
>>
>> max user processes              (-u) 10240
>>
>> virtual memory          (kbytes, -v) unlimited
>>
>> file locks                      (-x) unlimited
>>
>>
>>
>> ------------------------------------------------------
>>
>> The Hadoop cluster was operating normally on version 2.4.1.
>>
>> The cluster has problems on version 2.6.
>>
>>
>>
>> For example:
>>
>> Slow BlockReceiver messages often appear in the logs:
>>
>> "org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver
>> write data to disk cost"
>>
>>
>>
>> When a datanode fails and under-replicated blocks occur,
>>
>> the heartbeat checks of many other nodes fail as well.
>>
>> So I stop all nodes and then start them all again.
>>
>> The cluster then returns to normal.
>>
>>
>>
>> In this regard, is there a difference between Hadoop 2.4 and 2.6?
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> *From:* "Drake민영근"<drake.min@nexr.com>
>> *To:* "user"<user@hadoop.apache.org>; "조주일"<tjstory@kgrid.co.kr>;
>> *Cc:*
>> *Sent:* 2015-04-24 (금) 16:58:46
>> *Subject:* Re: rolling upgrade(2.4.1 to 2.6.0) problem
>>
>> Hi,
>>
>> How about the ulimit settings of the user running the HDFS datanode?
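>>
>> For example, one way to check the limits the running datanode actually has
>> (a sketch; <datanode-pid> is a placeholder and the grep pattern is just for
>> convenience):
>>
>>     # find the DataNode JVM and inspect its effective limits
>>     jps | grep DataNode
>>     cat /proc/<datanode-pid>/limits | grep -Ei 'processes|open files'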
>>
>> Drake 민영근 Ph.D
>> kt NexR
>>
>> On Wed, Apr 22, 2015 at 6:25 PM, 조주일 <tjstory@kgrid.co.kr> wrote:
>>
>>
>>
>> I allocated 5 GB.
>>
>> I don't think OOM is the root cause.
>>
>>
>>
>> -----Original Message-----
>> *From:* "Han-Cheol Cho"<hancheol.cho@nhn-playart.com>
>> *To:* <user@hadoop.apache.org>;
>> *Cc:*
>> *Sent:* 2015-04-22 (Wed) 15:32:35
>> *Subject:* RE: rolling upgrade(2.4.1 to 2.6.0) problem
>>
>>
>> Hi,
>>
>>
>>
>> The first warning shows an out-of-memory error from the JVM.
>>
>> Did you give the DataNode daemons enough max heap memory?
>>
>> DN daemons use a max heap size of 1 GB by default, so if your DN requires
>> more than that, it will be in trouble.
>>
>> You can check the memory consumption of your DN daemons (e.g., with the top
>> command) and the memory allocated to them via the -Xmx option (e.g., with
>> jps -lmv).
>>
>> If the max heap size is too small, you can use the HADOOP_DATANODE_OPTS
>> variable (e.g., HADOOP_DATANODE_OPTS="-Xmx4g") to override it.
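>>
>> A sketch of where that would go, in hadoop-env.sh on each datanode (the 4g
>> value is only an example and should be sized to your nodes):
>>
>>     # hadoop-env.sh: append -Xmx last so it overrides the default heap
>>     export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -Xmx4g"
>>
>> The datanode daemon has to be restarted for the new heap size to apply.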
>>
>>
>>
>> Best wishes,
>>
>> Han-Cheol
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> *From:* "조주일"<tjstory@kgrid.co.kr>
>> *To:* <user@hadoop.apache.org>;
>> *Cc:*
>> *Sent:* 2015-04-22 (Wed) 14:54:16
>> *Subject:* rolling upgrade(2.4.1 to 2.6.0) problem
>>
>>
>>
>>
>> My cluster is:
>>
>> hadoop 2.4.1
>>
>> Capacity: 1.24 PB
>>
>> Used: 1.1 PB
>>
>> 16 datanodes
>>
>> Each node has a capacity of 65 TB, 96 TB, 80 TB, etc.
>>
>>
>>
>> I had to proceed with a rolling upgrade from 2.4.1 to 2.6.0.
>>
>> Upgrading one datanode takes about 40 minutes.
>>
>> Under-replicated blocks occur while the upgrade is in progress.
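>>
>> For reference, the per-datanode steps of the standard rolling-upgrade
>> procedure look roughly like this (a sketch; <DN_HOST:IPC_PORT> is a
>> placeholder):
>>
>>     hdfs dfsadmin -rollingUpgrade prepare
>>     hdfs dfsadmin -rollingUpgrade query        # repeat until ready to proceed
>>     hdfs dfsadmin -shutdownDatanode <DN_HOST:IPC_PORT> upgrade
>>     hdfs dfsadmin -getDatanodeInfo <DN_HOST:IPC_PORT>   # repeat until it stops responding
>>     # upgrade the datanode software, restart it, then move to the next node
>>     hdfs dfsadmin -rollingUpgrade finalize     # only after all nodes are upgraded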
>>
>>
>>
>> 10 nodes completed the upgrade to 2.6.0.
>>
>> At some point during the rolling upgrade of the remaining nodes, a problem
>> occurred.
>>
>> The heartbeats of many nodes (2.6.0 nodes only) failed.
>>
>>
>>
>> I changed the following properties, but it did not fix the problem (see the
>> hdfs-site.xml sketch just below):
>>
>> dfs.datanode.handler.count = 100 ---> 300, 400, 500
>>
>> dfs.datanode.max.transfer.threads = 4096 ---> 8000, 10000
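>>
>> These properties live in hdfs-site.xml on the datanodes; a sketch with one
>> of the value combinations I tried:
>>
>>     <!-- hdfs-site.xml -->
>>     <property>
>>       <name>dfs.datanode.handler.count</name>
>>       <value>300</value>
>>     </property>
>>     <property>
>>       <name>dfs.datanode.max.transfer.threads</name>
>>       <value>8000</value>
>>     </property>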
>>
>>
>>
>> My theory is:
>>
>> 1. Something causes a delay in the processing threads; I think it may be the
>> block replication between the different versions.
>>
>> 2. Because of that, many more handlers and xceivers become necessary.
>>
>> 3. That leads to the out-of-memory error, or some other problem arises on a
>> datanode.
>>
>> 4. The heartbeat fails, and the datanode dies.
>>
>>
>> I found the datanode errors below in the logs.
>>
>> However, I cannot determine the cause from them.
>>
>>
>>
>> My guess is that it is caused by block replication between the different
>> versions.
>>
>>
>>
>> Someone please help me!!
>>
>>
>>
>> DATANODE LOG
>>
>> --------------------------------------------------------------------------
>>
>> ### I observed a few thousand CLOSE_WAIT connections on the datanode.
>>
>>
>>
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
>> packet to mirror took 1207ms (threshold=300ms)
>>
>>
>>
>> 2015-04-21 22:46:01,772 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory.
>> Will retry in 30 seconds.
>>
>> java.lang.OutOfMemoryError: unable to create new native thread
>>
>>         at java.lang.Thread.start0(Native Method)
>>
>>         at java.lang.Thread.start(Thread.java:640)
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:145)
>>
>>         at java.lang.Thread.run(Thread.java:662)
>>
>> 2015-04-21 22:49:45,378 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver
>> count 8193 exceeds the limit of concurrent xcievers: 8192
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
>>
>>         at java.lang.Thread.run(Thread.java:662)
>>
>> 2015-04-22 01:01:25,632 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver
>> count 8193 exceeds the limit of concurrent xcievers: 8192
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
>>
>>         at java.lang.Thread.run(Thread.java:662)
>>
>> 2015-04-22 03:49:44,125 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> datanode-192.168.1.204:40010:DataXceiver error processing READ_BLOCK
>> operation  src: /192.168.2.174:45606 dst: /192.168.1.204:40010
>>
>> java.io.IOException: cannot find BPOfferService for
>> bpid=BP-1770955034-0.0.0.0-1401163460236
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1387)
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:470)
>>
>>         at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
>>
>>         at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>>
>>         at java.lang.Thread.run(Thread.java:662)
>>
>> 2015-04-22 05:30:28,947 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(192.168.1.203,
>> datanodeUuid=654f22ef-84b3-4ecb-a959-2ea46d817c19, infoPort=40075,
>> ipcPort=40020,
>> storageInfo=lv=-56;cid=CID-CLUSTER;nsid=239138164;c=1404883838982):Failed
>> to transfer BP-1770955034-0.0.0.0-1401163460236:blk_1075354042_1613403 to
>> 192.168.2.156:40010 got
>>
>> java.net.SocketException: Original Exception : java.io.IOException:
>> Connection reset by peer
>>
>>         at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
>>
>>         at
>> sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
>>
>>         at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
>>
>>         at
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:728)
>>
>>         at
>> org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2017)
>>
>>         at java.lang.Thread.run(Thread.java:662)
>>
>> Caused by: java.io.IOException: Connection reset by peer
>>
>>         ... 8 more
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Nitin Pawar
>
