hadoop-general mailing list archives

From Evert Lammerts <Evert.Lamme...@sara.nl>
Subject RE: Stability issue - dead DN's
Date Fri, 13 May 2011 21:14:47 GMT
Hi Mike,

Thanks for trying to help out.

I had a talk with our networking guys this afternoon. According to them (and this is way out
of my area of expertise, so excuse any mistakes) multiple interfaces shouldn't be a problem.
We could set up a nameserver to resolve hostnames to addresses in our private space when the
request comes from one of the nodes, and route this traffic over a single interface. Any other
request resolves to an address in the public space, which is bound to another interface.
In our current setup we're not even resolving hostnames in our private address space through
a nameserver - we do it with an ugly hack in /etc/hosts. And it seems to work alright.
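For the curious, the /etc/hosts hack is nothing more than static entries like the following on every node (the hostnames here are made up for illustration; our real names differ), so that cluster names resolve to private interconnect addresses instead of going through public DNS:

```text
# /etc/hosts on each node (illustrative hostnames, not our real ones)
127.0.0.1        localhost
192.168.28.210   node01.cluster.internal  node01
192.168.28.211   node02.cluster.internal  node02
192.168.28.214   node05.cluster.internal  node05
```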

Having said that, our problems are still not completely gone even after adjusting the maximum
allowed RAM for tasks - although things are a lot better. While writing this mail, three out
of five DNs were marked as dead. There is still some swapping going on, but the cores are
not spending any time in I/O wait, so that shouldn't be the cause of anything. See below for
a trace from a dead DN - any thoughts are appreciated!
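For context, the RAM adjustment I mentioned was along these lines in mapred-site.xml (the values here are illustrative, not our exact settings): cap the per-task JVM heap and the number of concurrent task slots per node so that their sum stays below physical RAM:

```xml
<!-- mapred-site.xml: illustrative values, not our exact settings -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value> <!-- max heap per task JVM -->
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- concurrent map slots per node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- concurrent reduce slots per node -->
</property>
```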

Cheers,
Evert

2011-05-13 23:13:27,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_-9131821326787012529_2915672 src: /192.168.28.211:60136 dest: /192.168.28.214:50050 of
size 382425
2011-05-13 23:13:27,915 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_-9132067116195286882_130888 java.io.EOFException: while trying
to read 3744913 bytes
2011-05-13 23:13:27,925 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:35139, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001437_0,
offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 6254000
2011-05-13 23:13:28,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_-9149862728087355005_3793421 src: /192.168.28.210:41197 dest: /192.168.28.214:50050 of
size 245767
2011-05-13 23:13:28,033 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9132067116195286882_130888
unfinalized and removed. 
2011-05-13 23:13:28,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9132067116195286882_130888
received exception java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:28,033 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 3744913 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:32910, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001443_0,
offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4323000
2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:35138, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001440_0,
offset: 197120, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 5573000
2011-05-13 23:13:28,159 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.212:38574, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001444_0,
offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 16939000
2011-05-13 23:13:28,209 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_-9123390874940601805_2898225 src: /192.168.28.210:44227 dest: /192.168.28.214:50050 of
size 300441
2011-05-13 23:13:28,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.213:42364, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001451_0,
offset: 198656, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 5291000
2011-05-13 23:13:28,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:32930, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0,
offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-1800696633107072247_4099834,
duration: 5099000
2011-05-13 23:13:28,256 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.213:42363, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001458_0,
offset: 199680, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4945000
2011-05-13 23:13:28,257 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:35137, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0,
offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4159000
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_-9140444589483291821_3585975 java.io.EOFException: while trying
to read 100 bytes
2011-05-13 23:13:28,258 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9140444589483291821_3585975
unfinalized and removed. 
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9140444589483291821_3585975
received exception java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,259 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 100 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001441_0,
offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-5819719631677148140_4098274,
duration: 5625000
2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001438_0,
offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4473000
2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):
Exception writing block blk_-9150014886921014525_2267869 to mirror 192.168.28.213:50050
java.io.IOException: The stream is closed
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)

2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001432_0,
offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_405051931214094755_4098504,
duration: 5597000
2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_-9150014886921014525_2267869 src: /192.168.28.211:49208 dest: /192.168.28.214:50050 of
size 3033173
2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_-9144765354308563975_3310572 src: /192.168.28.211:51592 dest: /192.168.28.214:50050 of
size 242383

________________________________________
From: Segel, Mike [msegel@navteq.com]
Sent: Friday, May 13, 2011 2:36 PM
To: general@hadoop.apache.org
Cc: <cdh-user@cloudera.org>; <general@hadoop.apache.org>
Subject: Re: Stability issue - dead DN's

Bonded will work but you may not see the performance you would expect. If you need more than
1 GbE, go 10GbE: less headache and even more headroom.

Multiple interfaces won't work. Or I should say, they didn't work in past releases.
If you think about it, clients have to connect to each node, so having two interfaces and
trying to manage them makes no sense.

Add to this trying to manage it in DNS... why make more work for yourself?
Going from memory... it looked like your rDNS had to match your hostnames, so your internal
interfaces had to match hostnames and you ended up with an inverted network.

If you draw out your network topology you end up with a ladder.
You would be better off (IMHO) creating a subnet where only your edge servers are dual-NIC'd.
But then, if your cluster is for development, your PCs can't be used as clients...

Does this make sense?


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:57 AM, "Evert Lammerts" <Evert.Lammerts@sara.nl> wrote:

> Hi Mike,
>
>> You really really don't want to do this.
>> Long story short... It won't work.
>
> Can you elaborate? Are you talking about the bonded interfaces, or about having a separate
> network for interconnects and an external network? What can go wrong there?
>
>>
>> Just a suggestion... You don't want anyone on your cluster itself. They
>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>> cluster has a single network to worry about.
>
> That's our current setup. We have a single headnode that is used as a SPOE. However,
> I'd like to change that on our future production system. We want to implement Kerberos for
> authentication and let users interact with the cluster from their own machines. This would
> enable them to submit jobs from their local IDE. My understanding is that the only way to
> do this is by opening up the Hadoop ports to the world: if people interact with HDFS they
> need to be able to reach all nodes, right? What would be the argument against this?
>
> Cheers,
> Evert
>
>>
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw@apache.org> wrote:
>>
>>>
>>>
>>>
>>>
>>>>> * a 2x1GE bonded network interface for interconnects
>>>>> * a 2x1GE bonded network interface for external access
>>>
>>>   Multiple NICs on a box can sometimes cause big performance
>> problems with Hadoop.  So watch your traffic carefully.
>>>
>>>
>>>


