hadoop-general mailing list archives

From Evert Lammerts <Evert.Lamme...@sara.nl>
Subject RE: Stability issue - dead DN's
Date Sat, 14 May 2011 20:08:01 GMT
Just to check: the NN gives back hostnames of DNs to the client when getting or putting data,
and not IP addresses, right?
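
One way to check from the client side is to print the block locations the NN hands back for
an existing file. A minimal sketch (the path is a placeholder; the cluster config is picked
up from the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/some/existing/file"); // placeholder
    FileStatus stat = fs.getFileStatus(p);
    // getNames() returns the host:port pairs the NN hands out per replica;
    // getHosts() would give just the host part
    for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
      System.out.println(java.util.Arrays.toString(loc.getNames()));
    }
  }
}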

Cheers,
Evert

________________________________________
From: Evert Lammerts [Evert.Lammerts@sara.nl]
Sent: Saturday, May 14, 2011 10:53 AM
To: general@hadoop.apache.org
Subject: RE: Stability issue - dead DN's

Ok, I'll give this scenario a try (in spite of the intoxication ;-)).

= putting or getting a file =
A client will contact the NameNode first and get back a list of hostnames. These will resolve
to addresses either in public or in private space, depending on whether the request to the
nameserver was made by a machine in public or in private space. Each node has one NIC listening
on its address in private space and one on its address in public space, and the Hadoop daemons
are bound to 0.0.0.0:*. The nameserver will return an address in private space when the client
connects from one of the nodes, and (obviously) an address in public space when the request
comes in over the WAN.
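
For reference, "bound to 0.0.0.0:*" corresponds roughly to these hdfs-site.xml entries. A
sketch only - the ports are the ones that appear in the DN log further down, everything else
about our config is omitted:

  <property>
    <name>dfs.datanode.address</name>       <!-- data transfer port, 50050 in our logs -->
    <value>0.0.0.0:50050</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>   <!-- ipcPort=50020 in our logs -->
    <value>0.0.0.0:50020</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>  <!-- infoPort=50075 in our logs -->
    <value>0.0.0.0:50075</value>
  </property>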

I'm not sure what could go wrong here... On Monday I'll recheck this scenario with our HPN
guys as well.

Cheers,
Evert


________________________________________
From: Segel, Mike [msegel@navteq.com]
Sent: Saturday, May 14, 2011 12:33 AM
To: general@hadoop.apache.org
Subject: Re: Stability issue - dead DN's

Ok...

Hum, look, I've been force-fed a couple of margaritas, so my memory is a bit foggy...
You say your clients connect on NIC A. Your cluster connects on NIC B.

What happens when you want to upload a file from your client to HDFS? Or even access it?

... ;-)



Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:15 PM, "Evert Lammerts" <Evert.Lammerts@sara.nl> wrote:

> Hi Mike,
>
> Thanks for trying to help out.
>
> I had a talk with our networking guys this afternoon. According to them (and this is
way out of my area of expertise, so excuse any mistakes) multiple interfaces shouldn't be
a problem. We could set up a nameserver to resolve hostnames to addresses in our private space
when the request comes from one of the nodes, and route this traffic over a single interface.
Any other request would resolve to an address in public space, which is bound to another
interface. In our current setup we're not even resolving hostnames in our private address
space through a nameserver - we do it with an ugly hack in /etc/hosts. And it seems to work
alright.
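>
> Concretely, the hack is just a static mapping like this in /etc/hosts on every node (the
> hostnames below are made up; the private addresses are the ones that show up in the log
> further down):
>
>   192.168.28.210  dn01.cluster.private  dn01
>   192.168.28.211  dn02.cluster.private  dn02
>   192.168.28.212  dn03.cluster.private  dn03
>   192.168.28.213  dn04.cluster.private  dn04
>   192.168.28.214  dn05.cluster.private  dn05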
>
> Having said that, our problems are still not completely gone even after adjusting the
maximum allowed RAM for tasks - although things are a lot better. While writing this mail,
three out of five DNs were marked as dead. There is still some swapping going on, but the
cores are not spending any time in I/O wait, so this shouldn't be the cause of anything. See
below for a trace from a dead DN - any thoughts are appreciated!
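>
> By "adjusting the maximum allowed RAM for tasks" I mean roughly these mapred-site.xml
> knobs - the values below are illustrative, not our exact settings:
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>4</value>         <!-- concurrent map slots per node -->
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>2</value>         <!-- concurrent reduce slots per node -->
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx512m</value>  <!-- heap per task JVM -->
>   </property>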
>
> Cheers,
> Evert
>
> 2011-05-13 23:13:27,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received
block blk_-9131821326787012529_2915672 src: /192.168.28.211:60136 dest: /192.168.28.214:50050
of size 382425
> 2011-05-13 23:13:27,915 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
in receiveBlock for block blk_-9132067116195286882_130888 java.io.EOFException: while trying
to read 3744913 bytes
> 2011-05-13 23:13:27,925 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:35139, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001437_0,
offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 6254000
> 2011-05-13 23:13:28,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received
block blk_-9149862728087355005_3793421 src: /192.168.28.210:41197 dest: /192.168.28.214:50050
of size 245767
> 2011-05-13 23:13:28,033 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9132067116195286882_130888
unfinalized and removed.
> 2011-05-13 23:13:28,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-9132067116195286882_130888 received exception java.io.EOFException: while trying to read
3744913 bytes
> 2011-05-13 23:13:28,033 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 3744913 bytes
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:32910, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001443_0,
offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4323000
> 2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:35138, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001440_0,
offset: 197120, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 5573000
> 2011-05-13 23:13:28,159 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.212:38574, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001444_0,
offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 16939000
> 2011-05-13 23:13:28,209 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received
block blk_-9123390874940601805_2898225 src: /192.168.28.210:44227 dest: /192.168.28.214:50050
of size 300441
> 2011-05-13 23:13:28,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.213:42364, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001451_0,
offset: 198656, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 5291000
> 2011-05-13 23:13:28,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:32930, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0,
offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-1800696633107072247_4099834,
duration: 5099000
> 2011-05-13 23:13:28,256 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.213:42363, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001458_0,
offset: 199680, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4945000
> 2011-05-13 23:13:28,257 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.214:35137, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0,
offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4159000
> 2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
in receiveBlock for block blk_-9140444589483291821_3585975 java.io.EOFException: while trying
to read 100 bytes
> 2011-05-13 23:13:28,258 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9140444589483291821_3585975
unfinalized and removed.
> 2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-9140444589483291821_3585975 received exception java.io.EOFException: while trying to
read 100 bytes
> 2011-05-13 23:13:28,259 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 100 bytes
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001441_0,
offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-5819719631677148140_4098274,
duration: 5625000
> 2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001438_0,
offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368,
duration: 4473000
> 2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):
Exception writing block blk_-9150014886921014525_2267869 to mirror 192.168.28.213:50050
> java.io.IOException: The stream is closed
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
>
> 2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001432_0,
offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_405051931214094755_4098504,
duration: 5597000
> 2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received
block blk_-9150014886921014525_2267869 src: /192.168.28.211:49208 dest: /192.168.28.214:50050
of size 3033173
> 2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received
block blk_-9144765354308563975_3310572 src: /192.168.28.211:51592 dest: /192.168.28.214:50050
of size 242383
>
> ________________________________________
> From: Segel, Mike [msegel@navteq.com]
> Sent: Friday, May 13, 2011 2:36 PM
> To: general@hadoop.apache.org
> Cc: <cdh-user@cloudera.org>; <general@hadoop.apache.org>
> Subject: Re: Stability issue - dead DN's
>
> Bonded will work, but you may not see the performance you would expect. If you need more
than 1 GbE, go 10GbE - less headache and even more headroom.
>
> Multiple interfaces won't work. Or I should say didn't work in past releases.
> If you think about it, clients have to connect to each node. So having two interfaces
and trying to manage them makes no sense.
>
> Add to this trying to manage it all in DNS... Why make more work for yourself?
> Going from memory... It looked like your rDNS had to match your hostnames, so your internal
interfaces had to match hostnames, and you ended up with an inverted network.
>
> If you draw out your network topology you end up with a ladder.
> You would be better off (IMHO) creating a subnet where only your edge servers are dual-NIC'd.
> But then if your cluster is for development... now your PCs can't be used as clients...
>
> Does this make sense?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 13, 2011, at 4:57 AM, "Evert Lammerts" <Evert.Lammerts@sara.nl> wrote:
>
>> Hi Mike,
>>
>>> You really really don't want to do this.
>>> Long story short... It won't work.
>>
>> Can you elaborate? Are you talking about the bonded interfaces or about having a
separated network for interconnects and external network? What can go wrong there?
>>
>>>
>>> Just a suggestion.. You don't want anyone on your cluster itself. They
>>> should interact wit edge nodes, which are 'Hadoop aware'. Then your
>>> cluster has a single network to worry about.
>>
>> That's our current setup. We have a single headnode that is used as a SPOE (single point
of entry). However, I'd like to change that on our future production system. We want to
implement Kerberos for authentication and let users interact with the cluster from their own
machines. This would enable them to submit their jobs from their local IDE. As I understand
it, the only way to do this is by opening up the Hadoop ports to the world: if people interact
with HDFS they need to be able to reach all of the nodes, right? What would be the argument
against this?
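>>
>> From what I can tell, that means at least these ports need to be reachable (a sketch based
>> on the default configs of this era - note that we remap the DN data port to 50050):
>>
>>   fs.default.name           NameNode RPC       - often 8020 or 9000
>>   dfs.datanode.address      DN data transfer   - 50010 by default, 50050 here
>>   dfs.datanode.ipc.address  DN IPC             - 50020 by default
>>   mapred.job.tracker        JobTracker RPC     - often 8021 or 9001
>>
>> ...with the DN ports open on every node, for every client machine.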
>>
>> Cheers,
>> Evert
>>
>>>
>>>
>>> Sent from a remote device. Please excuse any typos...
>>>
>>> Mike Segel
>>>
>>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <aw@apache.org> wrote:
>>>
>>>>
>>>>
>>>>
>>>>
>>>>>> * a 2x1GE bonded network interface for interconnects
>>>>>> * a 2x1GE bonded network interface for external access
>>>>
>>>>  Multiple NICs on a box can sometimes cause big performance
>>> problems with Hadoop.  So watch your traffic carefully.
>>>>
>>>>
>>>>


