hadoop-general mailing list archives

From Evert Lammerts <Evert.Lamme...@sara.nl>
Subject RE: Stability issue - dead DN's
Date Wed, 11 May 2011 13:36:11 GMT
Hi James,

Hadoop version is 0.20.2 (you'll find that and more about our setup in my first mail, under the
heading "The cluster").

Below are I) an example stack trace from losing a datanode and II) an example of a "Could not
obtain block" IOException.

Cheers,
Evert

11/05/11 15:06:43 INFO hdfs.DFSClient: Failed to connect to /192.168.28.214:50050, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.28.209:50726 remote=/192.168.28.214:50050]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readShort(DataInputStream.java:295)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1478)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:1)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
        at nl.liacs.infrawatch.hadoop.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:85)
        at nl.liacs.infrawatch.hadoop.kmeans.Job.run(Job.java:171)
        at nl.liacs.infrawatch.hadoop.kmeans.Job.main(Job.java:74)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
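
For reference: the "60000 millis timeout" above is the DFS client's default socket read timeout.
If the datanodes turn out to be slow rather than dead, one knob we could try is raising that
timeout in hdfs-site.xml. A sketch, not something we've tested - the key exists in 0.20.x, but
the value below is just a guess:

# client read timeout; default is 60000 ms, tripled here to ride out overloaded DNs
dfs.socket.timeout: 180000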


11/05/10 09:43:47 INFO mapred.JobClient:  map 82% reduce 17%
11/05/10 09:44:39 INFO mapred.JobClient: Task Id : attempt_201104121440_0122_m_000225_0, Status : FAILED
java.io.IOException: Could not obtain block: blk_4397122445076815438_4097927 file=/user/joaquin/data/20081201/20081201.039
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1800)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at nl.liacs.infrawatch.hadoop.kmeans.KeyValueLineRecordReader.nextKeyValue(KeyValueLineRecordReader.java:94)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
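
When one of these "Could not obtain block" errors shows up, running fsck on the affected file
should show which datanodes hold (or held) the replicas - e.g., with the 0.20.x flags:

hadoop fsck /user/joaquin/data/20081201/20081201.039 -files -blocks -locations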

> -----Original Message-----
> From: James Seigel [mailto:james@tynt.com]
> Sent: woensdag 11 mei 2011 14:54
> To: general@hadoop.apache.org
> Subject: Re: Stability issue - dead DN's
>
> Evert,
>
> What's the stack trace, and what version of Hadoop do you have
> installed, Sir?
>
> James.
> On 2011-05-11, at 3:23 AM, Evert Lammerts wrote:
>
> > Hi list,
> >
> > I notice that whenever our Hadoop installation is put under heavy
> > load we lose one or two (out of a total of five) datanodes. This
> > results in IOExceptions, and affects the overall performance of the
> > job being run. Can anybody give me advice or best practices on a
> > different configuration to increase stability? Below I've included
> > the specs of the cluster, the Hadoop-related config and an example
> > of what goes wrong and when. Any help is very much appreciated, and
> > if I can provide any other info please let me know.
> >
> > Cheers,
> > Evert
> >
> > == What goes wrong, and when ==
> >
> > Attached is a screenshot of Ganglia while the cluster is under the
> > load of a single job. This job:
> > * reads ~1TB from HDFS
> > * writes ~200GB to HDFS
> > * runs 288 Mappers and 35 Reducers
> >
> > When the job runs it takes all available map and reduce slots. The
> > system starts swapping and there is a short interval during which
> > most cores are in WAIT. After that the job really starts running.
> > About halfway through, one or two datanodes become unreachable and
> > are marked as dead nodes. The number of under-replicated blocks
> > becomes huge. Then some "java.io.IOException: Could not obtain
> > block" exceptions are thrown in mappers. The job does manage to
> > finish successfully after around 3.5 hours, but my fear is that
> > when we make the input much larger - which we want - the system
> > will become too unstable to finish the job.
> >
> > Maybe worth mentioning - you never know what might help
> > diagnostics: we notice that memory usage drops when we switch our
> > keys from Text to LongWritable, and the mappers finish in a
> > fraction of the time. However, for some reason this results in much
> > more network traffic and makes the reducers extremely slow. We're
> > working on figuring out what causes this.
> >
> >
> > == The cluster ==
> >
> > We have a cluster that consists of 6 Sun Thumpers running Hadoop
> > 0.20.2 on CentOS 5.5. One of them acts as NN and JT; the other 5
> > run DNs and TTs. Each node has:
> > * 16GB RAM
> > * 32GB swap space
> > * 4 cores
> > * 11 LVMs of 4 x 500GB disks (2TB per LVM) for HDFS
> > * non-HDFS stuff on separate disks
> > * a 2x1GbE bonded network interface for interconnects
> > * a 2x1GbE bonded network interface for external access
> >
> > I realize that this is not a well-balanced system, but it's what we
> > had available for a prototype environment. We're working on putting
> > together a specification for a much larger production environment.
> >
> >
> > == Hadoop config ==
> >
> > Here some properties that I think might be relevant:
> >
> > __CORE-SITE.XML__
> >
> > fs.inmemory.size.mb: 200
> > mapreduce.task.io.sort.factor: 100
> > mapreduce.task.io.sort.mb: 200
> > # 4MB (1024*1024*4 bytes), the blocksize of the LVMs
> > io.file.buffer.size: 4194304
> >
> > __HDFS-SITE.XML__
> >
> > # 128MB (1024*1024*4*32 bytes), 32 times the blocksize of the LVMs
> > dfs.block.size: 134217728
> > # Only 5 DNs, but this shouldn't hurt
> > dfs.namenode.handler.count: 40
> > # This got rid of the occasional "Could not obtain block"s
> > dfs.datanode.max.xcievers: 4096
> >
> > __MAPRED-SITE.XML__
> >
> > mapred.tasktracker.map.tasks.maximum: 4
> > mapred.tasktracker.reduce.tasks.maximum: 4
> > mapred.child.java.opts: -Xmx2560m
> > mapreduce.reduce.shuffle.parallelcopies: 20
> > mapreduce.map.java.opts: -Xmx512m
> > mapreduce.reduce.java.opts: -Xmx512m
> > # Compression codecs are configured and seem to work fine
> > mapred.compress.map.output: true
> > mapred.map.output.compression.codec: com.hadoop.compression.lzo.LzoCodec
> >
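
P.S. Something we still need to rule out ourselves: with 4 map + 4 reduce slots per node and
mapred.child.java.opts at -Xmx2560m, the child JVMs alone can claim 8 x 2.5GB = 20GB of heap on
a 16GB machine. If I read 0.20.2 correctly, the mapreduce.map.java.opts /
mapreduce.reduce.java.opts keys are only picked up in later versions, so -Xmx2560m applies to
every child. That alone could explain the swapping and the DNs timing out. A more conservative
sketch for mapred-site.xml (the value is just an illustration):

# 8 children x 1.5GB = 12GB, leaving headroom for the DN and TT daemons
mapred.child.java.opts: -Xmx1536m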

