hadoop-common-user mailing list archives

From Idris Ali <psychid...@gmail.com>
Subject Re: Too many open files Error
Date Fri, 27 Jan 2012 06:37:31 GMT
Hi Mark,

As Harsh pointed out, it is not a good idea to increase the xceiver count to
an arbitrarily high value; I suggested increasing it only to unblock execution
of your program temporarily. A more conservative setting is sketched below.
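
For reference, a minimal hdfs-site.xml sketch with a more conservative ceiling
(4096 here is just an example value, in line with the 2048/4096 range Harsh
mentions below; the property-name spelling matches the snippet quoted below):

     <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
     </property>

The datanodes need a restart to pick up the new value.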

Thanks,
-Idris

On Fri, Jan 27, 2012 at 10:39 AM, Harsh J <harsh@cloudera.com> wrote:

> You are technically allowing the DN to run up to one million block transfer
> (in/out) threads by doing that. It does not consume resources by default,
> sure, but it can now be abused with enough requests to make your DN run out
> of memory and crash, because it is no longer bound to a sensible limit.
>
> On Fri, Jan 27, 2012 at 5:49 AM, Mark question <markq2011@gmail.com>
> wrote:
> > Harsh, could you explain briefly why the 1M setting for xceivers is bad?
> > The job is working now...
> > About ulimit -u: it shows 200703, so is that why the connection is reset
> > by peer? How come it works with the xceiver modification?
> >
> > Thanks,
> > Mark
> >
> >
> > On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <harsh@cloudera.com> wrote:
> >
> >> Agree with Raj V here - your problem should not be the # of transfer
> >> threads nor the number of open files, given that stack trace.
> >>
> >> And the values you've set for the transfer threads are far beyond the
> >> recommended 4k/8k - I would not recommend doing that. The default
> >> in 1.0.0 is 256, but set it to 2048/4096, which are good values to have
> >> when you notice increased HDFS load, or when running services like
> >> HBase.
> >>
> >> You should instead focus on why it is this particular job (or even this
> >> particular task, which is important to notice) that fails, and not
> >> other jobs (or other task attempts).
> >>
> >> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <rajvish@yahoo.com> wrote:
> >> > Mark
> >> >
> >> > You have this "Connection reset by peer". Why do you think this
> >> > problem is related to too many open files?
> >> >
> >> > Raj
> >> >
> >> >
> >> >
> >> >>________________________________
> >> >> From: Mark question <markq2011@gmail.com>
> >> >>To: common-user@hadoop.apache.org
> >> >>Sent: Thursday, January 26, 2012 11:10 AM
> >> >>Subject: Re: Too many open files Error
> >> >>
> >> >>Hi again,
> >> >>I've tried:
> >> >>     <property>
> >> >>        <name>dfs.datanode.max.xcievers</name>
> >> >>        <value>1048576</value>
> >> >>     </property>
> >> >>but I'm still getting the same error... How high can I go?
> >> >>
> >> >>Thanks,
> >> >>Mark
> >> >>
> >> >>
> >> >>
> >> >>On Thu, Jan 26, 2012 at 9:29 AM, Mark question <markq2011@gmail.com>
> >> >>wrote:
> >> >>
> >> >>> Thanks for the reply... I have nothing set for
> >> >>> dfs.datanode.max.xceivers in my hdfs-site.xml, so hopefully this
> >> >>> will solve the problem. About ulimit -n: I'm running on an NFS
> >> >>> cluster, so usually I just start Hadoop with a single
> >> >>> bin/start-all.sh... Do you think I can add it by running
> >> >>> bin/Datanode -ulimit n ?
> >> >>>
> >> >>> Mark
> >> >>>
> >> >>>
> >> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn
> >> >>> <mapred.learn@gmail.com> wrote:
> >> >>>
> >> >>>> You need to set ulimit -n <bigger value> on the datanodes and
> >> >>>> restart them (see the sketch after the quoted thread below).
> >> >>>>
> >> >>>> Sent from my iPhone
> >> >>>>
> >> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali <psychidris@gmail.com>
> >> >>>> wrote:
> >> >>>>
> >> >>>> > Hi Mark,
> >> >>>> >
> >> >>>> > On a lighter note, what is your xceiver count, i.e. the
> >> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
> >> >>>> >
> >> >>>> > Thanks,
> >> >>>> > -idris
> >> >>>> >
> >> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel
> >> >>>> > <michael_segel@hotmail.com> wrote:
> >> >>>> >
> >> >>>> >> Sorry, going from memory...
> >> >>>> >> As the hadoop, mapred, or hdfs user, what do you see when you
> >> >>>> >> run ulimit -a? That should give you the number of open files
> >> >>>> >> allowed for a single user...
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> Sent from a remote device. Please excuse any typos...
> >> >>>> >>
> >> >>>> >> Mike Segel
> >> >>>> >>
> >> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question
> >> >>>> >> <markq2011@gmail.com> wrote:
> >> >>>> >>
> >> >>>> >>> Hi guys,
> >> >>>> >>>
> >> >>>> >>> I get this error from a job trying to process 3 million
> >> >>>> >>> records.
> >> >>>> >>>
> >> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> >> >>>> >>>
> >> >>>> >>> When I checked the log file of datanode-20, I see:
> >> >>>> >>>
> >> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >>>> >>> DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369,
> >> >>>> >>> infoPort=50075, ipcPort=50020):DataXceiver
> >> >>>> >>> java.io.IOException: Connection reset by peer
> >> >>>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
> >> >>>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >> >>>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> >> >>>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> >> >>>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> >> >>>> >>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >> >>>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >> >>>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >> >>>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> >> >>>> >>>   at java.lang.Thread.run(Thread.java:662)
> >> >>>> >>>
> >> >>>> >>>
> >> >>>> >>> Which is because I'm running 10 maps per TaskTracker on a
> >> >>>> >>> 20-node cluster, and each map opens about 300 files, so that
> >> >>>> >>> should give 6000 open files at the same time... why is this a
> >> >>>> >>> problem? The maximum # of files per process on one machine is:
> >> >>>> >>>
> >> >>>> >>> cat /proc/sys/fs/file-max   ---> 2403545
> >> >>>> >>>
> >> >>>> >>>
> >> >>>> >>> Any suggestions?
> >> >>>> >>>
> >> >>>> >>> Thanks,
> >> >>>> >>> Mark
> >> >>>> >>
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >>
> >>
> >>
> >> --
> >> Harsh J
> >> Customer Ops. Engineer, Cloudera
> >>
>
>
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera
>
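
As a footnote to the ulimit advice quoted above, a minimal shell sketch of how
the per-user open-file limit is usually checked and raised (the account name
"hadoop" and the value 32768 are only examples; use whatever user actually runs
the DataNode, and pick a limit that fits your workload):

    # /proc/sys/fs/file-max is the system-wide ceiling; the "open files"
    # value reported by ulimit -n is the per-process limit that actually
    # triggers "Too many open files".
    cat /proc/sys/fs/file-max
    su - hadoop -c 'ulimit -n'

    # Raise the per-user limit, e.g. by adding to /etc/security/limits.conf:
    #   hadoop  soft  nofile  32768
    #   hadoop  hard  nofile  32768
    # Then log in again as that user and restart HDFS so the daemons
    # inherit the new limit:
    su - hadoop -c 'ulimit -n'        # should now report 32768
    bin/stop-all.sh && bin/start-all.sh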
