hadoop-common-user mailing list archives

From: Mark question <markq2...@gmail.com>
Subject: Re: Too many open files Error
Date: Fri, 27 Jan 2012 17:54:17 GMT
Hi Harsh and Idris ... so the only drawback of increasing the value of
xcievers is the memory issue? In that case I'll set it to a value that
doesn't fill the memory ...
Thanks,
Mark

On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali <psychidris@gmail.com> wrote:

> Hi Mark,
>
> As Harsh pointed out, it is not a good idea to increase the xceiver count to
> an arbitrarily high value; I suggested increasing the xceiver count just to
> unblock execution of your program temporarily.
>
> Thanks,
> -Idris
>
> On Fri, Jan 27, 2012 at 10:39 AM, Harsh J <harsh@cloudera.com> wrote:
>
> > You are technically allowing the DN to run 1 million block transfer
> > (in/out) threads by doing that. It does not take up resources by
> > default, sure, but it can now be abused with requests that make your DN
> > run out of memory and crash, because it is no longer bound to proper limits.
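
A rough sense of the scale involved (a back-of-the-envelope sketch, assuming the JVM's common default of roughly 1 MB of stack per thread; the real -Xss value on your datanodes may differ):

    # 1,048,576 permitted DataXceiver threads, each with its own thread stack.
    # Stack memory alone, before any heap is counted:
    THREADS=1048576
    STACK_MB=1            # assumed ~1 MB default thread stack; check your JVM's -Xss
    echo "$(( THREADS * STACK_MB )) MB"    # ~1 TB of stack if the limit were ever reached
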
> >
> > On Fri, Jan 27, 2012 at 5:49 AM, Mark question <markq2011@gmail.com>
> > wrote:
> > > Harsh, could you explain briefly why the 1M setting for xceivers is bad? The
> > > job is working now ...
> > > About the ulimit -u, it shows 200703, so is that why the connection is reset
> > > by peer? How come it's working with the xceiver modification?
> > >
> > > Thanks,
> > > Mark
> > >
> > >
> > > On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <harsh@cloudera.com> wrote:
> > >
> > >> Agree with Raj V here - your problem should not be the # of transfer
> > >> threads nor the number of open files, given that stacktrace.
> > >>
> > >> And the values you've set for the transfer threads are far beyond the
> > >> recommended 4k/8k - I would not recommend doing that. The default in
> > >> 1.0.0 is 256, but set it to 2048/4096, which are good values to have
> > >> when noticing increased HDFS load, or when running services like
> > >> HBase.
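
A minimal sketch of the more conservative configuration described above, using the 4096 upper bound and the property name quoted elsewhere in this thread; the script paths assume a standard Hadoop 1.x tarball layout:

    # Add inside the <configuration> element of conf/hdfs-site.xml on each datanode:
    #   <property>
    #     <name>dfs.datanode.max.xcievers</name>
    #     <value>4096</value>
    #   </property>
    # Then restart HDFS so the datanodes pick up the new limit:
    bin/stop-dfs.sh && bin/start-dfs.sh
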
> > >>
> > >> You should instead focus on why it's this particular job (or even this
> > >> particular task, which is important to notice) that fails, and not
> > >> other jobs (or other task attempts).
> > >>
> > >> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <rajvish@yahoo.com> wrote:
> > >> > Mark
> > >> >
> > >> > You have this "Connection reset by peer". Why do you think this
> > >> > problem is related to too many open files?
> > >> >
> > >> > Raj
> > >> >
> > >> >
> > >> >
> > >> >>________________________________
> > >> >> From: Mark question <markq2011@gmail.com>
> > >> >>To: common-user@hadoop.apache.org
> > >> >>Sent: Thursday, January 26, 2012 11:10 AM
> > >> >>Subject: Re: Too many open files Error
> > >> >>
> > >> >>Hi again,
> > >> >>I've tried :
> > >> >>     <property>
> > >> >>        <name>dfs.datanode.max.xcievers</name>
> > >> >>        <value>1048576</value>
> > >> >>      </property>
> > >> >>but I'm still getting the same error ... how high can I go??
> > >> >>
> > >> >>Thanks,
> > >> >>Mark
> > >> >>
> > >> >>
> > >> >>
> > >> >>On Thu, Jan 26, 2012 at 9:29 AM, Mark question <markq2011@gmail.com> wrote:
> > >> >>
> > >> >>> Thanks for the reply.... I have nothing about dfs.datanode.max.xceivers
> > >> >>> in my hdfs-site.xml, so hopefully this would solve the problem. And about
> > >> >>> the ulimit -n, I'm running on an NFS cluster, so usually I just start Hadoop
> > >> >>> with a single bin/start-all.sh ... Do you think I can add it by
> > >> >>> bin/Datanode -ulimit n ?
> > >> >>>
> > >> >>> Mark
> > >> >>>
> > >> >>>
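
For reference, the usual way to raise the open-file limit persistently is on the datanode hosts themselves, rather than via a flag to the datanode script. A minimal sketch, assuming the datanodes run as a Linux user named hdfs (both the user name and the 65536 value are illustrative):

    # As root, add to /etc/security/limits.conf on each datanode host:
    #   hdfs  soft  nofile  65536
    #   hdfs  hard  nofile  65536
    # The new limit applies to fresh login sessions; verify with:
    su - hdfs -c 'ulimit -n'
    # Then restart the datanodes so their JVMs start under the raised limit.
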
> > >> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <mapred.learn@gmail.com> wrote:
> > >> >>>
> > >> >>>> You need to set ulimit -n <bigger value> on the datanode and restart the datanodes.
> > >> >>>>
> > >> >>>> Sent from my iPhone
> > >> >>>>
> > >> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali <psychidris@gmail.com> wrote:
> > >> >>>>
> > >> >>>> > Hi Mark,
> > >> >>>> >
> > >> >>>> > On a lighter note, what is the count of xceivers - the
> > >> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
> > >> >>>> >
> > >> >>>> > Thanks,
> > >> >>>> > -idris
> > >> >>>> >
> > >> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <michael_segel@hotmail.com> wrote:
> > >> >>>> >
> > >> >>>> >> Sorry, going from memory...
> > >> >>>> >> As the hadoop, mapred, or hdfs user, what do you see when you do a
> > >> >>>> >> ulimit -a? That should give you the number of open files allowed for
> > >> >>>> >> a single user...
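
A small sketch for checking both views on a datanode host: the per-user limits Michel mentions, and the limit the running DataNode JVM actually got (the pgrep pattern is illustrative):

    # As the user that runs the datanode (hadoop/mapred/hdfs, depending on your setup):
    ulimit -a | grep -i 'open files'
    # What the live DataNode process is actually bound to:
    DN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode' | head -1)
    grep 'Max open files' "/proc/$DN_PID/limits"
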
> > >> >>>> >>
> > >> >>>> >>
> > >> >>>> >> Sent from a remote device. Please excuse any typos...
> > >> >>>> >>
> > >> >>>> >> Mike Segel
> > >> >>>> >>
> > >> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question <markq2011@gmail.com> wrote:
> > >> >>>> >>
> > >> >>>> >>> Hi guys,
> > >> >>>> >>>
> > >> >>>> >>>  I get this error from a job trying to process 3 million records.
> > >> >>>> >>>
> > >> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
> > >> >>>> >>>    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> > >> >>>> >>>
> > >> >>>> >>> When I checked the logfile of the datanode-20, I see:
> > >> >>>> >>>
> > >> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver
> > >> >>>> >>> java.io.IOException: Connection reset by peer
> > >> >>>> >>>    at sun.nio.ch.FileDispatcher.read0(Native Method)
> > >> >>>> >>>    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > >> >>>> >>>    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> > >> >>>> >>>    at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> > >> >>>> >>>    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> > >> >>>> >>>    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > >> >>>> >>>    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> > >> >>>> >>>    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > >> >>>> >>>    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > >> >>>> >>>    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >> >>>> >>>    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >> >>>> >>>    at java.io.DataInputStream.read(DataInputStream.java:132)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> > >> >>>> >>>    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> > >> >>>> >>>    at java.lang.Thread.run(Thread.java:662)
> > >> >>>> >>>
> > >> >>>> >>>
> > >> >>>> >>> Which is because I'm running 10 maps per taskTracker on a 20 node
> > >> >>>> >>> cluster, and each map opens about 300 files, so that should give 6000
> > >> >>>> >>> opened files at the same time ... why is this a problem? The maximum
> > >> >>>> >>> # of files per process on one machine is:
> > >> >>>> >>>
> > >> >>>> >>> cat /proc/sys/fs/file-max   ---> 2403545
> > >> >>>> >>>
> > >> >>>> >>> Any suggestions?
> > >> >>>> >>>
> > >> >>>> >>> Thanks,
> > >> >>>> >>> Mark
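
One note on the numbers above: /proc/sys/fs/file-max is the system-wide ceiling on open file handles, while the limit that usually triggers "Too many open files" is the per-process one reported by ulimit -n (often only 1024 by default). A quick sketch comparing the two:

    # System-wide ceiling on open file handles:
    cat /proc/sys/fs/file-max
    # Per-process ceiling for the current user/shell -- the one that matters
    # for a single task JVM or datanode JVM:
    ulimit -n
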
> > >> >>>> >>
> > >> >>>>
> > >> >>>
> > >> >>>
> > >> >>
> > >> >>
> > >> >>
> > >>
> > >>
> > >>
> > >> --
> > >> Harsh J
> > >> Customer Ops. Engineer, Cloudera
> > >>
> >
> >
> >
> > --
> > Harsh J
> > Customer Ops. Engineer, Cloudera
> >
>
