hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Too many open files Error
Date Fri, 27 Jan 2012 05:09:59 GMT
You are technically allowing the DN to run up to 1 million block transfer
(in/out) threads by doing that. Sure, it does not take up resources by
default, but it can now be abused with requests that make your DN run
out of memory and crash, because it is no longer bound to proper limits.
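For reference, a bounded setting along the lines of the 2048/4096 guidance quoted below might look like this in hdfs-site.xml (the 4096 value is illustrative, not a measured recommendation):

```xml
<!-- hdfs-site.xml: cap DataNode transfer (xceiver) threads at a sane
     bound instead of 1048576; 4096 is an illustrative value -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

The point is that the limit exists to fail fast under abusive load rather than let the DN exhaust its heap.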

On Fri, Jan 27, 2012 at 5:49 AM, Mark question <markq2011@gmail.com> wrote:
> Harsh, could you explain briefly why the 1M setting for xceivers is bad? The
> job is working now ...
> About ulimit -u: it shows 200703, so is that why the connection is reset by
> peer? How come it works with the xceiver modification?
>
> Thanks,
> Mark
>
>
> On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <harsh@cloudera.com> wrote:
>
>> Agree with Raj V here - your problem should not be the # of transfer
>> threads nor the number of open files, given that stack trace.
>>
>> Also, the value you've set for the transfer threads is far beyond the
>> recommended 4k/8k - I would not recommend doing that. The default in
>> 1.0.0 is 256; set it to 2048/4096, which are good values to have when
>> you notice increased HDFS load, or when running services like HBase.
>>
>> You should instead focus on why it's this particular job (or even this
>> particular task, which is important to notice) that fails, and not
>> other jobs (or other task attempts).
>>
>> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <rajvish@yahoo.com> wrote:
>> > Mark
>> >
>> > You have this "Connection reset by peer". Why do you think this problem
>> > is related to too many open files?
>> >
>> > Raj
>> >
>> >
>> >
>> >>________________________________
>> >> From: Mark question <markq2011@gmail.com>
>> >>To: common-user@hadoop.apache.org
>> >>Sent: Thursday, January 26, 2012 11:10 AM
>> >>Subject: Re: Too many open files Error
>> >>
>> >>Hi again,
>> >>I've tried:
>> >>     <property>
>> >>        <name>dfs.datanode.max.xcievers</name>
>> >>        <value>1048576</value>
>> >>      </property>
>> >>but I'm still getting the same error ... how high can I go??
>> >>
>> >>Thanks,
>> >>Mark
>> >>
>> >>
>> >>
>> >>On Thu, Jan 26, 2012 at 9:29 AM, Mark question <markq2011@gmail.com> wrote:
>> >>
>> >>> Thanks for the reply.... I have nothing about dfs.datanode.max.xceivers
>> >>> in my hdfs-site.xml, so hopefully this will solve the problem. About
>> >>> ulimit -n: I'm running on an NFS cluster, so usually I just start Hadoop
>> >>> with a single bin/start-all.sh ... Do you think I can add it via
>> >>> bin/Datanode -ulimit n ?
>> >>>
>> >>> Mark
>> >>>
>> >>>
>> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <mapred.learn@gmail.com> wrote:
>> >>>
>> >>>> You need to set ulimit -n <bigger value> on the datanodes and restart
>> >>>> them.
>> >>>>
>> >>>> Sent from my iPhone
>> >>>>
>> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali <psychidris@gmail.com> wrote:
>> >>>>
>> >>>> > Hi Mark,
>> >>>> >
>> >>>> > On a lighter note, what is the count of xceivers? What is the
>> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
>> >>>> >
>> >>>> > Thanks,
>> >>>> > -idris
>> >>>> >
>> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <michael_segel@hotmail.com> wrote:
>> >>>> >
>> >>>> >> Sorry going from memory...
>> >>>> >> As user hadoop or mapred or hdfs, what do you see when you do a
>> >>>> >> ulimit -a? That should give you the number of open files allowed
>> >>>> >> by a single user...
>> >>>> >>
>> >>>> >>
>> >>>> >> Sent from a remote device. Please excuse any typos...
>> >>>> >>
>> >>>> >> Mike Segel
>> >>>> >>
>> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question <markq2011@gmail.com> wrote:
>> >>>> >>
>> >>>> >>> Hi guys,
>> >>>> >>>
>> >>>> >>>  I get this error from a job trying to process 3 million records.
>> >>>> >>>
>> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
>> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
>> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
>> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>> >>>> >>>
>> >>>> >>> When I checked the logfile of datanode-20, I see:
>> >>>> >>>
>> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver
>> >>>> >>> java.io.IOException: Connection reset by peer
>> >>>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>> >>>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> >>>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>> >>>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>> >>>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>> >>>> >>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>> >>>> >>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >>>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
>> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
>> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
>> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
>> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
>> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>> >>>> >>>   at java.lang.Thread.run(Thread.java:662)
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> This is because I'm running 10 maps per taskTracker on a 20-node
>> >>>> >>> cluster; each map opens about 300 files, so that should give 6000
>> >>>> >>> open files at the same time ... why is this a problem? The maximum
>> >>>> >>> # of files per process on one machine is:
>> >>>> >>>
>> >>>> >>> cat /proc/sys/fs/file-max   ---> 2403545
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Any suggestions?
>> >>>> >>>
>> >>>> >>> Thanks,
>> >>>> >>> Mark
>> >>>> >>
>> >>>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>>
>>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer, Cloudera
>>
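The ulimit advice in the quoted thread can be sanity-checked against the running process. A rough sketch (the jps-based PID lookup, the Linux /proc layout, and the "hdfs" user name are assumptions about the install, not details from this thread):

```shell
# Sketch: see which open-file limit the DataNode actually runs with.
# Assumes a Linux /proc filesystem and the JDK's jps on the PATH.
DN_PID=$(jps 2>/dev/null | awk '/DataNode/ {print $1}')
if [ -n "$DN_PID" ]; then
  grep 'Max open files' "/proc/$DN_PID/limits"
fi

# The soft limit the daemons inherit from the launching shell:
ulimit -n

# Raising it persistently is typically done in /etc/security/limits.conf,
# e.g. (hypothetical user name):
#   hdfs  soft  nofile  32768
#   hdfs  hard  nofile  32768
```

Note that raising the limit in your own shell only helps daemons started from that shell; daemons launched at boot pick up limits from the init system or limits.conf instead.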



-- 
Harsh J
Customer Ops. Engineer, Cloudera
