From: Xu-Feng Mao
To: user@hbase.apache.org, Andrew Purtell
Cc: "hbase-user@hadoop.apache.org"
Date: Tue, 23 Aug 2011 12:56:47 +0800
Subject: Re: The number of fd and CLOSE_WAIT keep increasing.

Thanks Andy!

cdh3u1 is based on hbase 0.90.3, which has some nice admin scripts, like
graceful_stop.sh.

Is it easy to upgrade hbase from cdh3u0 to cdh3u1? I guess we can simply
replace the package and keep our own configuration, right?

Thanks and regards,

Mao Xu-Feng

On Tue, Aug 23, 2011 at 5:10 AM, Andrew Purtell wrote:

> > We are running cdh3u0 hbase/hadoop suites on 28 nodes.
>
> For your information, CDHU1 does contain this:
>
>     Author: Eli Collins
>     Date: Tue Jul 5 16:02:22 2011 -0700
>
>         HDFS-1836. Thousand of CLOSE_WAIT socket.
>
>         Reason: Bug
>         Author: Bharath Mundlapudi
>         Ref: CDH-3200
>
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> ----- Original Message -----
> > From: Xu-Feng Mao
> > To: hbase-user@hadoop.apache.org; user@hbase.apache.org
> > Cc:
> > Sent: Monday, August 22, 2011 4:58 AM
> > Subject: Re: The number of fd and CLOSE_WAIT keep increasing.
> >
> > On average, we have about 3000 CLOSE_WAIT, while on the three problematic
> > regionservers we have about 30k CLOSE_WAIT.
> > We set the open files limit to 130k, so it works OK for now, but it does
> > not seem healthy.
> >
> > On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao wrote:
> >
> >> Hi,
> >>
> >> We are running cdh3u0 hbase/hadoop suites on 28 nodes. Since last Friday,
> >> on three regionservers the number of open fds and CLOSE_WAIT sockets has
> >> kept increasing.
> >>
> >> It looks like when lines such as
> >>
> >> ====
> >> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
> >> STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240.
> >> has too many store files; delaying flush up to 90000ms
> >> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
> >> STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2.
> >> has too many store files; delaying flush up to 90000ms
> >> ====
> >>
> >> increase, the number of open fds and CLOSE_WAIT sockets increases
> >> accordingly.
> >>
> >> We're not sure whether it is some kind of fd leak hit on some unexpected
> >> circumstance or exceptional code path.
> >>
> >> With netstat -antp, we found there are lots of connections like
> >>
> >> ====
> >> Proto Recv-Q Send-Q Local Address          Foreign Address        State       PID/Program name
> >> tcp       65      0 10.150.161.64:23241    10.150.161.64:50010    CLOSE_WAIT  27748/java
> >> ====
> >>
> >> The connections stay in this state. It looks as if, for some connections
> >> to HDFS, the datanode has sent its FIN but the regionserver never drains
> >> the receive queue or closes its end of the socket, so the fds and
> >> CLOSE_WAIT sockets are probably leaked.
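(For anyone chasing the same symptom: a quick way to watch a leak like the one
described above is to count, per process, the CLOSE_WAIT sockets pointing at
the local datanode. This is only a sketch; the PID 27748 and port 50010 below
are taken from the netstat sample above and must be replaced with your own
regionserver's values.)

====
# Example values only: 27748 is the regionserver PID and 50010 the datanode
# port from the netstat sample above; adjust both for your own nodes.
RS_PID=27748
DN_PORT=50010

# Total file descriptors currently held open by the regionserver process.
ls /proc/$RS_PID/fd | wc -l

# CLOSE_WAIT sockets this regionserver still holds toward the datanode.
netstat -antp 2>/dev/null | awk -v pid="$RS_PID" -v port="$DN_PORT" \
  '$6 == "CLOSE_WAIT" && $5 ~ (":" port "$") && $7 ~ ("^" pid "/") { n++ } END { print n+0 }'
====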
> >>
> >> We also see some logs like
> >>
> >> ====
> >> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
> >> connect to /10.150.161.73:50010, add to deadNodes and continue
> >> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.150.161.64:55229, remote=/10.150.161.73:50010
> >> for file /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241 for block 2791681537571770744_132142063
> >>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
> >>         at java.io.DataInputStream.read(DataInputStream.java:132)
> >>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
> >>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
> >>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
> >>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
> >>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
> >>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
> >>         at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
> >>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
> >>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
> >>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
> >>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
> >>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
> >>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
> >>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> >> ====
> >>
> >> The number of these is much lower than the number of "too many store
> >> files" WARNs, so this might not be the cause of the excessive fds, but is
> >> it dangerous to the whole cluster?
> >>
> >> Thanks and regards,
> >>
> >> Mao Xu-Feng
> >>
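(On the upgrade question at the top of the thread: with the 0.90.x admin
scripts, one way to move from cdh3u0 to cdh3u1 without taking the whole
cluster down is to cycle regionservers one at a time, roughly as sketched
below. The hostnames, package name, and install path are placeholders, and the
exact graceful_stop.sh options vary by release, so check the script itself
before relying on this.)

====
# Rough sketch only: roll the upgrade through regionservers one at a time.
# Hostnames, the package name, and the install path are placeholders; adjust
# them to match the CDH packaging on your nodes.
for rs in rs-node-01 rs-node-02 rs-node-03; do
  # Drain regions off the node and stop its regionserver gracefully
  # (graceful_stop.sh ships with HBase 0.90.x).
  ./bin/graceful_stop.sh "$rs"

  # Upgrade the HBase package on that node, keeping the existing
  # configuration files in place.
  ssh "$rs" 'yum -y upgrade hadoop-hbase'

  # Bring the regionserver back up on the upgraded node; the path depends on
  # where the package installs HBase.
  ssh "$rs" '/usr/lib/hbase/bin/hbase-daemon.sh start regionserver'
done
====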