From: Xu-Feng Mao
To: user@hbase.apache.org, Andrew Purtell
Cc: "hbase-user@hadoop.apache.org"
Date: Tue, 23 Aug 2011 12:56:47 +0800
Subject: Re: The number of fd and CLOSE_WAIT keep increasing.

Thanks Andy!

cdh3u1 is based on hbase 0.90.3, which has some nice admin scripts, like
graceful_stop.sh.

Is it easy to upgrade hbase from cdh3u0 to cdh3u1? I guess we can simply
replace the package and keep our own configuration, right?

Thanks and regards,

Mao Xu-Feng

On Tue, Aug 23, 2011 at 5:10 AM, Andrew Purtell wrote:

> > We are running cdh3u0 hbase/hadoop suites on 28 nodes.
>
> For your information, CDHU1 does contain this:
>
>     Author: Eli Collins
>     Date: Tue Jul 5 16:02:22 2011 -0700
>
>         HDFS-1836. Thousand of CLOSE_WAIT socket.
>
>         Reason: Bug
>         Author: Bharath Mundlapudi
>         Ref: CDH-3200
>
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> ----- Original Message -----
> > From: Xu-Feng Mao
> > To: hbase-user@hadoop.apache.org; user@hbase.apache.org
> > Cc:
> > Sent: Monday, August 22, 2011 4:58 AM
> > Subject: Re: The number of fd and CLOSE_WAIT keep increasing.
> >
> > On average, we have about 3000 CLOSE_WAIT, while on the three problematic
> > regionservers we have about 30k CLOSE_WAIT.
> > We set the open files limit to 130k, so it works OK for now, but it does
> > not seem healthy.
> >
> > On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao wrote:
> >
> >> Hi,
> >>
> >> We are running cdh3u0 hbase/hadoop suites on 28 nodes. Since last Friday,
> >> on three regionservers the number of open fds and CLOSE_WAIT sockets has
> >> kept increasing.
> >>
> >> It looks like when lines such as
> >>
> >> ====
> >> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
> >> STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240.
> >> has too many store files; delaying flush up to 90000ms
> >> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
> >> STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2.
> >> has too many store files; delaying flush up to 90000ms
> >> ====
> >>
> >> increase, the number of open fds and CLOSE_WAIT sockets increases
> >> accordingly.
> >>
> >> We're not sure whether it is some kind of fd leak hit on some unexpected
> >> circumstance or exceptional code path.
> >>
> >> With netstat -antp, we found there are lots of connections like
> >>
> >> ====
> >> Proto Recv-Q Send-Q Local Address          Foreign Address        State       PID/Program name
> >> tcp       65      0 10.150.161.64:23241    10.150.161.64:50010    CLOSE_WAIT  27748/java
> >> ====
> >>
> >> The connections stay in this state. It looks as if, for some connections
> >> to HDFS, the datanode has sent its FIN but the regionserver never drains
> >> the receive queue or closes its end of the socket, so the fds and
> >> CLOSE_WAIT sockets are probably leaked.
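(For anyone chasing the same symptom: a quick way to watch a leak like the one
described above is to count, per process, the CLOSE_WAIT sockets pointing at
the local datanode. This is only a sketch; the PID 27748 and port 50010 below
are taken from the netstat sample above and must be replaced with your own
regionserver's values.)

====
# Example values only: 27748 is the regionserver PID and 50010 the datanode
# port from the netstat sample above; adjust both for your own nodes.
RS_PID=27748
DN_PORT=50010

# Total file descriptors currently held open by the regionserver process.
ls /proc/$RS_PID/fd | wc -l

# CLOSE_WAIT sockets this regionserver still holds toward the datanode.
netstat -antp 2>/dev/null | awk -v pid="$RS_PID" -v port="$DN_PORT" \
  '$6 == "CLOSE_WAIT" && $5 ~ (":" port "$") && $7 ~ ("^" pid "/") { n++ } END { print n+0 }'
====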
> >>
> >> We also see some logs like
> >>
> >> ====
> >> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
> >> connect to /10.150.161.73:50010, add to deadNodes and continue
> >> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.150.161.64:55229, remote=/10.150.161.73:50010
> >> for file /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241 for block 2791681537571770744_132142063
> >>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
> >>         at java.io.DataInputStream.read(DataInputStream.java:132)
> >>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
> >>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
> >>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
> >>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
> >>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
> >>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
> >>         at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
> >>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
> >>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
> >>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
> >>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
> >>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
> >>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
> >>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> >> ====
> >>
> >> The number of these is much lower than the number of "too many store
> >> files" WARNs, so this might not be the cause of the excessive fds, but is
> >> it dangerous to the whole cluster?
> >>
> >> Thanks and regards,
> >>
> >> Mao Xu-Feng
> >>
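(On the upgrade question at the top of the thread: with the 0.90.x admin
scripts, one way to move from cdh3u0 to cdh3u1 without taking the whole
cluster down is to cycle regionservers one at a time, roughly as sketched
below. The hostnames, package name, and install path are placeholders, and the
exact graceful_stop.sh options vary by release, so check the script itself
before relying on this.)

====
# Rough sketch only: roll the upgrade through regionservers one at a time.
# Hostnames, the package name, and the install path are placeholders; adjust
# them to match the CDH packaging on your nodes.
for rs in rs-node-01 rs-node-02 rs-node-03; do
  # Drain regions off the node and stop its regionserver gracefully
  # (graceful_stop.sh ships with HBase 0.90.x).
  ./bin/graceful_stop.sh "$rs"

  # Upgrade the HBase package on that node, keeping the existing
  # configuration files in place.
  ssh "$rs" 'yum -y upgrade hadoop-hbase'

  # Bring the regionserver back up on the upgraded node; the path depends on
  # where the package installs HBase.
  ssh "$rs" '/usr/lib/hbase/bin/hbase-daemon.sh start regionserver'
done
====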