Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 96787 invoked from network); 13 Jul 2005 23:10:06 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 13 Jul 2005 23:10:06 -0000 Received: (qmail 48029 invoked by uid 500); 13 Jul 2005 23:10:04 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 48000 invoked by uid 500); 13 Jul 2005 23:10:03 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 47983 invoked by uid 99); 13 Jul 2005 23:10:03 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Jul 2005 16:10:03 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: neutral (asf.osuosl.org: local policy) Received: from [209.86.89.63] (HELO smtpauth03.mail.atl.earthlink.net) (209.86.89.63) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Jul 2005 16:10:00 -0700 Received: from [66.245.68.111] (helo=ENGELSSERVER) by smtpauth03.mail.atl.earthlink.net with asmtp (Exim 4.34) id 1DsqMa-0006om-07; Wed, 13 Jul 2005 19:10:00 -0400 Reply-To: From: "Robert Engels" To: Cc: "Nathan Brackett" Subject: RE: Search deadlocking under load Date: Wed, 13 Jul 2005 18:10:21 -0500 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180 Importance: Normal In-reply-to: <20050713225329.30679.qmail@web31113.mail.mud.yahoo.com> X-ELNK-Trace: 33cbdd8ed9881ca8776432462e451d7bd15d05d9470ff7108cc9eafefd766fa2181d61c733b15020350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c X-Originating-IP: 66.245.68.111 X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N I had posted an NioFile and caching system that greatly increases the parallelness of Lucene. Although on some platforms (Windows), the low-level NioChannel is not completely thread-safe so it can still block, although the code has some work-arounds for this problem. You can never achieve "100% parallel", as a thread will block doing disk io at some point in the driver (unless everything is in the disk cache), but even without this case, unless you have the same number of processors as threads, there will always be a "blocking/waiting". If the time to perform a search is greater than the time needed for a certain # of requests per second, you will always generate more threads, unless you limit the number of threads accepting requests in the first place. R -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] Sent: Wednesday, July 13, 2005 5:53 PM To: java-dev@lucene.apache.org Cc: Nathan Brackett Subject: RE: Search deadlocking under load This may be better for java-dev@... I've looked at the source of that method, but I don't see a way of removing that synchronized block. Maybe somebody else has ideas, but it looks like the synchronization is there to ensure the file being read is read in fully, without some other thread modifying it "under the reader's feet". Otis --- Nathan Brackett wrote: > Otis, > > After further testing it turns out that the 'deadlock' we're > encountering is > not a deadlock at all, but a result of resin hitting its maximum > number of > allowed threads. We bumped up the max-threads in the config and it > fixed > the problem for a certain amount of load, but we'd much prefer to go > after > the source of the problem, namely: > > As the number of threads hitting lucene increases, contention for > locks > increases, meaning the average response time decreases. This places > us in a > downward spiral of performance because as the incoming number of hits > per > second stays constant, the response time decreases, meaning that the > total > number of threads inside resin doing work will increase. This > problem > compounds itself, escalating the number of threads in resin until we > crash. > > > Admittedly this is a pretty harsh test (~~20 hits per second > triggering > complex searches, which starts fine but then escalates to > 150 > threads as > processing slows down but number of incoming hits per second does > not) > > Our ultimate goal, however, is to have each search be completely and > 100% > parallel. > > The point of contention seems to be the method below: > > FSDirectory.java:486 (class FSInputStream) > > > > protected final void readInternal(byte[] b, int offset, int len) > throws IOException { > synchronized (file) { > long position = getFilePointer(); > if (position != file.position) { > file.seek(position); > file.position = position; > } > int total = 0; > do { > int i = file.read(b, offset+total, len-total); > if (i == -1) > throw new IOException("read past EOF"); > file.position += i; > total += i; > } while (total < len); > } > } > > > > > The threads are usually all lined up to reach this. Why are so many > threads > backed up behind the same instance of FSInputStream.readInternal? > Shouldn't > each search have a different input stream? What would you suggest as > the > best path to achieve 100% parallel searching? Here's a sample of our > thread > dump, you can see 2 threads waiting for the same > FSInputStream$Descriptor > (which is the synchronized(file) above): > > "tcpConnection-8080-11" daemon prio=5 tid=0x08304600 nid=0x8304800 > waiting > for monitor entry [bf494000..bf494d08] > at > org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412) > - waiting to lock <0x2f2b7a38> (a > org.apache.lucene.store.FSInputStream$Descriptor) > at > org.apache.lucene.store.InputStream.refill(InputStream.java:158) > at > org.apache.lucene.store.InputStream.readByte(InputStream.java:43) > at > org.apache.lucene.store.InputStream.readVInt(InputStream.java:83) > at > org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126) > at > org.apache.lucene.search.TermScorer.next(TermScorer.java:55) > at > org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112) > at org.apache.lucene.search.Scorer.score(Scorer.java:37) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92) > at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64) > at org.apache.lucene.search.Hits.(Hits.java:43) > at org.apache.lucene.search.Searcher.search(Searcher.java:33) > at org.apache.lucene.search.Searcher.search(Searcher.java:27) > at > com.nettemps.search.backend.SingleIndexManager.search(SingleIndexManager.jav > a:335) > at > com.nettemps.search.backend.IndexAccessControl.doSearch(IndexAccessControl.j > ava:100) > > "tcpConnection-8080-10" daemon prio=5 tid=0x08336800 nid=0x8336a00 > waiting > for monitor entry [bf4d5000..bf4d5d08] > at > org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412) > - waiting to lock <0x2f2b7a38> (a > org.apache.lucene.store.FSInputStream$Descriptor) > at > org.apache.lucene.store.InputStream.refill(InputStream.java:158) > at > org.apache.lucene.store.InputStream.readByte(InputStream.java:43) > at > org.apache.lucene.store.InputStream.readVInt(InputStream.java:83) > at > org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126) > at > org.apache.lucene.search.TermScorer.next(TermScorer.java:55) > at > org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112) > at org.apache.lucene.search.Scorer.score(Scorer.java:37) > at > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92) > at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64) > at org.apache.lucene.search.Hits.(Hits.java:43) > at org.apache.lucene.search.Searcher.search(Searcher.java:33) > at org.apache.lucene.search.Searcher.search(Searcher.java:27) > at > com.nettemps.search.backend.SingleIndexManager.search(SingleIndexManager.jav > a:335) > > -----Original Message----- > From: Nathan Brackett [mailto:nbrackett@net-temps.com] > Sent: Monday, July 11, 2005 5:43 PM > To: java-user@lucene.apache.org > Subject: RE: Search deadlocking under load > > > Thanks for the advice. That ought to reduce contention a bit in that > particular method. > > I've been reviewing a large amount of thread dumps today and I was > wondering > if it's common to see many threads that look like this: > > "tcpConnection-8080-20" daemon prio=5 tid=0x081ba000 nid=0x810ac00 > waiting > for monitor entry [bf24b000..bf24bd20] > at > org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(Compou > ndFileReader.java:216) > - waiting to lock <0x2ee24c48> (a > org.apache.lucene.store.FSInputStream) > > When I get the deadlock situation, I often see a few of these lying > around, > but no matching thread that actually has the lock on 0x2ee24c48 in > the dump. > Is this normal? Not really a thread dump pro. > > > > -----Original Message----- > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] > Sent: Monday, July 11, 2005 1:57 PM > To: java-user@lucene.apache.org > Subject: RE: Search deadlocking under load > > > Hi Nick, > > Without looking at the source of that method, I'd suggest first > trying > the multifile index format (you can easily convert to it by setting > the > new format on IndexWriter and optimizing it). I'd be interested to > know if this eliminates the problem, or at least makes it harder to > hit. > > Otis > > > --- Nathan Brackett wrote: > > > Hey Otis, > > > > Thanks for the hasty response and apologies for my delayed > response. > > It was > > Friday and time to go :) > > > > The queries we're running are very varied (wildcard, phrase, > normal). > > The > > index is only about a 1/2 gig in size (maybe 250,000 documents). > The > > machine > > is running FreeBSD 5.3 with ~2 gig RAM. > > > > I got a thread dump from right around the time that the process > would > > deadlock and not come back and I noticed that almost all of the > > threads were > > waiting on the same method. Here's what the trace looks like: > (small > > sample > > for the sake of brevity...the real dump is huge) > > > > tcpConnection-8080-32: > > [1] > > > org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal > > (CompoundFileReader.java:217) > > [2] org.apache.lucene.store.InputStream.refill > > (InputStream.java:158) > > [3] org.apache.lucene.store.InputStream.readByte > > (InputStream.java:43) > > [4] org.apache.lucene.store.InputStream.readVInt > > (InputStream.java:83) > > [5] org.apache.lucene.index.SegmentTermDocs.read > > (SegmentTermDocs.java:126) > > [6] org.apache.lucene.search.TermScorer.next (TermScorer.java:55) > > [7] org.apache.lucene.search.BooleanScorer.next > > (BooleanScorer.java:112) > > [8] org.apache.lucene.search.Scorer.score (Scorer.java:37) > > [9] org.apache.lucene.search.IndexSearcher.search > > (IndexSearcher.java:92) > > [10] org.apache.lucene.search.Hits.getMoreDocs (Hits.java:64) > > [11] org.apache.lucene.search.Hits. (Hits.java:43) > > [12] org.apache.lucene.search.Searcher.search (Searcher.java:33) > > [13] org.apache.lucene.search.Searcher.search (Searcher.java:27) > > [14] com.nettemps.search.backend.SingleIndexManager.search > > (SingleIndexManager.java:335) > > [15] com.nettemps.search.backend.IndexAccessControl.doSearch > > (IndexAccessControl.java:100) > > [16] com.nettemps.search.server.SearchServerImpl.searchResumes > > (SearchServerImpl.java:402) > > [17] > > > com.nettemps.search.server.SearchServerReadOnly_Tie.invoke_searchResumes > > (SearchServerReadOnly_Tie.java:93) > > [18] > > com.nettemps.search.server.SearchServerReadOnly_Tie.processingHook > > (SearchServerReadOnly_Tie.java:298) > > [19] com.sun.xml.rpc.server.StreamingHandler.handle > > (StreamingHandler.java:321) > > [20] com.sun.xml.rpc.server.http.JAXRPCServletDelegate.doPost > > (JAXRPCServletDelegate.java:443) > > [21] com.sun.xml.rpc.server.http.JAXRPCServlet.doPost > > (JAXRPCServlet.java:102) > > [22] javax.servlet.http.HttpServlet.service > (HttpServlet.java:165) > > [23] javax.servlet.http.HttpServlet.service > (HttpServlet.java:103) > > [24] com.caucho.server.http.FilterChainServlet.doFilter > > (FilterChainServlet.java:96) > > [25] com.caucho.server.http.Invocation.service > > (Invocation.java:315) > > [26] com.caucho.server.http.CacheInvocation.service > > (CacheInvocation.java:135) > > [27] com.caucho.server.http.HttpRequest.handleRequest > > (HttpRequest.java:253) > > [28] com.caucho.server.http.HttpRequest.handleConnection > > (HttpRequest.java:170) > > [29] com.caucho.server.TcpConnection.run (TcpConnection.java:139) > > [30] java.lang.Thread.run (Thread.java:534) > > > > I took a look at that readInternal method and saw that the > contention > > is > > around an InputStream that I assume reads from the actual index > file > > and > > returns it for use by the method. We are running many threads that > > are > > attempting to do searches at the same time (roughly 30 - 35), so > that > > explains why the search times would go up. > > > > In an attempt to reduce the amount of contention, we synchronized > our > > search > > method (the one that makes the actual call to Lucene's search: [14] > > com.nettemps.search.backend.SingleIndexManager.search > > (SingleIndexManager.java:335)). This also caused the same results > > when hit > > with too many threads. > > > > We're really stuck at this point as to what to try. Any advice? > > > > > > > > -----Original Message----- > > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] > > Sent: Friday, July 08, 2005 3:40 PM > > To: java-user@lucene.apache.org > > Subject: Re: Search deadlocking under load > > > > > > Nathan, > > > > 3) is the recommended usage. > > Your index is on an NFS share, which means you are searching it > over > > the network. Make it local, and you should see performance > > improvements. Local or remove, it makes sense that searches take > > longer to execute, and the load goes up. Yes, it shouldn't > deadlock. > > You shouldn't need to synchronize access to IndexSearcher. > > When your JVM locks up next time, kill it, get the thread dump, and > > send it to the list, so we can try to remove the bottleneck, if > > that's > > possible. > > > > How many queries/second do you run, and what kinds of queries are > > they, > > how big is your index and what kind of hardware (disks, RAM, CPU) > are > > you using? > > > > Otis > > > > --- Nathan Brackett wrote: > > > > > Hey all, > > > > > > We're looking to use Lucene as the back end to our website and > > we're > > > running > > > into an unusual deadlocking problem. > > > > > > For testing purposes, we're just running one web server (threaded > > > environment) against an index mounted on an NFS share. This > machine > > > performs > > > searches only against this index so it's not being touched. We > have > > > tried a > > > few different models so far: > > > > > > 1) Pooling IndexSearcher objects: Occasionally we would run into > > > OutOfMemory > > > problems as we would not block if a request came through and all > > > IndexSearchers were already checked out, we would just create a > > > temporary > > > one and then dispose of it once it was returned to the pool. > > > > > > 2) Create a new IndexSearcher each time: Every request to search > > > would > > > create an IndexSearcher object. This quickly gave OutOfMemory > > errors, > > > even > > > when we would close them out directly after. > > > > > > 3) Use a global IndexSearcher: This is the model we're working > with > > > now. The > > > model holds up fine under low-moderate load and is, in fact, much > > > faster at > > > searching (probably due to some caching mechanism). Under heavy > > load > > > though, > > > the CPU will spike up to 99% and never come back down until we > kill > > > -9 the > > > process. Also, as you ramp the load, we've discovered that search > > > times go > > > up as well. Searches will generally come back after 40ms, but as > > the > > > load > > > goes up the searches don't come back for up to 20 seconds. > > > > > > We've been attempting to find where the problem is for the last > > week > > > with no > > > luck. Our index is optimized, so there is only one file. Do we > need > > > to > > > synchronize access to the global IndexSearcher so that only one > > > search can > > > run at a time? That poses a bit of a problem as if a particular > > > search takes > > > a long time, all others will wait. This problem does not look > like > > an > > > OutOrMemory error because the memory usage when the spike occurs > is > > > usually > > > in the range of 150meg used with a ceiling of 650meg. Anyone else > > > experiencing any problems like this or have any idea where we > > should > > > be > > > looking? Thanks. > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > > For additional commands, e-mail: java-user-help@lucene.apache.org > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > > For additional commands, e-mail: java-user-help@lucene.apache.org > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org