From: Wolfgang Täger
To: general@lucene.apache.org
Subject: Re: Lucene - Search Optimization Problem
Date: Tue, 26 Feb 2008 08:55:07 +0100

Hi João,

if you need 10.000 or more hits, this might require 10.000 or more disk accesses. Given the access time of disks, there is probably no way to get significantly faster using Lucene on the same hardware.

Either you can organise your data so that it is more local on the hard disk (which you probably can't), or you need to use storage with a lower access time than hard disks: more RAM for caching, an SSD, or other flash drives.

You may try a cheap 8GB USB stick with low access time. Another possibility is to use a suitable OS with at least 8GB of RAM.

If you do so, please share your results.

Best regards,

Wolfgang Täger


"João Rodrigues"
24-02-2008 16:19
Please respond to general@lucene.apache.org
To: general@lucene.apache.org
cc:
Subject: Lucene - Search Optimization Problem

Hello all!

I've finally got round to setting up Lucene 2.3.0 on my two production boxes (Ubuntu 7.10 and Windows XP), after quite a lot of trouble with the JCC compilation methods. Now I have my application all up and running and.... it's damn slow :( I'm running PyLucene by the way, and I've asked on that list already, being directed here.

I have a 6.6GB index, with more than 5.000.000 biomedical abstracts indexed. Each document has two fields: an integer, which I will want to retrieve upon search (the ID of the document, sort of), and an 80-word, stored, tokenized string, which will be searched upon. So, I insert the query (say, foo bar), it first builds a sort of "boolean query" with a format such as: 'foo' AND 'bar'. Then it parses it and spits out the results.

Problem is, unlike most of the posts I've read, I don't want the first 10 or 100 results. I want the first 10.000, or even all of them. I've read that a HitCollector is meant for this task, but my first search on Google got me an expressive "HitCollector is too slow on PyLucene", so I kind of ruled out that option. It takes minutes to get me the results I need, as it is right now. I'll post the code on pastebin and link it for those who feel in a good mood to read n00b's code and help (see below). I've tracked down the problem to the doc.get("PMID") method in the Searcher function.

My question is: how can I make my search faster? My index wasn't optimized because it was huge and it was built with GCC. By now, it is probably optimized (I left an optimizer running last night), so that is taken care of. I've considered threading as well, since I'll perform three different searches per "round". Thing is, I'm pretty green when it comes to programming (I'm a biologist) and I've never really understood how threading works. If someone can point me to the right tutorial or documentation, I'd be glad enough to hack it up myself. Another option I've been given was to use an implementation of Lucene written in either C# or C++. However, Lucene.net isn't up to date, and neither is CLucene.

So, if you think you can give out a tip on how to make my script run faster, I'd thank you more than a lot. It's a shame that my project fails because of this technical handicap :(

LINKS:
http://pastebin.com/m6c384ede -> Main Code
http://pastebin.com/m3484ebfc --> Searcher Functions

Best regards to you all,

João Rodrigues
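
A rough back-of-the-envelope check of the disk-access argument in the reply may help here. The sketch below assumes that each retrieved hit costs exactly one random access to read its stored field, and it uses assumed ballpark access times (about 10 ms for a hard disk, 0.5 ms for flash, 100 ns for RAM). None of these figures come from the thread, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of the time spent on random accesses when
# every retrieved hit needs its own stored-field read.
# All access times below are ASSUMED ballpark figures, not measurements.

def retrieval_seconds(hits, access_time_ms):
    """Total time in seconds if every hit costs one random access."""
    return hits * access_time_ms / 1000.0

# Assumed access time per random read, in milliseconds.
ASSUMED_ACCESS_MS = {
    "hard disk": 10.0,    # assumption: ~10 ms seek plus rotation
    "flash": 0.5,         # assumption: USB stick or SSD
    "RAM cache": 0.0001,  # assumption: ~100 ns
}

if __name__ == "__main__":
    for medium, ms in ASSUMED_ACCESS_MS.items():
        for hits in (10000, 100000):
            seconds = retrieval_seconds(hits, ms)
            print("%-10s %7d hits -> %10.3f s" % (medium, hits, seconds))
```

Under these assumptions, 10.000 hits on a hard disk cost about 100 seconds of access time alone, which would be in line with the "minutes" reported in the original message, while the same number of reads from flash or from RAM would take seconds or less.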