From: Grant Ingersoll
Subject: Re: Fine Tuning Lucene implementation
Date: Wed, 25 Jul 2007 12:34:55 -0400
To: java-user@lucene.apache.org

Yes, you can do that.

On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:

> Here's what I mean:
>
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields
>
> title:"The Right Way" AND text:go
>
> Although, I am not searching for the title "the right way", I am looking
> for the score by specifying a unique field (itemID).
>
> when I do System.out.println(query);
>
> I get:
>
> +contents:Harvard +contents:Business +contents:Review
>
> Can I just add:
>
> +contents:Harvard +contents:Business +contents:Review +itemID=id ??
>
> That query would just return one document.
>
> On 7/25/07, Askar Zaidi wrote:
>>
>> Instead of refactoring the code, would there be a way to just modify the
>> query in each search routine?
>>
>> Such as, "search contents: and item:"; This means it would
>> just collect the score of that one document whose itemID field = itemID
>> passed from while( rs.next()).
>>
>> I just need to collect the score of the already in the index.
>>
>> Would there be a way to modify the query? Add a clause?
>>
>> thanks,
>> Askar
>>
>>
>> On 7/25/07, Grant Ingersoll wrote:
>>>
>>> So, you really want a single Lucene score (based on the scores of
>>> your 4 fields) for every itemID, correct? And this score consists of
>>> scoring the title, tag, summary and body against some keywords,
>>> correct?
>>>
>>> Here's what I would do:
>>>
>>> while (rs.next())
>>> {
>>>     doc = getDocument(itemId); // Get your document, including
>>>                                // contents from your database, no need
>>>                                // even to put them in Lucene, although
>>>                                // you could
>>>     add the doc to a MemoryIndex (see contrib/memory)
>>>     Run your 4 searches against that memory index to get your score.
>>>     Even better, combine your query into a single query that searches
>>>     all 4 fields at once, then Lucene will combine the score for you
>>> }
>>>
>>> MemoryIndex info can be found at
>>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
>>>
>>> -Grant
>>>
>>> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> Thanks for the response. Here's what I am trying to accomplish:
>>>>
>>>> 1. Iterate over itemID (unique) in the database using one SQL query.
>>>> 2. For every itemID found, run 4 searches on Lucene Index.
>>>> 3. doTagSearch(itemID....) ; collect score
>>>> 4. doTitleSearch(itemID...) ; collect score
>>>> 5. doSummarySearch(itemID...) ; collect score
>>>> 6. doBodySearch(itemID....) ; collect score
>>>>
>>>> These scores are then added and I get a total score for each unique
>>>> item in the database.
>>>>
>>>> Lucene Index has: <summary><contents>
>>>>
>>>> So if I am running a body search, I have 92 hits from over 300
>>>> documents for a query. I already know my hit with the <itemID>.
>>>>
>>>> For instance, from step (1) if itemID 16 is passed to all the 4
>>>> searches, I just need to get the score of the document which has
>>>> itemID field = 16.
>>>> I don't have to iterate over all the hits.
>>>>
>>>> I suppose I have to change my query to look for <contents> where
>>>> itemID=16. Can you guide me as to how to do it?
>>>>
>>>> thanks a ton,
>>>>
>>>> Askar
>>>>
>>>> On 7/25/07, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>
>>>>> Hi Askar,
>>>>>
>>>>> I suggest we take a step back, and ask the question, what are you
>>>>> trying to accomplish? That is, what is your application trying to
>>>>> do? Forget the code, etc. just explain what you want the end result
>>>>> to be and we can work from there. Based on what you have described,
>>>>> I am not sure you need access to the hits. It seems like you just
>>>>> need to make better queries.
>>>>>
>>>>> Is your itemID a unique identifier? If yes, then you shouldn't need
>>>>> to loop over hits at all, as you should only ever have one result IF
>>>>> your query contains a required term. Also, if this is the case, why
>>>>> do you need to do a search at all? Haven't you already identified
>>>>> the items of interest when you did your select query in the
>>>>> database? Or is it that you want to score the item based on some
>>>>> terms as well? If that is the case, there are other ways of doing
>>>>> this and we can discuss them.
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> I need to know how I can use the HitCollector class? I am using
>>>>>> Hits and looping over all the possible document hits (turns out
>>>>>> it's 92 times I am looping; for 300 searches, it's 300*92 !!). Can
>>>>>> I avoid this using HitCollector? I can't seem to understand how
>>>>>> it's used.
>>>>>>
>>>>>> thanks a lot,
>>>>>>
>>>>>> Askar
>>>>>>
>>>>>> On 7/25/07, Dmitry <dmitrytkach1@hotmail.com> wrote:
>>>>>>>
>>>>>>> Askar,
>>>>>>> why do you need to add +id:<idWeCareAbout>?
>>>>>>> thanks,
>>>>>>> dt,
>>>>>>> www.ejinz.com
>>>>>>> search engine news forms
>>>>>>> ----- Original Message -----
>>>>>>> From: "Askar Zaidi" <askar.zaidi@gmail.com>
>>>>>>> To: <java-user@lucene.apache.org>; <nhira@cognocys.com>
>>>>>>> Sent: Wednesday, July 25, 2007 12:39 AM
>>>>>>> Subject: Re: Fine Tuning Lucene implementation
>>>>>>>
>>>>>>>
>>>>>>>> Hey Hira,
>>>>>>>>
>>>>>>>> Thanks so much for the reply. Much appreciated.
>>>>>>>>
>>>>>>>> Quote:
>>>>>>>>
>>>>>>>> Would it be possible to just include a query clause?
>>>>>>>> - i.e., instead of just contents:<userQuery>, also add
>>>>>>>> +id:<idWeCareAbout>
>>>>>>>>
>>>>>>>> How can I do that?
>>>>>>>>
>>>>>>>> I see my query as:
>>>>>>>>
>>>>>>>> +contents:harvard +contents:business +contents:review
>>>>>>>>
>>>>>>>> where the search phrase was: harvard business review
>>>>>>>>
>>>>>>>> Now how can I add +id:<idWeCareAbout> ??
>>>>>>>>
>>>>>>>> This would give me that one exact document I am looking for, for
>>>>>>>> that id. I don't have to iterate through hits.
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> Askar
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/24/07, N. Hira <nhira@cognocys.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm no expert on this (so please accept the comments in that
>>>>>>>>> context) but 2 things seem weird to me:
>>>>>>>>>
>>>>>>>>> 1. Iterating over each hit is an expensive proposition. I've
>>>>>>>>> often seen people recommending a HitCollector.
>>>>>>>>>
>>>>>>>>> 2. It seems that doBodySearch() is essentially saying, do this
>>>>>>>>> search and return the score pertinent to this ID (using an
>>>>>>>>> exhaustive loop).
>>>>>>>>> Would it be possible to just include a query clause?
>>>>>>>>> - i.e., instead of just contents:<userQuery>, also add
>>>>>>>>> +id:<idWeCareAbout>
>>>>>>>>>
>>>>>>>>> In general though, I think your algorithm seems inefficient (if
>>>>>>>>> I understand it correctly): if I want to search for one term
>>>>>>>>> among 3 in a "collection" of 300 documents (as defined by some
>>>>>>>>> external attribute), I will wind up executing 300 x 3 searches,
>>>>>>>>> and for each search that is executed, I will iterate over every
>>>>>>>>> Hit, even if I've already found the one that I "care about".
>>>>>>>>>
>>>>>>>>> What would break if you:
>>>>>>>>> 1. Included "creator" in the Lucene index (or, filtered out the
>>>>>>>>>    Hits using a BitSet or something like it)
>>>>>>>>> 2. Executed 1 search
>>>>>>>>> 3. Collected the results of the first N Hits (where N is some
>>>>>>>>>    reasonable limit, like 100 or 500)
>>>>>>>>>
>>>>>>>>> -h
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
>>>>>>>>>
>>>>>>>>>> Sure.
>>>>>>>>>>
>>>>>>>>>> public float doBodySearch(Searcher searcher, String query, int id) {
>>>>>>>>>>
>>>>>>>>>>     try {
>>>>>>>>>>         score = search(searcher, query, id);
>>>>>>>>>>     }
>>>>>>>>>>     catch (IOException io) {}
>>>>>>>>>>     catch (ParseException pe) {}
>>>>>>>>>>
>>>>>>>>>>     return score;
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> private float search(Searcher searcher, String queryString, int id)
>>>>>>>>>>         throws ParseException, IOException {
>>>>>>>>>>
>>>>>>>>>>     // Build a Query object
>>>>>>>>>>
>>>>>>>>>>     QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
>>>>>>>>>>
>>>>>>>>>>     queryParser.setDefaultOperator(QueryParser.Operator.AND);
>>>>>>>>>>
>>>>>>>>>>     Query query = queryParser.parse(queryString);
>>>>>>>>>>
>>>>>>>>>>     // Search for the query
>>>>>>>>>>
>>>>>>>>>>     Hits hits = searcher.search(query);
>>>>>>>>>>     Document doc = null;
>>>>>>>>>>
>>>>>>>>>>     // Examine the Hits object to see if there were any matches
>>>>>>>>>>     int hitCount = hits.length();
>>>>>>>>>>
>>>>>>>>>>     for (int i = 0; i < hitCount; i++) {
>>>>>>>>>>         doc = hits.doc(i);
>>>>>>>>>>         String str = doc.get("item");
>>>>>>>>>>         int tmp = Integer.parseInt(str);
>>>>>>>>>>         if (tmp == id)
>>>>>>>>>>             score = hits.score(i);
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     return score;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I really need to optimize doBodySearch(...) as this takes the most
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> thanks guys,
>>>>>>>>>> Askar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/24/07, N. Hira <nhira@cognocys.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Could you show us the relevant source from doBodySearch()?
>>>>>>>>>>
>>>>>>>>>> -h
>>>>>>>>>>
>>>>>>>>>> On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
>>>>>>>>>>> I ran some tests and it seems that the slowness is from Lucene
>>>>>>>>>>> calls when I do "doBodySearch", if I remove that call, Lucene
>>>>>>>>>>> gives me results in 5 seconds. otherwise it takes about 50
>>>>>>>>>>> seconds.
>>>>>>>>>>>
>>>>>>>>>>> But I need to do Body search and that field contains lots of
>>>>>>>>>>> text. The field is <contents>. How can I optimize that ?
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>> Askar
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> Center for Natural Language Processing
>>>>> http://www.cnlp.org/tech/lucene.asp
>>>>>
>>>>> Read the Lucene Java FAQ at
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
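
The extra clause asked about in this thread can be written as query-parser text. In the query parser syntax (the #Fields page linked in the thread) a field clause is written field:value, so the required clause would look like +itemID:16 rather than +itemID=id. What follows is only a minimal sketch under assumptions: the index is assumed to have an indexed, untokenized field named itemID (the posted search code reads a stored field called "item", so the real name may differ), and withItemId is a hypothetical helper, not part of Lucene.

```java
// Hypothetical helper (not part of Lucene): builds query text that requires
// both the user's terms and one specific item id.
final class ItemQueryText {

    // Wraps the user's query in a required group, then appends a required
    // clause on the id field. "itemID" is an assumed field name.
    static String withItemId(String userQuery, int itemId) {
        return "+(" + userQuery.trim() + ") +itemID:" + itemId;
    }

    public static void main(String[] args) {
        // Prints: +(Harvard Business Review) +itemID:16
        // The resulting text would be handed to queryParser.parse(...).
        System.out.println(withItemId("Harvard Business Review", 16));
    }
}
```

If itemID really is unique, a search with this text should return at most one hit, so after checking hits.length() > 0 the score would be hits.score(0) with no loop over the other hits. The same restriction could presumably also be built in code, by putting the parsed query and a TermQuery on the id field into a BooleanQuery as required clauses.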