From: Grant Ingersoll <grant.ingersoll@gmail.com>
Subject: Re: Fine Tuning Lucene implementation
Date: Wed, 25 Jul 2007 12:34:55 -0400
To: java-user@lucene.apache.org

Yes, you can do that.
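
For what it's worth, here is a minimal sketch of appending that clause as a query string. It assumes the ID is indexed as a single untokenized term in a field named itemID (the field name is taken from your description; the class and method names are made up for illustration and are not part of Lucene). Query parser syntax separates field and value with a colon, so the clause is written +itemID:16 rather than with "=".

```java
// Hypothetical helper (not part of Lucene): appends a required itemID clause
// to an existing query string in Lucene query parser syntax.
final class ItemQueryStrings {
    private ItemQueryStrings() {}

    /** Wraps the user's query in a required group and adds a required itemID term. */
    static String restrictToItem(String userQuery, int itemId) {
        // Parenthesize the user's query so the added clause cannot change how
        // its own operators group, then require both parts with '+'.
        return "+(" + userQuery.trim() + ") +itemID:" + itemId;
    }

    public static void main(String[] args) {
        // The resulting string is what would be handed to QueryParser.parse(...).
        System.out.println(restrictToItem("harvard business review", 16));
    }
}
```

With the default operator set to AND, as in the search() code further down the thread, the parsed query should require all three contents terms plus the itemID term, so at most one document comes back and there is nothing to loop over.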


On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:

> Here's what I mean:
>
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields
>
> title:"The Right Way" AND text:go
>
>
> Although I am not searching for the title "the right way", I am looking
> for the score by specifying a unique field (itemID).
>
> When I do System.out.println(query);
>
> I get:
>
> +contents:Harvard +contents:Business +contents:Review
>
> Can I just add:
>
> +contents:Harvard +contents:Business +contents:Review +itemID=id       ??
>
> That query would just return one document.
>
> On 7/25/07, Askar Zaidi <askar.zaidi@gmail.com> wrote:
>>
>> Instead of refactoring the code, would there be a way to just modify the
>> query in each search routine?
>>
>> Such as "search contents:<text> and item:<itemID>". This means it would
>> just collect the score of that one document whose itemID field = itemID
>> passed from while (rs.next()).
>>
>> I just need to collect the score of the <itemID> already in the index.
>>
>> Would there be a way to modify the query? Add a clause?
>>
>> thanks,
>> Askar
>>
>>
>> On 7/25/07, Grant Ingersoll <grant.ingersoll@gmail.com> wrote:
>>>
>>> So, you really want a single Lucene score (based on the scores of
>>> your 4 fields) for every itemID, correct?  And this score consists of
>>> scoring the title, tag, summary and body against some keywords, correct?
>>>
>>> Here's what I would do:
>>>
>>> while (rs.next())
>>> {
>>>      // Get your document, including contents, from your database;
>>>      // there is no need even to put them in Lucene, although you could.
>>>      doc = getDocument(itemId);
>>>      // Add the doc to a MemoryIndex (see contrib/memory).
>>>      // Run your 4 searches against that memory index to get your score.
>>>      // Even better, combine your query into a single query that searches
>>>      // all 4 fields at once; then Lucene will combine the score for you.
>>> }
>>>
>>> MemoryIndex info can be found at
>>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
>>>
>>> -Grant
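
A rough sketch of the single-query idea, written as a query string so it does not depend on any particular Lucene version. The field names are the ones from the index layout described in this thread (itemID, tags, title, summary, contents); the class and method names are made up for illustration. Note that the combined score Lucene reports may not equal the sum of four separate searches, since normalization and coordination factors differ between the two approaches.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): builds one query string that
// scores the same keywords against several fields of a single item.
final class MultiFieldQueryStrings {
    private MultiFieldQueryStrings() {}

    static String allFieldsForItem(List<String> fields, String keywords, int itemId) {
        StringBuilder sb = new StringBuilder("+itemID:").append(itemId).append(" +(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // field:(...) applies the field to every term inside the parentheses.
            sb.append(fields.get(i)).append(":(").append(keywords.trim()).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("tags", "title", "summary", "contents");
        System.out.println(allFieldsForItem(fields, "harvard business review", 16));
    }
}
```

Whether each field group ends up required or optional depends on the parser's default operator: with OR, any one field matching is enough; with AND, every field has to match.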
>>>
>>> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> Thanks for the response. Here's what I am trying to accomplish:
>>>>
>>>> 1. Iterate over itemID (unique) in the database using one SQL query.
>>>> 2. For every itemID found, run 4 searches on the Lucene index.
>>>> 3. doTagSearch(itemID....) ; collect score
>>>> 4. doTitleSearch(itemID...) ; collect score
>>>> 5. doSummarySearch(itemID...) ; collect score
>>>> 6. doBodySearch(itemID....) ; collect score
>>>>
>>>> These scores are then added and I get a total score for each unique item in
>>>> the database.
>>>>
>>>> The Lucene index has: <itemID><tags><title><summary><contents>
>>>>
>>>> So if I am running a body search, I have 92 hits from over 300 documents for
>>>> a query. I already know my hit by its <itemID>.
>>>>
>>>> For instance, from step (1), if itemID 16 is passed to all 4 searches, I
>>>> just need to get the score of the document which has itemID field = 16. I
>>>> don't have to iterate over all the hits.
>>>>
>>>> I suppose I have to change my query to look for <contents> where itemID=16.
>>>> Can you guide me as to how to do it?
>>>>
>>>> thanks a ton,
>>>>
>>>> Askar
>>>>
>>>> On 7/25/07, Grant Ingersoll <gsingers@apache.org > wrote:
>>>>>
>>>>> Hi Askar,
>>>>>
>>>>> I suggest we take a step back, and ask the question, what are you
>>>>> trying to accomplish?  That is, what is your application trying to
>>>>> do?  Forget the code, etc.; just explain what you want the end result
>>>>> to be and we can work from there.  Based on what you have described,
>>>>> I am not sure you need access to the hits.  It seems like you just
>>>>> need to make better queries.
>>>>>
>>>>> Is your itemID a unique identifier?  If yes, then you shouldn't need
>>>>> to loop over hits at all, as you should only ever have one result IF
>>>>> your query contains a required term.  Also, if this is the case, why
>>>>> do you need to do a search at all?  Haven't you already identified
>>>>> the items of interest when you did your select query in the
>>>>> database?  Or is it that you want to score the item based on some
>>>>> terms as well?  If that is the case, there are other ways of doing
>>>>> this and we can discuss them.
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> I need to know how I can use the HitCollector class. I am using Hits and
>>>>>> looping over all the possible document hits (turns out it's 92 times I am
>>>>>> looping; for 300 searches, it's 300*92!). Can I avoid this using
>>>>>> HitCollector? I can't seem to understand how it's used.
>>>>>>
>>>>>> thanks a lot,
>>>>>>
>>>>>> Askar
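
As an illustration of the callback pattern behind HitCollector: in Lucene 2.x it is an abstract class whose collect(int doc, float score) method is called once per matching document, where doc is Lucene's internal document number rather than an application-level ID. The sketch below uses a stand-in interface and a fake search loop, both hypothetical, so that it runs without Lucene on the classpath.

```java
// Stand-in for Lucene's HitCollector callback (hypothetical; not the real class).
interface ScoreCallback {
    void collect(int doc, float score);
}

// Remembers the score of one target document and ignores all others.
final class SingleDocScore implements ScoreCallback {
    private final int targetDoc;
    private float score = 0.0f;    // stays 0 if the target never matches
    private boolean found = false;

    SingleDocScore(int targetDoc) {
        this.targetDoc = targetDoc;
    }

    @Override
    public void collect(int doc, float score) {
        if (doc == targetDoc) {
            this.score = score;
            this.found = true;
        }
    }

    boolean found() { return found; }
    float score() { return score; }
}

final class CollectorDemo {
    // Fake "search": reports every (doc, score) pair to the callback, the way
    // a searcher would report each match.
    static void fakeSearch(int[] docs, float[] scores, ScoreCallback callback) {
        for (int i = 0; i < docs.length; i++) {
            callback.collect(docs[i], scores[i]);
        }
    }

    public static void main(String[] args) {
        SingleDocScore collector = new SingleDocScore(7);
        fakeSearch(new int[] {3, 7, 12}, new float[] {0.9f, 0.4f, 0.1f}, collector);
        System.out.println(collector.found() + " " + collector.score());
    }
}
```

A collector like this still sees every match, so it saves loading each stored Document but not the matching work itself; restricting the query by itemID avoids both.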
>>>>>>
>>>>>> On 7/25/07, Dmitry <dmitrytkach1@hotmail.com> wrote:
>>>>>>>
>>>>>>> Askar,
>>>>>>> why do you need to add +id:<idWeCareAbout>?
>>>>>>> thanks,
>>>>>>> dt,
>>>>>>> www.ejinz.com
>>>>>>> search engine news forms
>>>>>>> ----- Original Message -----
>>>>>>> From: "Askar Zaidi" <askar.zaidi@gmail.com >
>>>>>>> To: <java-user@lucene.apache.org>; <nhira@cognocys.com>
>>>>>>> Sent: Wednesday, July 25, 2007 12:39 AM
>>>>>>> Subject: Re: Fine Tuning Lucene implementation
>>>>>>>
>>>>>>>
>>>>>>>> Hey Hira ,
>>>>>>>>
>>>>>>>> Thanks so much for the reply. Much appreciated.
>>>>>>>>
>>>>>>>> Quote:
>>>>>>>>
>>>>>>>> Would it be possible to just include a query clause?
>>>>>>>>   - i.e., instead of just contents:<userQuery>, also add +id:<idWeCareAbout>
>>>>>>>>
>>>>>>>> How can I do that?
>>>>>>>>
>>>>>>>> I see my query as:
>>>>>>>>
>>>>>>>> +contents:harvard +contents:business +contents:review
>>>>>>>>
>>>>>>>> where the search phrase was: harvard business review
>>>>>>>>
>>>>>>>> Now how can I add +id:<idWeCareAbout>?
>>>>>>>>
>>>>>>>> This would give me the one exact document I am looking for, for that id. I
>>>>>>>> don't have to iterate through hits.
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> Askar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/24/07, N. Hira < nhira@cognocys.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm no expert on this (so please accept the comments in that context)
>>>>>>>>> but 2 things seem weird to me:
>>>>>>>>>
>>>>>>>>> 1.  Iterating over each hit is an expensive proposition.  I've often
>>>>>>>>> seen people recommending a HitCollector.
>>>>>>>>>
>>>>>>>>> 2.  It seems that doBodySearch() is essentially saying, do this search
>>>>>>>>> and return the score pertinent to this ID (using an exhaustive loop).
>>>>>>>>> Would it be possible to just include a query clause?
>>>>>>>>>     - i.e., instead of just contents:<userQuery>, also add +id:<idWeCareAbout>
>>>>>>>>>
>>>>>>>>> In general though, I think your algorithm seems inefficient (if I
>>>>>>>>> understand it correctly): if I want to search for one term among 3 in
>>>>>>>>> a "collection" of 300 documents (as defined by some external attribute),
>>>>>>>>> I will wind up executing 300 x 3 searches, and for each search that is
>>>>>>>>> executed, I will iterate over every Hit, even if I've already found the
>>>>>>>>> one that I "care about".
>>>>>>>>>
>>>>>>>>> What would break if you:
>>>>>>>>> 1.  Included "creator" in the Lucene index (or filtered out the Hits
>>>>>>>>> using a BitSet or something like it)
>>>>>>>>> 2.  Executed 1 search
>>>>>>>>> 3.  Collected the results of the first N Hits (where N is some
>>>>>>>>> reasonable limit, like 100 or 500)
>>>>>>>>>
>>>>>>>>> -h
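
The "execute 1 search" idea above can be sketched as a single pass that records each hit's score by item ID, followed by constant-time lookups from the database loop. The Hit class below is a made-up stand-in for whatever the search returns; nothing here is Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

final class ScoresByItem {
    // Made-up stand-in for one search hit: the stored item ID and its score.
    static final class Hit {
        final int itemId;
        final float score;

        Hit(int itemId, float score) {
            this.itemId = itemId;
            this.score = score;
        }
    }

    // One pass over the hits of a single search.
    static Map<Integer, Float> index(Hit[] hits) {
        Map<Integer, Float> scores = new HashMap<Integer, Float>();
        for (Hit hit : hits) {
            scores.put(hit.itemId, hit.score);
        }
        return scores;
    }

    // Items that did not match the query get a score of 0.
    static float scoreFor(Map<Integer, Float> scores, int itemId) {
        Float s = scores.get(itemId);
        return s == null ? 0.0f : s;
    }

    public static void main(String[] args) {
        Hit[] hits = { new Hit(16, 0.75f), new Hit(42, 0.5f) };
        Map<Integer, Float> scores = index(hits);
        // In the real loop these IDs would come from rs.next().
        for (int itemId : new int[] {16, 17, 42}) {
            System.out.println(itemId + " -> " + scoreFor(scores, itemId));
        }
    }
}
```

With the numbers mentioned in this thread, that is roughly 92 map insertions plus 300 lookups, instead of 300 searches that each walk 92 hits.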
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
>>>>>>>>>
>>>>>>>>>> Sure.
>>>>>>>>>>
>>>>>>>>>>  public float doBodySearch(Searcher searcher, String query, int id) {
>>>>>>>>>>      try {
>>>>>>>>>>          score = search(searcher, query, id);
>>>>>>>>>>      }
>>>>>>>>>>      catch (IOException io) {}
>>>>>>>>>>      catch (ParseException pe) {}
>>>>>>>>>>
>>>>>>>>>>      return score;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  private float search(Searcher searcher, String queryString, int id)
>>>>>>>>>>          throws ParseException, IOException {
>>>>>>>>>>
>>>>>>>>>>      // Build a Query object
>>>>>>>>>>      QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
>>>>>>>>>>      queryParser.setDefaultOperator(QueryParser.Operator.AND);
>>>>>>>>>>      Query query = queryParser.parse(queryString);
>>>>>>>>>>
>>>>>>>>>>      // Search for the query
>>>>>>>>>>      Hits hits = searcher.search(query);
>>>>>>>>>>      Document doc = null;
>>>>>>>>>>
>>>>>>>>>>      // Examine the Hits object to see if there were any matches
>>>>>>>>>>      int hitCount = hits.length();
>>>>>>>>>>
>>>>>>>>>>      for (int i = 0; i < hitCount; i++) {
>>>>>>>>>>          doc = hits.doc(i);
>>>>>>>>>>          String str = doc.get("item");
>>>>>>>>>>          int tmp = Integer.parseInt(str);
>>>>>>>>>>          if (tmp == id)
>>>>>>>>>>              score = hits.score(i);
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>>      return score;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> I really need to optimize doBodySearch(...) as this takes the most
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> thanks guys,
>>>>>>>>>> Askar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/24/07, N. Hira <nhira@cognocys.com> wrote:
>>>>>>>>>>
>>>>>>>>>>         Could you show us the relevant source from doBodySearch()?
>>>>>>>>>>
>>>>>>>>>>         -h
>>>>>>>>>>
>>>>>>>>>>         On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
>>>>>>>>>>> I ran some tests and it seems that the slowness is from Lucene calls when I
>>>>>>>>>>> do "doBodySearch"; if I remove that call, Lucene gives me results in 5
>>>>>>>>>>> seconds. Otherwise it takes about 50 seconds.
>>>>>>>>>>>
>>>>>>>>>>> But I need to do Body search and that field contains lots of text. The field
>>>>>>>>>>> is <contents>. How can I optimize that?
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>> Askar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> Center for Natural Language Processing
>>>>> http://www.cnlp.org/tech/lucene.asp
>>>>>
>>>>> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>> ------------------------------------------------------
>>> Grant Ingersoll
>>> http://www.grantingersoll.com/
>>> http://lucene.grantingersoll.com
>>> http://www.paperoftheweek.com/
>>>
>>>
>>>
>>>
>>>
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org