lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: Fine Tuning Lucene implementation
Date Wed, 25 Jul 2007 16:34:55 GMT
Yes, you can do that.


On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:

> Heres what I mean:
>
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields
>
> title:"The Right Way" AND text:go
>
>
> Although, I am not searching for the title "the right way" , I am  
> looking
> for the score by specifying a unique field (itemID).
>
> when I do System.out.println(query);
>
> I get:
>
> +contents:Harvard +contents:Business + contents: Review
>
> Can I just add:
>
> +contents:Harvard +contents:Business + contents: Review  
> +itemID=id       ??
>
> That query would just return one document.
>
> On 7/25/07, Askar Zaidi <askar.zaidi@gmail.com> wrote:
>>
>> Instead of refactoring the code, would there be a way to just  
>> modify the
>> query in each search routine ?
>>
>> Such as, "search contents:<text> and item:<itemID>"; This means it  
>> would
>> just collect the score of that one document whose itemID field =  
>> itemID
>> passed from while( rs.next()).
>>
>> I just need to collect the score of the <itemID> already in the  
>> index.
>>
>> Would there be a way to modify the query ? Add a clause ?
>>
>> thanks,
>> Askar
>>
>>
>> On 7/25/07, Grant Ingersoll <grant.ingersoll@gmail.com> wrote:
>>>
>>> So, you really want a single Lucene score (based on the scores of
>>> your 4 fields) for every itemID, correct?  And this score  
>>> consists of
>>> scoring the title, tag, summary and body against some keywords  
>>> correct?
>>>
>>> Here's what I would do:
>>>
>>> while (rs.next())
>>> {
>>>      doc = getDocument(itemId);  // Get your document, including
>>> contents from your database, no need even to put them in Lucene,
>>> although you could
>>>      add the doc to a MemoryIndex (see contrib/memory)
>>>      Run your 4 searches against that memory index to get your
>>> score.  Even better, combine your query into a single query that
>>> searches all 4 fields at once, then Lucene will combine the score  
>>> for
>>> you
>>> }
>>>
>>> MemoryIndex info can be found at http://lucene.zones.apache.org: 
>>> 8080/
>>> hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/
>>> package-summary.html
>>>
>>> -Grant
>>>
>>> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> Thanks for the response. Heres what I am trying to accomplish:
>>>>
>>>> 1. Iterate over itemID (unique) in the database using one SQL  
>>>> query.
>>>> 2. For every itemID found, run 4 searches on Lucene Index.
>>>> 3. doTagSearch(itemID....) ; collect score
>>>> 4. doTitleSearch(itemID...) ; collect score
>>>> 5. doSummarySearch(itemID...) ; collect score
>>>> 6. doBodySearch(itemID....) ; collect score
>>>>
>>>> These scores are then added and I get a total score for each unique
>>>> item in
>>>> the database.
>>>>
>>>> Lucene Index has: <itemID><tags><title><summary><contents>
>>>>
>>>> So if I am running a body search, I have 92 hits from over 300
>>>> documents for
>>>> a query. I already know my hit with the <itemID> .
>>>>
>>>> For instance, from step (1) if itemID 16 is passed to all the 4
>>>> searches, I
>>>> just need to get the score of the document which has itemID field =
>>>> 16. I
>>>> don't have to iterate over all the hits.
>>>>
>>>> I suppose I have to change my query to look for <contents> where
>>>> itemID=16.
>>>> Can you guide me as to how to do it ?
>>>>
>>>> thanks a ton,
>>>>
>>>> Askar
>>>>
>>>> On 7/25/07, Grant Ingersoll <gsingers@apache.org > wrote:
>>>>>
>>>>> Hi Askar,
>>>>>
>>>>> I suggest we take a step back, and ask the question, what are you
>>>>> trying to accomplish?  That is, what is your application trying to
>>>>> do?  Forget the code, etc. just explain what you want the end  
>>>>> result
>>>>> to be and we can work from there.   Based on what you have  
>>>>> described,
>>>>> I am not sure you need access to the hits.  It seems like you just
>>>>> need to make better queries.
>>>>>
>>>>> Is your itemID a unique identifier?  If yes, then you shouldn't  
>>>>> need
>>>>> to loop over hits at all, as you should only ever have one  
>>>>> result IF
>>>>> your query contains a required term.  Also, if this is the  
>>>>> case, why
>>>>> do you need to do a search at all?  Haven't you already identified
>>>>> the items of interest when you did your select query in the
>>>>> database?  Or is it that you want to score the item based on some
>>>>> terms as well.  If that is the case, there are other ways of doing
>>>>> this and we can discuss them.
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> I need to know how I can use the HitCollector class ? I am using
>>>>>> Hits and
>>>>>> looping over all the possible document hits (turns out its 92  
>>>>>> times
>>>>>> I am
>>>>>> looping; for 300 searches, its 300*92 !!). Can I avoid this using
>>>>>> HitCollector ? I can't seem to understand how its used.
>>>>>>
>>>>>> thanks a lot,
>>>>>>
>>>>>> Askar
>>>>>>
>>>>>> On 7/25/07, Dmitry <dmitrytkach1@hotmail.com> wrote:
>>>>>>>
>>>>>>> Askar,
>>>>>>> why do you need to add +id:<idWeCareAbout>?
>>>>>>> thanks,
>>>>>>> dt,
>>>>>>> www.ejinz.com
>>>>>>> search engine news forms
>>>>>>> ----- Original Message -----
>>>>>>> From: "Askar Zaidi" <askar.zaidi@gmail.com >
>>>>>>> To: <java-user@lucene.apache.org>; <nhira@cognocys.com>
>>>>>>> Sent: Wednesday, July 25, 2007 12:39 AM
>>>>>>> Subject: Re: Fine Tuning Lucene implementation
>>>>>>>
>>>>>>>
>>>>>>>> Hey Hira ,
>>>>>>>>
>>>>>>>> Thanks so much for the reply. Much appreciate it.
>>>>>>>>
>>>>>>>> Quote:
>>>>>>>>
>>>>>>>> Would it be possible to just include a query clause?
>>>>>>>>   - i.e., instead of just contents:<userQuery>, also
add
>>>>>>>> +id:<idWeCareAbout>
>>>>>>>>
>>>>>>>> How can I do that ?
>>>>>>>>
>>>>>>>> I see my query as :
>>>>>>>>
>>>>>>>> +contents:harvard +contents:business +contents:review
>>>>>>>>
>>>>>>>> where the search phrase was: harvard business review
>>>>>>>>
>>>>>>>> Now how can I add +id:<idWeCareAbout>  ??
>>>>>>>>
>>>>>>>> This would give me that one exact document I am looking 

>>>>>>>> for , for
>>>>>>>> that
>>>>>>> id.
>>>>>>>> I
>>>>>>>> don't have to iterate through hits.
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> Askar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/24/07, N. Hira < nhira@cognocys.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm no expert on this (so please accept the comments
in that
>>>>>>>>> context)
>>>>>>>>> but 2 things seem weird to me:
>>>>>>>>>
>>>>>>>>> 1.  Iterating over each hit is an expensive proposition.
 I've
>>>>>>>>> often
>>>>>>>>> seen people recommending a HitCollector.
>>>>>>>>>
>>>>>>>>> 2.  It seems that doBodySearch() is essentially saying,
do  
>>>>>>>>> this
>>>>>>>>> search
>>>>>>>>> and return the score pertinent to this ID (using an exhaustive
>>>>>>>>> loop).
>>>>>>>>> Would it be possible to just include a query clause?
>>>>>>>>>     - i.e., instead of just contents:<userQuery>,
also add
>>>>>>>>> +id:<idWeCareAbout>
>>>>>>>>>
>>>>>>>>> In general though, I think your algorithm seems inefficient
 
>>>>>>>>> (if I
>>>>>>>>> understand it correctly):-- if I want to search for one
term
>>>>>>>>> among 3 in
>>>>>>>>> a "collection" of 300 documents (as defined by some external
>>>>>>> attribute),
>>>>>>>>> I will wind up executing 300 x 3 searches, and for each
search
>>>>>>>>> that is
>>>>>>>>> executed, I will iterate over every Hit, even if I've
already
>>>>>>>>> found the
>>>>>>>>> one that I "care about".
>>>>>>>>>
>>>>>>>>> What would break if you:
>>>>>>>>> 1.  Included "creator" in the Lucene index (or, filtered
 
>>>>>>>>> out the
>>>>>>>>> Hits
>>>>>>>>> using a BitSet or something like it)
>>>>>>>>> 2.  Executed 1 search
>>>>>>>>> 3.  Collected the results of the first N Hits (where
N is some
>>>>>>>>> reasonable limit, like 100 or 500)
>>>>>>>>>
>>>>>>>>> -h
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
>>>>>>>>>
>>>>>>>>>> Sure.
>>>>>>>>>>
>>>>>>>>>>  public float doBodySearch(Searcher searcher,String
query,  
>>>>>>>>>> int
>>>>>>>>>> id){
>>>>>>>>>>
>>>>>>>>>>                  try{
>>>>>>>>>>                                 score = search(searcher,
>>>>>>>>>> query,id);
>>>>>>>>>>                      }
>>>>>>>>>>                       catch(IOException io){}
>>>>>>>>>>                       catch(ParseException pe){}
>>>>>>>>>>
>>>>>>>>>>                       return score;
>>>>>>>>>>
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>  private float search(Searcher searcher, String queryString,
>>>>>>>>>> int id)
>>>>>>>>>> throws ParseException, IOException {
>>>>>>>>>>
>>>>>>>>>>         // Build a Query object
>>>>>>>>>>
>>>>>>>>>>         QueryParser queryParser = new QueryParser("contents",
>>>>>>>>>> new
>>>>>>>>>> KeywordAnalyzer());
>>>>>>>>>>
>>>>>>>>>>         queryParser.setDefaultOperator
>>>>>>>>>> ( QueryParser.Operator.AND);
>>>>>>>>>>
>>>>>>>>>>         Query query = queryParser.parse(queryString);
>>>>>>>>>>
>>>>>>>>>>         // Search for the query
>>>>>>>>>>
>>>>>>>>>>         Hits hits = searcher.search(query);
>>>>>>>>>>         Document doc = null;
>>>>>>>>>>
>>>>>>>>>>         // Examine the Hits object to see if there
were any
>>>>>>>>>> matches
>>>>>>>>>>         int hitCount = hits.length();
>>>>>>>>>>
>>>>>>>>>>                 for(int i=0;i<hitCount;i++){
>>>>>>>>>>                 doc = hits.doc(i);
>>>>>>>>>>                 String str = doc.get("item");
>>>>>>>>>>                 int tmp = Integer.parseInt (str);
>>>>>>>>>>                 if(tmp==id)
>>>>>>>>>>                 score = hits.score(i);
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>         return score;
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> I really need to optimize doBodySearch(...) as this
takes the
>>>>>>>>>> most
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> thanks guys,
>>>>>>>>>> Askar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/24/07, N. Hira <nhira@cognocys.com> wrote:
>>>>>>>>>>
>>>>>>>>>>         Could you show us the relevant source from
>>>>>>>>>> doBodySearch()?
>>>>>>>>>>
>>>>>>>>>>         -h
>>>>>>>>>>
>>>>>>>>>>         On Tue, 2007-07-24 at 19:58 -0400, Askar
Zaidi wrote:
>>>>>>>>>>> I ran some tests and it seems that the slowness
is from
>>>>>>>>>>         Lucene calls when I
>>>>>>>>>>> do "doBodySearch", if I remove that call, Lucene
gives me
>>>>>>>>>>         results in 5
>>>>>>>>>>> seconds. otherwise it takes about 50 seconds.
>>>>>>>>>>>
>>>>>>>>>>> But I need to do Body search and that field contains
lots
>>>>>>> of
>>>>>>>>>>         text. The field
>>>>>>>>>>> is <contents>. How can I optimize that
?
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>> Askar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----------------------------------------------------------------

>>>>>>> ---
>>>>>>> --
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user- 
>>>>>>> help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> Center for Natural Language Processing
>>>>> http://www.cnlp.org/tech/lucene.asp
>>>>>
>>>>> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/
>>>>> LuceneFAQ
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>
>>> ------------------------------------------------------
>>> Grant Ingersoll
>>> http://www.grantingersoll.com/
>>> http://lucene.grantingersoll.com
>>> http://www.paperoftheweek.com/
>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message