lucene-java-user mailing list archives

From "Askar Zaidi" <askar.za...@gmail.com>
Subject Re: Fine Tuning Lucene implementation
Date Wed, 25 Jul 2007 05:39:54 GMT
Hey Hira,

Thanks so much for the reply. Much appreciated.

Quote:

Would it be possible to just include a query clause?
   - i.e., instead of just contents:<userQuery>, also add
+id:<idWeCareAbout>

How can I do that?

I see my query as:

+contents:harvard +contents:business +contents:review

where the search phrase was: harvard business review

Now how can I add +id:<idWeCareAbout>?

This would give me the one exact document I am looking for, for that id, so I
wouldn't have to iterate through the hits.

thanks,

Askar
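
One possible way to add the clause, sketched under assumptions: the posted
search() already hands a string to QueryParser, so the required id clause could
be appended to that string before parsing. The helper below is plain Java with
no Lucene dependency. The class name RequiredIdClause and the method
withRequiredId are hypothetical, and the sketch assumes the id field is indexed
as a single untokenized term and that the user phrase contains no unbalanced
parentheses or other query-syntax characters. Note that the posted code reads
the identifier from a field named "item", so that may be the field name to pass
rather than "id".

```java
public final class RequiredIdClause {

    private RequiredIdClause() {}

    /**
     * Wraps the user's phrase in a required group and appends a required
     * clause on the id field, e.g. "+(harvard business review) +id:42".
     * Hypothetical helper: field names are whatever the index actually uses.
     */
    public static String withRequiredId(String userQuery, String idField, int id) {
        String trimmed = userQuery.trim();
        if (trimmed.isEmpty()) {
            // An empty group "+()" would not be a valid query string.
            throw new IllegalArgumentException("userQuery must not be empty");
        }
        return "+(" + trimmed + ") +" + idField + ":" + id;
    }

    public static void main(String[] args) {
        // prints: +(harvard business review) +id:42
        System.out.println(withRequiredId("harvard business review", "id", 42));
    }
}
```

The resulting string could then be passed to queryParser.parse(...) in place of
the raw phrase. With the default operator set to AND, the group should expand to
the three required contents: terms shown above. A programmatic alternative would
be to combine the parsed query and a TermQuery on the id field in a
BooleanQuery, adding each with BooleanClause.Occur.MUST.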



On 7/24/07, N. Hira <nhira@cognocys.com> wrote:
>
> I'm no expert on this (so please accept the comments in that context)
> but 2 things seem weird to me:
>
> 1.  Iterating over each hit is an expensive proposition.  I've often
> seen people recommending a HitCollector.
>
> 2.  It seems that doBodySearch() is essentially saying, do this search
> and return the score pertinent to this ID (using an exhaustive loop).
> Would it be possible to just include a query clause?
>     - i.e., instead of just contents:<userQuery>, also add
> +id:<idWeCareAbout>
>
> In general though, I think your algorithm seems inefficient (if I
> understand it correctly): if I want to search for one term among 3 in
> a "collection" of 300 documents (as defined by some external attribute),
> I will wind up executing 300 x 3 searches, and for each search that is
> executed, I will iterate over every Hit, even if I've already found the
> one that I "care about".
>
> What would break if you:
> 1.  Included "creator" in the Lucene index (or filtered the Hits
> using a BitSet or something like it)
> 2.  Executed 1 search
> 3.  Collected the results of the first N Hits (where N is some
> reasonable limit, like 100 or 500)
>
> -h
>
>
> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
>
> > Sure.
> >
> >  public float doBodySearch(Searcher searcher, String query, int id) {
> >      try {
> >          score = search(searcher, query, id);
> >      }
> >      catch (IOException io) {}
> >      catch (ParseException pe) {}
> >
> >      return score;
> >  }
> >
> >  private float search(Searcher searcher, String queryString, int id)
> >          throws ParseException, IOException {
> >
> >      // Build a Query object
> >      QueryParser queryParser =
> >              new QueryParser("contents", new KeywordAnalyzer());
> >      queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >      Query query = queryParser.parse(queryString);
> >
> >      // Search for the query
> >      Hits hits = searcher.search(query);
> >      Document doc = null;
> >
> >      // Examine the Hits object to see if there were any matches
> >      int hitCount = hits.length();
> >
> >      for (int i = 0; i < hitCount; i++) {
> >          doc = hits.doc(i);
> >          String str = doc.get("item");
> >          int tmp = Integer.parseInt(str);
> >          if (tmp == id)
> >              score = hits.score(i);
> >      }
> >
> >      return score;
> >  }
> >
> > I really need to optimize doBodySearch(...) as this takes the most
> > time.
> >
> > thanks guys,
> > Askar
> >
> >
> > On 7/24/07, N. Hira <nhira@cognocys.com> wrote:
> >
> >         Could you show us the relevant source from doBodySearch()?
> >
> >         -h
> >
> >         On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> >         > I ran some tests and it seems that the slowness is from
> >         > Lucene calls when I do "doBodySearch". If I remove that call,
> >         > Lucene gives me results in 5 seconds; otherwise it takes
> >         > about 50 seconds.
> >         >
> >         > But I need to do Body search and that field contains lots of
> >         > text. The field is <contents>. How can I optimize that?
> >         >
> >         > thanks,
> >         > Askar
> >         >
> >         >
>
>
>
>
