lucene-java-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Lucene performance bottlenecks
Date Wed, 07 Dec 2005 18:40:08 GMT
Paul Elschot wrote:
> Querying the host field like this in a web page index can be dangerous
> business. For example when term1 is "wikipedia" and term2 is "org",
> the query will match at least all pages from wikipedia.org.

Note that if you search for wikipedia.org in Nutch this is interpreted 
as an implicit phrase query and is expanded differently, as:

+(url:"wikipedia org"^4.0
   anchor:"wikipedia org"^2.0
   content:"wikipedia org"
   title:"wikipedia org"^1.5
   host:"wikipedia org"^2.0)
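
As a rough illustration of the expansion above, here is a minimal Java sketch. It is not Nutch's query code: the class PhraseExpansion and the method expand are hypothetical names, and splitting the input on dots is an assumption standing in for whatever tokenization Nutch actually applies. Only the field names and boosts are taken from the expansion shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Nutch or Lucene.
class PhraseExpansion {

    // Turns an input such as "wikipedia.org" into one required clause that
    // holds a phrase query per field. A boost of 1.0 is left implicit.
    static String expand(String input, Map<String, Float> fieldBoosts) {
        // Assumption: the implicit phrase is formed by splitting on dots.
        String phrase = String.join(" ", input.split("\\."));
        StringJoiner clauses = new StringJoiner(" ", "+(", ")");
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            String clause = e.getKey() + ":\"" + phrase + "\"";
            if (e.getValue() != 1.0f) {
                clause += "^" + e.getValue();
            }
            clauses.add(clause);
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // Field order and boosts copied from the expansion quoted above.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("url", 4.0f);
        boosts.put("anchor", 2.0f);
        boosts.put("content", 1.0f);
        boosts.put("title", 1.5f);
        boosts.put("host", 2.0f);
        System.out.println(expand("wikipedia.org", boosts));
    }
}
```

A LinkedHashMap is used only so that the clauses come out in the same order as in the expansion above.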

Note also that Nutch can index N-grams for common terms.  One can thus 
configure things so that the query is instead expanded as:

+(url:"wikipedia-org"^4.0
   anchor:"wikipedia org"^2.0
   content:"wikipedia org"
   title:"wikipedia org"^1.5
   host:"wikipedia-org"^2.0)

where wikipedia-org is a bigram term indexed when org occurs in the url 
or host field.
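
To make the bigram idea concrete, here is a minimal Java sketch of index-time bigram generation. It is an illustration under stated assumptions, not Nutch's analyzer: the class CommonBigrams, the method index, and the rule "emit a hyphen-joined bigram whenever either member of an adjacent pair is a common term" are all hypothetical. Which terms count as common is likewise an input chosen here for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch or Lucene.
class CommonBigrams {

    // Returns every input token in order. After a token, a bigram
    // "left-right" is also emitted if that token or its right-hand
    // neighbour is in the common-term set.
    static List<String> index(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i);
            out.add(cur);
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (common.contains(cur) || common.contains(next)) {
                    out.add(cur + "-" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumption for the example: "org" is configured as a common term.
        Set<String> common = new HashSet<>(Arrays.asList("org"));
        List<String> hostTokens = Arrays.asList("en", "wikipedia", "org");
        System.out.println(index(hostTokens, common));
    }
}
```

With the bigram in the index, a query for the single term wikipedia-org only has to read the postings for that one, much rarer, term instead of intersecting the positions of wikipedia with those of the very frequent org.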

Doug
