lucene-java-user mailing list archives

From Chris May <chris....@warwick.ac.uk>
Subject Searching a URL with a PrefixQuery / Too Many Clauses (again...)
Date Wed, 27 Jul 2005 19:47:01 GMT
First, apologies for what seems to be something of an FAQ.

However, I've not been able to find an answer either in LIA or in the
relevant section of the FAQ
(http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).

My setup is as follows: I have an index of a few hundred thousand web
pages. I'd like to be able to construct queries that search for some
arbitrary text within a specified URL, kind of like Google's syntax:

searchterm +site:www.foo.com/some/section

So, I have the page title and content indexed, and the URL stored as a
keyword field (indexed but not tokenized), and I imagined that I'd be
able to construct a query something like this:

String[] fields = new String[] {DocumentFields.TITLE, DocumentFields.CONTENT};
Query searchTextQuery = MultiFieldQueryParser.parse(
    request.getSearchQuery(), fields, analyzer);
PrefixQuery urlPrefix = new PrefixQuery(
    new Term(DocumentFields.URL, request.getUrlPrefix()));
hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));

However, as soon as the set of documents matched by the PrefixQuery
is more than a thousand or so, I get a BooleanQuery.TooManyClauses
exception, as you might expect.
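
To make the failure concrete, here is a small plain-Java sketch of what
I understand to be happening: the prefix is expanded into one clause per
matching term, and since every URL is a distinct term, a popular prefix
quickly passes BooleanQuery's default limit of 1024 clauses. This is
only a simulation under assumptions: a TreeSet stands in for the index's
sorted term dictionary, and the class name, method names and the
www.bar.com sample URLs are made up for illustration. No Lucene classes
are used.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Simulation only: a TreeSet stands in for the sorted term dictionary.
class PrefixExpansionSketch {
    // Default maximum clause count of Lucene's BooleanQuery.
    static final int MAX_CLAUSES = 1024;

    // Counts the terms that start with the prefix, i.e. the number of
    // clauses the prefix would expand to in this simulation.
    static int countExpandedClauses(SortedSet<String> terms, String prefix) {
        int count = 0;
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            count++;
        }
        return count;
    }

    // True if the expansion would go past the clause limit.
    static boolean exceedsLimit(SortedSet<String> terms, String prefix) {
        return countExpandedClauses(terms, prefix) > MAX_CLAUSES;
    }

    // Builds hypothetical sample data: one unique URL term per page.
    static SortedSet<String> sampleUrls(int sectionPages, int otherPages) {
        SortedSet<String> urls = new TreeSet<String>();
        for (int i = 0; i < sectionPages; i++) {
            urls.add("www.foo.com/some/section/page" + i + ".html");
        }
        for (int i = 0; i < otherPages; i++) {
            urls.add("www.bar.com/page" + i + ".html");
        }
        return urls;
    }

    public static void main(String[] args) {
        SortedSet<String> urls = sampleUrls(2000, 10);
        String prefix = "www.foo.com/some/section";
        System.out.println(countExpandedClauses(urls, prefix) + " clauses, over limit: "
                + exceedsLimit(urls, prefix));
    }
}
```

The point of the sketch is that the clause count grows with the number
of URLs under the prefix, not with the size of the query text.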

AFAICS the solutions suggested in the FAQ don't apply here. I'm
already using a Filter, and that's not helping (pace suggestion 1). I
don't think I can reduce the number of terms in the index, or my URLs
wouldn't be unique any more. And raising the maximum clause count seems
like a poor choice from a scalability point of view: I anticipate
queries that could filter perhaps a hundred thousand documents or so.

I'm guessing that it might be possible to do something smart by
splitting the URL up into multiple fields - for example, one for the
host and one for the path, or even one for the host and one for
host+path together - but I'm not clear on exactly how I'd use the two
fields, and how they'd help. Can someone enlighten me?
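
One way the splitting idea could be taken further, sketched here as an
assumption rather than a known answer: instead of two fields, index
every ancestor prefix of the URL (the host, then the host plus each
successively longer path) as separate untokenized values of one extra
field. A site restriction would then be a single exact term match
instead of a prefix expansion. The class and method names below are
hypothetical, and the sketch is plain Java with no Lucene calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives the prefix terms to index for one URL.
class UrlPrefixTermsSketch {
    // Returns the host followed by the host plus each longer path prefix.
    static List<String> prefixTerms(String url) {
        // Drop a scheme such as "http://" if one is present.
        int schemeEnd = url.indexOf("://");
        if (schemeEnd >= 0) {
            url = url.substring(schemeEnd + 3);
        }
        List<String> terms = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String part : url.split("/")) {
            if (part.isEmpty()) {
                continue; // skip empty segments from doubled slashes
            }
            if (current.length() > 0) {
                current.append('/');
            }
            current.append(part);
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String term : prefixTerms("www.foo.com/some/section/page.html")) {
            System.out.println(term);
        }
    }
}
```

With each returned string indexed as its own value, a search restricted
to www.foo.com/some/section would only need one exact term on that
field, however many pages sit below it. The trade-offs are a larger
index and that a restriction can only end on a path-segment boundary.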

Thanks in advance

Chris




