lucene-java-user mailing list archives

From Chris May <chris....@warwick.ac.uk>
Subject Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)
Date Wed, 27 Jul 2005 20:56:24 GMT
Always domain + part of a path e.g.

url:http://blogs.warwick.ac.uk/chrismay/*

or

url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*

or

url:http://www2.warwick.ac.uk/services/its/*


... and so on. Part of the problem is that we may need to go an
arbitrary number of levels down the path to get an acceptably small
set of documents to start from, so we couldn't impose a rule like
'specify the first 2 directories on the path' (cf. my second example).
We wouldn't need to query for the same path across different domains,
though (e.g. url:*.warwick.ac.uk/about/*).
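
To make the multi-field idea from my original message a bit more concrete,
here is a rough sketch of one way it might work: at index time, expand each
URL into every directory-level prefix, and store each prefix as an
untokenized value in a second field. A query for "everything under this
path" would then be a single exact-term lookup on that field instead of a
PrefixQuery expansion. This is only an assumption about how it could be
done, not something I've tested against Lucene. The class name UrlPrefixes,
the method prefixesOf, and the sample page URL are all made up for
illustration, the Lucene indexing step is deliberately left out, and it
assumes the requested prefix always ends at a '/'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): expands a URL into its
// directory-level prefixes, each ending in '/'.
public class UrlPrefixes {

    // Returns the URL cut off after every '/' that follows the host part.
    // A URL with no '/' after the host yields an empty list.
    public static List<String> prefixesOf(String url) {
        List<String> prefixes = new ArrayList<>();
        int schemeEnd = url.indexOf("://");
        // Skip past "scheme://" so its slashes are not treated as path separators.
        int from = (schemeEnd < 0) ? 0 : schemeEnd + 3;
        int slash = url.indexOf('/', from);
        while (slash >= 0) {
            prefixes.add(url.substring(0, slash + 1));
            slash = url.indexOf('/', slash + 1);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Sample URL is an assumption; only the /services/its/ part appears in this thread.
        String url = "http://www2.warwick.ac.uk/services/its/email/index.html";
        for (String prefix : prefixesOf(url)) {
            // At index time, each of these would become one exact-match term
            // in a second, hypothetical field (say "urlPrefix") of the document.
            System.out.println(prefix);
        }
    }
}
```

The trade-off would be a larger index (one extra term per directory level
per page) in exchange for a filter that never has to enumerate matching
URLs, whatever the depth of the requested prefix.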

thanks

Chris




On 27 Jul 2005, at 21:33, Erik Hatcher wrote:

> Could you give some examples of the types of PrefixQuery you'd
> like to use? Is it always at a granularity of domain and path?
> Or do you want to prefix-match on pieces of the domain and path?
>
>     Erik
>
> On Jul 27, 2005, at 3:47 PM, Chris May wrote:
>
>
>> First, apologies for what seems to be something of an FAQ.
>>
>> However, I've not been able to find an answer either in LIA or in
>> the relevant section of the FAQ
>> (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).
>>
>> My setup is as follows: I have an index of a few hundred thousand
>> web pages. I'd like to be able to construct queries that search
>> for some arbitrary text within a specified URL, a bit like
>> Google's syntax:
>>
>> searchterm +site:www.foo.com/some/section
>>
>> So, I have the page title & content indexed, and the URL stored as
>> a keyword field, and I imagined that I'd be able to construct a
>> query something like this:
>>
>> String[] fields = new String[] {DocumentFields.TITLE, DocumentFields.CONTENT};
>> Query searchTextQuery = MultiFieldQueryParser.parse(
>>         request.getSearchQuery(), fields, analyzer);
>> PrefixQuery urlPrefix = new PrefixQuery(
>>         new Term(DocumentFields.URL, request.getUrlPrefix()));
>> hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
>>
>> However, as soon as the set of documents returned by the
>> PrefixQuery is more than a thousand or so, I get a
>> BooleanQuery.TooManyClauses exception, as you might expect.
>>
>> AFAICS the solutions suggested in the FAQ don't seem to apply
>> here. I'm already using a Filter, and that's not helping (pace
>> suggestion 1). I don't think I can reduce the number of terms in
>> the index, or my URLs wouldn't be unique any more. And
>> increasing the maximum number of clauses seems like a poor choice
>> from a scalability point of view: I anticipate queries that could
>> filter perhaps a hundred thousand documents or so.
>>
>> I'm guessing that it might be possible to do something smart by  
>> splitting the URL up into multiple fields - for example, one for  
>> the host and one for the path, or even one for the host and one  
>> for host+path together - but I'm not clear on exactly how I'd use  
>> the two fields, and how they'd help. Can someone enlighten me?
>>
>> Thanks in advance
>>
>> Chris
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>