From: Chris May <chris.may@warwick.ac.uk>
Subject: Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)
Date: Thu, 28 Jul 2005 17:37:53 +0100
To: java-user@lucene.apache.org

Works beautifully (at least on my 30K-document test index). I'll
need to do some fiddling if I want to allow partial URLs (e.g.
http://www2.warwick.ac.uk/ab* to match
http://www2.warwick.ac.uk/about), but I can see how to do that, I
think (and I'm not sure I need it anyway).

Thanks, Scott!

Incidentally, is there an easy way to make QueryParser not treat the
colon in 'http://' as a term separator? It seems that URLs get broken
into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before
they get fed to my custom analyzer. I got round it by just
constructing the PhraseQuery by hand, but I wonder if there's an
easier way?
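
One possibility, offered as an untested assumption rather than a
confirmed answer: the Lucene query syntax documentation describes
escaping special characters with a backslash, so the colon could be
written as '\:' before the string reaches QueryParser. The helper
below is a hypothetical, plain-Java sketch of that escaping (the class
and method names are made up, and it handles only single-character
specials, not '&&' or '||'). Whether a given QueryParser version and
a custom analyzer then keep the URL in one piece would need checking.

```java
/** Hypothetical helper: backslash-escapes query-syntax characters. */
final class QuerySyntaxEscaper {

    // Single-character specials listed in the query syntax documentation.
    private static final String SPECIAL = "\\+-!(){}[]^\"~*?:";

    /** Returns text with every special character preceded by a backslash. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("http://www2.warwick.ac.uk/about"));
    }
}
```

Note that this also escapes '*' and '-', so it only suits whole URLs,
not URLs carrying a trailing wildcard.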

Chris

On 28 Jul 2005, at 02:02, Scott Ganyo wrote:

> Chris,
>
> How about indexing the domain as one field and each part of the  
> path as separate terms in another field?  I'm sure you've probably  
> already thought of doing this... and maybe discarded the idea  
> because you'd lose the position information.  However, even though  
> you can't just simply split the URL on '/' and shove it into the  
> field, you can add the position information back into the term and  
> then put it into the field.  Then, you would be able to completely  
> ditch the prefix query and still retrieve the documents using the  
> entire, ordered path in (I think) the most efficient way possible.
>
> For example:
>
> http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/
>
> becomes something like (using n/*** to identify the position):
>
> domain: www2.warwick.ac.uk
> path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees,
> 7/modules, 8/commonlaw
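
The split described above could be produced along these lines. This
is a minimal sketch, not code from the thread: the class UrlPathTerms
and its methods are hypothetical names, plain string handling stands
in for whatever analyzer is actually used, and empty path segments
are assumed to be skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a URL into a domain and position-tagged path terms. */
final class UrlPathTerms {

    /** Returns the host part, e.g. "www2.warwick.ac.uk". */
    static String domain(String url) {
        String rest = stripScheme(url);
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    /** Returns terms such as "1/fac", "2/soc"; empty segments are skipped. */
    static List<String> pathTerms(String url) {
        String rest = stripScheme(url);
        int slash = rest.indexOf('/');
        List<String> terms = new ArrayList<>();
        if (slash < 0) {
            return terms; // no path at all
        }
        int position = 1;
        for (String segment : rest.substring(slash + 1).split("/")) {
            if (!segment.isEmpty()) {
                terms.add(position + "/" + segment);
                position++;
            }
        }
        return terms;
    }

    /** Drops a leading "scheme://" if there is one. */
    private static String stripScheme(String url) {
        int i = url.indexOf("://");
        return i < 0 ? url : url.substring(i + 3);
    }

    public static void main(String[] args) {
        String url = "http://www2.warwick.ac.uk/fac/soc/law/ug/"
                + "prospective/degrees/modules/commonlaw/";
        System.out.println("domain: " + domain(url));
        System.out.println("path:   " + pathTerms(url));
    }
}
```

Each returned path term would then be added to the path field as a
single untokenized term, so that "3/law" only matches "law" at the
third level.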
>
> And you could search based on any prefix you desired.  For example  
> searching for this:
>
> http://www2.warwick.ac.uk/fac/soc/law/*
>
> would end up being a Lucene search that looks something like this  
> (note: not query parser syntax!):
>
> domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND  
> path: 3/law
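
The query side could be sketched the same way. The code below is an
illustration under stated assumptions, not Lucene code: the class
UrlPrefixClauses is a hypothetical name, the field names "domain" and
"path" follow the example above, and the prefix is assumed to end on
a segment boundary (".../law/*"), so partial segments such as "/ab*"
are not handled. In Lucene each resulting clause would presumably
become a required TermQuery inside a BooleanQuery.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns a URL prefix into clauses that must all match. */
final class UrlPrefixClauses {

    /** Returns clauses of the form "field:term", in path order. */
    static List<String> requiredClauses(String urlPrefix) {
        String s = urlPrefix;
        if (s.endsWith("*")) {
            s = s.substring(0, s.length() - 1); // drop the trailing wildcard
        }
        int scheme = s.indexOf("://");
        if (scheme >= 0) {
            s = s.substring(scheme + 3); // drop "http://"
        }
        String[] parts = s.split("/");
        List<String> clauses = new ArrayList<>();
        clauses.add("domain:" + parts[0]);
        int position = 1;
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].isEmpty()) {
                clauses.add("path:" + position + "/" + parts[i]);
                position++;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> clauses =
                requiredClauses("http://www2.warwick.ac.uk/fac/soc/law/*");
        System.out.println(String.join(" AND ", clauses));
    }
}
```

The number of clauses grows with the depth of the path, not with the
number of matching documents, which is what keeps the query small.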
>
> Does that make sense?  Would it work for you?
>
> S
>
> On Jul 27, 2005, at 3:56 PM, Chris May wrote:
>
>
>> Always domain + part of a path e.g.
>>
>> url:http://blogs.warwick.ac.uk/chrismay/*
>>
>> or
>>
>> url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*
>>
>> or
>>
>> url:http://www2.warwick.ac.uk/services/its/*
>>
>>
>> ... and so on. Part of the problem is that we may need to go an
>> arbitrary number of levels down the path to get an acceptably
>> small set of documents to start from - we couldn't impose a rule
>> that said something like 'specify the first 2 directories on the
>> path' (cf. my second example). We wouldn't need to query for the
>> same path over different domains though (e.g.
>> url:*.warwick.ac.uk/about/*).
>>
>> thanks
>>
>> Chris
>>
>>
>>
>>
>> On 27 Jul 2005, at 21:33, Erik Hatcher wrote:
>>
>>
>>
>>> Could you give some examples of the types of PrefixQuery you'd
>>> like to use? Is it always at a granularity of domain and path?
>>> Or do you want to prefix-match on pieces of the domain and path?
>>>
>>>     Erik
>>>
>>> On Jul 27, 2005, at 3:47 PM, Chris May wrote:
>>>
>>>
>>>
>>>
>>>> First, apologies for what seems to be something of an FAQ.
>>>>
>>>> However, I've not been able to find an answer either in LIA
>>>> (Lucene in Action) or in the relevant section of the FAQ
>>>> (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).
>>>>
>>>> My setup is as follows: I have an index of a few hundred
>>>> thousand web pages. I'd like to be able to construct queries
>>>> that search for some arbitrary text within a specified URL. Kind
>>>> of like Google's syntax:
>>>>
>>>> searchterm +site:www.foo.com/some/section
>>>>
>>>> So, I have the page title & content indexed, and the URL stored
>>>> as a Keyword field, and I imagined that I'd be able to
>>>> construct a query something like this:
>>>>
>>>> // Parse the user's text against the title and content fields.
>>>> String[] fields = new String[] {DocumentFields.TITLE, DocumentFields.CONTENT};
>>>> Query searchTextQuery =
>>>>     MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
>>>> // Restrict results to documents whose URL starts with the given prefix.
>>>> PrefixQuery urlPrefix =
>>>>     new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
>>>> hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
>>>>
>>>> However, as soon as the set of documents returned by the
>>>> PrefixQuery is more than a thousand or so, I get a
>>>> BooleanQuery.TooManyClauses exception, as you might expect.
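
Why the limit is hit at "a thousand or so": a PrefixQuery is expanded
into one clause per indexed term that starts with the prefix, and
BooleanQuery's default maximum clause count is 1024. With one unique
URL term per document, the clause count equals the number of matching
documents, and wrapping the query in a QueryFilter does not avoid the
expansion. The code below is a toy model of that behaviour in plain
Java, not Lucene code; the class name and the sample URLs are
invented for illustration.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of prefix expansion over a sorted term dictionary. */
final class PrefixExpansionDemo {

    /** Mirrors BooleanQuery's default maximum clause count. */
    static final int MAX_CLAUSES = 1024;

    /** Counts terms starting with prefix; fails once the limit is exceeded. */
    static int expand(SortedSet<String> terms, String prefix) {
        int clauses = 0;
        // Terms sharing a prefix are contiguous in sorted order.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            clauses++;
            if (clauses > MAX_CLAUSES) {
                throw new IllegalStateException(
                        "too many clauses: more than " + MAX_CLAUSES);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> urls = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            urls.add("http://www2.warwick.ac.uk/services/its/page" + i);
        }
        urls.add("http://blogs.warwick.ac.uk/chrismay/entry1");
        System.out.println(expand(urls, "http://blogs.warwick.ac.uk/chrismay/"));
        try {
            expand(urls, "http://www2.warwick.ac.uk/services/its/");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```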
>>>>
>>>> AFAICS the solutions suggested in the FAQ don't seem to apply  
>>>> here: I'm already using a Filter, and that's not helping (pace  
>>>> suggestion 1), I don't think I can reduce the number of terms in  
>>>> the index, else my URLs wouldn't be unique any more, and  
>>>> increasing the number of clauses seems like a poor choice from a  
>>>> scalability point of view - I anticipate queries that could  
>>>> filter perhaps a hundred thousand documents or so.
>>>>
>>>> I'm guessing that it might be possible to do something smart by  
>>>> splitting the URL up into multiple fields - for example, one for  
>>>> the host and one for the path, or even one for the host and one  
>>>> for host+path together - but I'm not clear on exactly how I'd  
>>>> use the two fields, and how they'd help. Can someone enlighten me?
>>>>
>>>> Thanks in advance
>>>>
>>>> Chris
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org