Mailing-List: java-user@lucene.apache.org (contact java-user-help@lucene.apache.org; run by ezmlm)
From: Erik Hatcher
Subject: Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)
Date: Thu, 28 Jul 2005 13:54:42 -0400
To: java-user@lucene.apache.org

On Jul 28, 2005, at 12:37 PM, Chris May wrote:

> Works beautifully (at least on my 30K-document test index). I'll need
> to do some fiddling if I want to allow partial URLs (i.e.
> http://www2.warwick.ac.uk/ab* to match http://www2.warwick.ac.uk/about),
> but I can see how to do that, I think (and I'm not sure I need it
> anyway).
>
> Thanks Scott!
>
> Incidentally, is there an easy way to make QueryParser not treat the
> colon in 'http://' as a term separator? It seems that URLs get broken
> into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before they
> get fed to my custom analyzer. I got round it by just constructing the
> PhraseQuery by hand, but I wonder if there's an easier way?

I'm not sure what string you're passing to QueryParser, but the ':'
denotes a field selector (as in title:lucene). There is no easy way to
make QueryParser treat it differently - it'd be a custom parser at that
point. You can backslash-escape it (\:), but that is probably not
desirable. Or you could pre-process the string from the user before
handing it to QueryParser and escape it under the covers.

	Erik

> Chris
>
> On 28 Jul 2005, at 02:02, Scott Ganyo wrote:
>
>> Chris,
>>
>> How about indexing the domain as one field and each part of the path
>> as separate terms in another field? I'm sure you've probably already
>> thought of doing this... and maybe discarded the idea because you'd
>> lose the position information. However, even though you can't just
>> simply split the URL on '/' and shove it into the field, you can add
>> the position information back into the term and then put it into the
>> field.
>> Then, you would be able to completely ditch the prefix query and
>> still retrieve the documents using the entire, ordered path in (I
>> think) the most efficient way possible.
>>
>> For example:
>>
>> http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/
>>
>> becomes something like (using n/*** to identify the position):
>>
>> domain: www2.warwick.ac.uk
>> path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees, 7/modules, 8/commonlaw
>>
>> And you could search based on any prefix you desired. For example,
>> searching for this:
>>
>> http://www2.warwick.ac.uk/fac/soc/law/*
>>
>> would end up being a Lucene search that looks something like this
>> (note: not query parser syntax!):
>>
>> domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND path: 3/law
>>
>> Does that make sense? Would it work for you?
>>
>> S
>>
>> On Jul 27, 2005, at 3:56 PM, Chris May wrote:
>>
>>> Always domain + part of a path, e.g.
>>>
>>> url:http://blogs.warwick.ac.uk/chrismay/*
>>>
>>> or
>>>
>>> url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*
>>>
>>> or
>>>
>>> url:http://www2.warwick.ac.uk/services/its/*
>>>
>>> ... and so on. Part of the problem is that we may need to go an
>>> arbitrary number of levels down the path to get an acceptably small
>>> set of documents to start from - we couldn't impose a rule that said
>>> something like 'specify the first 2 directories on the path' (cf. my
>>> second example). We wouldn't need to query for the same path over
>>> different domains, though (e.g. url:*.warwick.ac.uk/about/*).
>>>
>>> thanks
>>>
>>> Chris
>>>
>>> On 27 Jul 2005, at 21:33, Erik Hatcher wrote:
>>>
>>>> Could you give some examples of the types of PrefixQuery you'd like
>>>> to use? Is it always at the granularity of domain and path? Or are
>>>> you wanting to do a prefix on pieces of the domain and path?
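Scott's domain-plus-positional-path scheme above can be sketched in plain Java. Note this is only a sketch of the term-building step: UrlPathTerms and its methods are hypothetical helper names, not Lucene API, and a real indexer would add each returned term to the "path" field of the Document.

```java
// Sketch of the indexing side of Scott's scheme: split a URL into a
// domain value and position-prefixed path terms ("1/fac", "2/soc", ...).
// A path-prefix search then becomes a set of required term clauses
// instead of an expanding PrefixQuery. Hypothetical helper, not Lucene API.
import java.util.ArrayList;
import java.util.List;

public class UrlPathTerms {
    static String domain(String url) {
        // Drop the scheme, then keep everything before the first '/'.
        return url.replaceFirst("^[a-z]+://", "").split("/")[0];
    }

    static List<String> pathTerms(String url) {
        String[] parts = url.replaceFirst("^[a-z]+://", "").split("/");
        List<String> terms = new ArrayList<>();
        // parts[0] is the domain; each later segment gets its position.
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].isEmpty()) {
                terms.add(i + "/" + parts[i]);   // e.g. "1/fac"
            }
        }
        return terms;
    }
}
```

A query for http://www2.warwick.ac.uk/fac/soc/law/* would then combine a required term on the domain field with required TermQuery clauses for 1/fac, 2/soc and 3/law, exactly as in Scott's example.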
>>>>
>>>> 	Erik
>>>>
>>>> On Jul 27, 2005, at 3:47 PM, Chris May wrote:
>>>>
>>>>> First, apologies for what seems to be something of an FAQ.
>>>>>
>>>>> However, I've not been able to find an answer either in LIA or in
>>>>> the relevant section of the FAQ
>>>>> (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831)
>>>>>
>>>>> My setup is as follows: I have an index of a few hundred thousand
>>>>> web pages. I'd like to be able to construct queries that search
>>>>> for some arbitrary text within a specified URL. Kind of like
>>>>> Google's syntax:
>>>>>
>>>>> searchterm +site:www.foo.com/some/section
>>>>>
>>>>> So, I have the page title & content indexed, and the URL stored as
>>>>> a keyword field, and I imagined that I'd be able to construct a
>>>>> query something like this:
>>>>>
>>>>> String[] fields = new String[] { DocumentFields.TITLE, DocumentFields.CONTENT };
>>>>> Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
>>>>> PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
>>>>> hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
>>>>>
>>>>> However, as soon as the set of documents returned by the
>>>>> PrefixQuery is more than a thousand or so, I get a
>>>>> BooleanQuery.TooManyClauses exception, as you might expect.
>>>>>
>>>>> AFAICS the solutions suggested in the FAQ don't seem to apply
>>>>> here: I'm already using a Filter, and that's not helping (pace
>>>>> suggestion 1); I don't think I can reduce the number of terms in
>>>>> the index, else my URLs wouldn't be unique any more; and
>>>>> increasing the number of clauses seems like a poor choice from a
>>>>> scalability point of view - I anticipate queries that could filter
>>>>> perhaps a hundred thousand documents or so.
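On the TooManyClauses error Chris describes: wrapping a PrefixQuery in a QueryFilter still rewrites the prefix into one BooleanQuery clause per matching term, and with unique URLs that quickly exceeds the clause limit. A custom Filter can instead enumerate the matching terms once and set a bit per document, with no BooleanQuery involved. This plain-Java sketch shows the idea only; a TreeMap stands in for Lucene's sorted term dictionary (TermEnum/TermDocs), and none of the names are Lucene API.

```java
// Sketch (not Lucene API) of the bit-set approach a custom Filter takes.
// termDocs maps each indexed term to the ids of documents containing it,
// in sorted term order, mimicking the index's term dictionary.
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBits {
    static BitSet prefixFilter(TreeMap<String, int[]> termDocs,
                               String prefix, int numDocs) {
        BitSet bits = new BitSet(numDocs);
        // Seek to the first term >= prefix; because the dictionary is
        // sorted, we can stop at the first term that no longer matches.
        for (Map.Entry<String, int[]> e : termDocs.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            for (int doc : e.getValue()) bits.set(doc);
        }
        return bits;
    }
}
```

Because no clause list is ever built, the number of URLs matching the prefix no longer matters; the cost is one pass over the matching slice of the term dictionary.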
>>>>> I'm guessing that it might be possible to do something smart by
>>>>> splitting the URL up into multiple fields - for example, one for
>>>>> the host and one for the path, or even one for the host and one
>>>>> for host+path together - but I'm not clear on exactly how I'd use
>>>>> the two fields, and how they'd help. Can someone enlighten me?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
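Erik's suggestion earlier in the thread - pre-processing the user's string and escaping the colon "under the covers" before it reaches QueryParser - might look something like this crude sketch. It backslash-escapes only a ':' immediately followed by "//" (the URL scheme separator), so intended field selectors such as url: still work. ColonEscaper is a made-up helper name; later Lucene versions also ship a QueryParser.escape(String) that escapes all special characters wholesale.

```java
// Crude sketch: escape only the scheme colon (':' followed by "//")
// so QueryParser stops treating "http" as a field name, while real
// field selectors like url: are left alone. Hypothetical helper,
// not Lucene API.
public class ColonEscaper {
    static String escapeSchemeColons(String userQuery) {
        // "\\\\:" in the replacement string is a literal backslash + ':'.
        return userQuery.replaceAll(":(?=//)", "\\\\:");
    }
}
```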