From: Chris May
Subject: Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)
Date: Thu, 28 Jul 2005 17:37:53 +0100
To: java-user@lucene.apache.org

Works beautifully (at least on my 30K-document test index). I'll need
to do some fiddling if I want to allow partial URLs (i.e.
http://www2.warwick.ac.uk/ab* to match http://www2.warwick.ac.uk/about),
but I can see how to do that, I think (and I'm not sure I need it
anyway).

Thanks Scott!

Incidentally, is there an easy way to make QueryParser not treat the
colon in 'http://' as a term separator? It seems that URLs get broken
into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before they
get fed to my custom analyzer. I got round it by constructing the
PhraseQuery by hand, but I wonder if there's an easier way?

Chris

On 28 Jul 2005, at 02:02, Scott Ganyo wrote:

> Chris,
>
> How about indexing the domain as one field and each part of the
> path as separate terms in another field? I'm sure you've probably
> already thought of doing this... and maybe discarded the idea
> because you'd lose the position information. However, even though
> you can't just simply split the URL on '/' and shove it into the
> field, you can add the position information back into the term and
> then put it into the field.
> Then, you would be able to completely
> ditch the prefix query and still retrieve the documents using the
> entire, ordered path in (I think) the most efficient way possible.
>
> For example:
>
> http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/
>
> becomes something like (using n/*** to identify the position):
>
> domain: www2.warwick.ac.uk
> path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees,
>       7/modules, 8/commonlaw
>
> And you could search based on any prefix you desired. For example,
> searching for this:
>
> http://www2.warwick.ac.uk/fac/soc/law/*
>
> would end up being a Lucene search that looks something like this
> (note: not query parser syntax!):
>
> domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND
> path: 3/law
>
> Does that make sense? Would it work for you?
>
> S
>
> On Jul 27, 2005, at 3:56 PM, Chris May wrote:
>
>> Always domain + part of a path, e.g.
>>
>> url:http://blogs.warwick.ac.uk/chrismay/*
>>
>> or
>>
>> url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*
>>
>> or
>>
>> url:http://www2.warwick.ac.uk/services/its/*
>>
>> ... and so on. Part of the problem is that we may need to go an
>> arbitrary number of levels down the path to get an acceptably
>> small set of documents to start from - we couldn't impose a rule
>> that said something like 'specify the first 2 directories on the
>> path' (cf. my second example). We wouldn't need to query for the
>> same path over different domains though (e.g.
>> url:*.warwick.ac.uk/about/*).
>>
>> thanks
>>
>> Chris
>>
>> On 27 Jul 2005, at 21:33, Erik Hatcher wrote:
>>
>>> Could you give some examples of the types of PrefixQuerys you'd
>>> like to use? Is it always at a granularity of domain and path?
>>> Or are you wanting to do a prefix on pieces of the domain and path?
>>>
>>> Erik
>>>
>>> On Jul 27, 2005, at 3:47 PM, Chris May wrote:
>>>
>>>> First, apologies for what seems to be something of an FAQ.
>>>>
>>>> However, I've not been able to find an answer either in LIA or
>>>> in the relevant section of the FAQ
>>>> (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).
>>>>
>>>> My setup is as follows: I have an index of a few hundred
>>>> thousand web pages. I'd like to be able to construct queries
>>>> that search for some arbitrary text within a specified URL. Kind
>>>> of like Google's syntax:
>>>>
>>>> searchterm +site:www.foo.com/some/section
>>>>
>>>> So, I have the page title & content indexed, and the URL stored
>>>> as a keyword field, and I imagined that I'd be able to
>>>> construct a query something like this:
>>>>
>>>> String[] fields = new String[] {DocumentFields.TITLE, DocumentFields.CONTENT};
>>>> Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
>>>> PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
>>>> hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
>>>>
>>>> However, as soon as the set of documents returned by the
>>>> PrefixQuery is more than a thousand or so, I get a
>>>> TooManyClauses exception, as you might expect.
>>>>
>>>> AFAICS the solutions suggested in the FAQ don't seem to apply
>>>> here: I'm already using a Filter, and that's not helping (pace
>>>> suggestion 1); I don't think I can reduce the number of terms in
>>>> the index, else my URLs wouldn't be unique any more; and
>>>> increasing the number of clauses seems like a poor choice from a
>>>> scalability point of view - I anticipate queries that could
>>>> filter perhaps a hundred thousand documents or so.
>>>>
>>>> I'm guessing that it might be possible to do something smart by
>>>> splitting the URL up into multiple fields - for example, one for
>>>> the host and one for the path, or even one for the host and one
>>>> for host+path together - but I'm not clear on exactly how I'd
>>>> use the two fields, and how they'd help. Can someone enlighten me?
>>>>
>>>> Thanks in advance
>>>>
>>>> Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
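
For anyone reading this thread later, here is a rough sketch of the
position-tagged path scheme Scott describes in the quoted message. It is
not code from the thread: the class and method names (UrlPathTerms,
domain, pathTerms, requiredClauses) are made up for illustration, and it
deliberately avoids the Lucene API, producing plain "field:term" strings
instead. The assumption is that each string would become one required
term clause in a boolean query, with the same pathTerms output used at
index time to populate the path field.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every name in this class is hypothetical. */
public class UrlPathTerms {

    /** Host part of the URL, intended for the "domain" field. */
    public static String domain(String url) {
        return URI.create(stripWildcard(url)).getHost();
    }

    /** Path segments tagged with their 1-based position, e.g. "1/fac". */
    public static List<String> pathTerms(String url) {
        String path = URI.create(stripWildcard(url)).getPath();
        List<String> terms = new ArrayList<>();
        int position = 1;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading '/'
            }
            terms.add(position + "/" + segment);
            position++;
        }
        return terms;
    }

    /**
     * Clauses for a URL-prefix search, as "field:term" strings.
     * Assumption: each entry maps to one required term clause.
     */
    public static List<String> requiredClauses(String urlPrefix) {
        List<String> clauses = new ArrayList<>();
        clauses.add("domain:" + domain(urlPrefix));
        for (String term : pathTerms(urlPrefix)) {
            clauses.add("path:" + term);
        }
        return clauses;
    }

    /** Drops a single trailing '*' so the prefix parses as a plain URL. */
    private static String stripWildcard(String url) {
        return url.endsWith("*") ? url.substring(0, url.length() - 1) : url;
    }

    public static void main(String[] args) {
        // prints [domain:www2.warwick.ac.uk, path:1/fac, path:2/soc, path:3/law]
        System.out.println(requiredClauses("http://www2.warwick.ac.uk/fac/soc/law/*"));
    }
}
```

The number of clauses grows with the depth of the prefix rather than
with the number of matching documents, which is why this approach should
sidestep the clause limit that the PrefixQuery ran into. Partial-segment
matches such as /ab* are not handled by this sketch.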