Mailing-List: java-user@lucene.apache.org (contact java-user-help@lucene.apache.org; run by ezmlm)
From: Erik Hatcher
Subject: Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)
Date: Thu, 28 Jul 2005 13:54:42 -0400
To: java-user@lucene.apache.org

On Jul 28, 2005, at 12:37 PM, Chris May wrote:

> Works beautifully (at least on my 30K-document test index). I'll need
> to do some fiddling if I want to allow partial URLs (i.e.
> http://www2.warwick.ac.uk/ab* to match http://www2.warwick.ac.uk/about),
> but I can see how to do that, I think (and I'm not sure I need it
> anyway).
>
> Thanks Scott!
>
> Incidentally, is there an easy way to make QueryParser not treat the
> colon in 'http://' as a term separator? It seems that URLs get broken
> into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before they
> get fed to my custom analyzer. I got round it by just constructing the
> PhraseQuery by hand, but I wonder if there's an easier way?

I'm not sure what string you're passing to QueryParser, but the ':'
denotes a field selector (as in title:lucene). There is no easy way to
make QueryParser treat it differently - it'd be a custom parser at that
point. You can backslash-escape it (\:), but that is probably not
desirable. Or you could pre-process the string from the user before
handing it to QueryParser and escape it under the covers.

	Erik

> Chris
>
> On 28 Jul 2005, at 02:02, Scott Ganyo wrote:
>
>> Chris,
>>
>> How about indexing the domain as one field and each part of the path
>> as separate terms in another field? I'm sure you've probably already
>> thought of doing this... and maybe discarded the idea because you'd
>> lose the position information. However, even though you can't just
>> simply split the URL on '/' and shove it into the field, you can add
>> the position information back into the term and then put it into the
>> field.
>> Then, you would be able to completely ditch the prefix query and
>> still retrieve the documents using the entire, ordered path in (I
>> think) the most efficient way possible.
>>
>> For example:
>>
>> http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/
>>
>> becomes something like (using n/*** to identify the position):
>>
>> domain: www2.warwick.ac.uk
>> path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees, 7/modules, 8/commonlaw
>>
>> And you could search based on any prefix you desired. For example,
>> searching for this:
>>
>> http://www2.warwick.ac.uk/fac/soc/law/*
>>
>> would end up being a Lucene search that looks something like this
>> (note: not query parser syntax!):
>>
>> domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND path: 3/law
>>
>> Does that make sense? Would it work for you?
>>
>> S
>>
>> On Jul 27, 2005, at 3:56 PM, Chris May wrote:
>>
>>> Always domain + part of a path, e.g.
>>>
>>> url:http://blogs.warwick.ac.uk/chrismay/*
>>>
>>> or
>>>
>>> url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*
>>>
>>> or
>>>
>>> url:http://www2.warwick.ac.uk/services/its/*
>>>
>>> ... and so on. Part of the problem is that we may need to go an
>>> arbitrary number of levels down the path to get an acceptably small
>>> set of documents to start from - we couldn't impose a rule that said
>>> something like 'specify the first 2 directories on the path' (cf. my
>>> second example). We wouldn't need to query for the same path over
>>> different domains, though (e.g. url:*.warwick.ac.uk/about/*).
>>>
>>> thanks
>>>
>>> Chris
>>>
>>> On 27 Jul 2005, at 21:33, Erik Hatcher wrote:
>>>
>>>> Could you give some examples of the types of PrefixQuery you'd like
>>>> to use? Is it always at the granularity of domain and path? Or are
>>>> you wanting to do a prefix on pieces of the domain and path?
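Scott's domain-plus-positional-path scheme above can be sketched in plain Java. Note this is only a sketch of the term-building step: UrlPathTerms and its methods are hypothetical helper names, not Lucene API, and a real indexer would add each returned term to the "path" field of the Document.

```java
// Sketch of the indexing side of Scott's scheme: split a URL into a
// domain value and position-prefixed path terms ("1/fac", "2/soc", ...).
// A path-prefix search then becomes a set of required term clauses
// instead of an expanding PrefixQuery. Hypothetical helper, not Lucene API.
import java.util.ArrayList;
import java.util.List;

public class UrlPathTerms {
    static String domain(String url) {
        // Drop the scheme, then keep everything before the first '/'.
        return url.replaceFirst("^[a-z]+://", "").split("/")[0];
    }

    static List<String> pathTerms(String url) {
        String[] parts = url.replaceFirst("^[a-z]+://", "").split("/");
        List<String> terms = new ArrayList<>();
        // parts[0] is the domain; each later segment gets its position.
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].isEmpty()) {
                terms.add(i + "/" + parts[i]);   // e.g. "1/fac"
            }
        }
        return terms;
    }
}
```

A query for http://www2.warwick.ac.uk/fac/soc/law/* would then combine a required term on the domain field with required TermQuery clauses for 1/fac, 2/soc and 3/law, exactly as in Scott's example.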
>>>>
>>>> 	Erik
>>>>
>>>> On Jul 27, 2005, at 3:47 PM, Chris May wrote:
>>>>
>>>>> First, apologies for what seems to be something of an FAQ.
>>>>>
>>>>> However, I've not been able to find an answer either in LIA or in
>>>>> the relevant section of the FAQ
>>>>> (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831)
>>>>>
>>>>> My setup is as follows: I have an index of a few hundred thousand
>>>>> web pages. I'd like to be able to construct queries that search
>>>>> for some arbitrary text within a specified URL. Kind of like
>>>>> Google's syntax:
>>>>>
>>>>> searchterm +site:www.foo.com/some/section
>>>>>
>>>>> So, I have the page title & content indexed, and the URL stored as
>>>>> a keyword field, and I imagined that I'd be able to construct a
>>>>> query something like this:
>>>>>
>>>>> String[] fields = new String[] { DocumentFields.TITLE, DocumentFields.CONTENT };
>>>>> Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
>>>>> PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
>>>>> hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
>>>>>
>>>>> However, as soon as the set of documents returned by the
>>>>> PrefixQuery is more than a thousand or so, I get a
>>>>> BooleanQuery.TooManyClauses exception, as you might expect.
>>>>>
>>>>> AFAICS the solutions suggested in the FAQ don't seem to apply
>>>>> here: I'm already using a Filter, and that's not helping (pace
>>>>> suggestion 1); I don't think I can reduce the number of terms in
>>>>> the index, else my URLs wouldn't be unique any more; and
>>>>> increasing the number of clauses seems like a poor choice from a
>>>>> scalability point of view - I anticipate queries that could filter
>>>>> perhaps a hundred thousand documents or so.
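On the TooManyClauses error Chris describes: wrapping a PrefixQuery in a QueryFilter still rewrites the prefix into one BooleanQuery clause per matching term, and with unique URLs that quickly exceeds the clause limit. A custom Filter can instead enumerate the matching terms once and set a bit per document, with no BooleanQuery involved. This plain-Java sketch shows the idea only; a TreeMap stands in for Lucene's sorted term dictionary (TermEnum/TermDocs), and none of the names are Lucene API.

```java
// Sketch (not Lucene API) of the bit-set approach a custom Filter takes.
// termDocs maps each indexed term to the ids of documents containing it,
// in sorted term order, mimicking the index's term dictionary.
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBits {
    static BitSet prefixFilter(TreeMap<String, int[]> termDocs,
                               String prefix, int numDocs) {
        BitSet bits = new BitSet(numDocs);
        // Seek to the first term >= prefix; because the dictionary is
        // sorted, we can stop at the first term that no longer matches.
        for (Map.Entry<String, int[]> e : termDocs.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            for (int doc : e.getValue()) bits.set(doc);
        }
        return bits;
    }
}
```

Because no clause list is ever built, the number of URLs matching the prefix no longer matters; the cost is one pass over the matching slice of the term dictionary.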
>>>>> I'm guessing that it might be possible to do something smart by
>>>>> splitting the URL up into multiple fields - for example, one for
>>>>> the host and one for the path, or even one for the host and one
>>>>> for host+path together - but I'm not clear on exactly how I'd use
>>>>> the two fields, and how they'd help. Can someone enlighten me?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
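Erik's suggestion earlier in the thread - pre-processing the user's string and escaping the colon "under the covers" before it reaches QueryParser - might look something like this crude sketch. It backslash-escapes only a ':' immediately followed by "//" (the URL scheme separator), so intended field selectors such as url: still work. ColonEscaper is a made-up helper name; later Lucene versions also ship a QueryParser.escape(String) that escapes all special characters wholesale.

```java
// Crude sketch: escape only the scheme colon (':' followed by "//")
// so QueryParser stops treating "http" as a field name, while real
// field selectors like url: are left alone. Hypothetical helper,
// not Lucene API.
public class ColonEscaper {
    static String escapeSchemeColons(String userQuery) {
        // "\\\\:" in the replacement string is a literal backslash + ':'.
        return userQuery.replaceAll(":(?=//)", "\\\\:");
    }
}
```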