Date: Mon, 5 Dec 2005 18:20:20 -0800 (PST)
From: Chris Hostetter
To: java-dev@lucene.apache.org
Subject: Re: "Advanced" query language

I'm extremely stoked to see this topic come up, but very sad that I didn't have time
to read any Lucene mail this past weekend. I'll have to catch up.

First off...

: Again, we're talking machine-to-machine communication here, not
: human-machine.

: While there have been several different topics brought up on this
: thread, it seems we're diverging from the original idea. Let's
: consider the most basic use case example here, and I'm making it
: intentionally as concrete as possible:
:
: A Swing client performs searches by communicating with a Lucene
: search server, which is wrapped by a RESTful servlet. The client
: wants to issue sophisticated queries that are not supported by
: QueryParser.

While I agree that the number one goal of this discussion is and should be
a "string representation of a query" which can represent any valid query
structure in a reliable manner, there are lots of things that could satisfy
that goal, and the fastest or easiest to implement may not be the best in
the context of other use cases or goals that could also be satisfied
without sacrificing the primary goal.

I can think of at least two big use cases that I'm concerned about...

1) Human creation

While it's true that most "users" just want to type words, the humans I'm
thinking of (and I imagine Paul was thinking of as well) aren't users so
much as they are developers or QA testers who understand the internals of
the index, know about all the different types of Lucene queries, and want
to manually hit that REST servlet with a particular query construct. It's
perfectly fine to expect these humans to type XML, but let's not make it
extra hard on them.

This may map one to one with the internal representation...

contentjava contentlucene contentcoffee

...but so does this, and I know which one I'd rather have to type if I were
trying to debug something...
java lucene coffee

Furthermore, the following may not have a one-to-one correspondence with
the underlying implementation of BooleanQuery, but that doesn't make it a
less correct way of identifying a query, and again, I know which one I'd
rather type if I were debugging...

java lucene coffee

...I guess the gist of my point is: just because the generator/parser is
expected to be a machine doesn't mean we can't also make it easy for
humans; nor should it mean we ignore Huffman encoding.

2) Aliasing

Imagine that we were having this discussion a year or two ago (however long
it's been since Span queries were created) and the result was a new XML
query parser that used reflection to build up objects based on the XML
tags. (Not the silly way where reflection is used every time; the less silly
way where reflection is used the first time, and a Singleton maintains a
Map of XMLTag=>Class mappings for future parsing.)

Lucene users go out and write their REST servlets that use this, and when
their many clients ask them how to generate a query, they give them
examples like this...

word1 word2 word1 word2

...and everybody is happy.

The Lucene user is a little annoyed one day when he realizes that setting
slop="5" generally gives "better" results, because there's nothing he can
change on his servlet to make this change for everyone; he's got to contact
all of his clients and suggest that they make the change.

And then one day SpanQuery gets written, and our Lucene user is drooling,
but he's faced with the same problem: he has to tell all of his clients
that he recommends using a new query structure, because there's no easy way
to drop in SpanNearQuery instead of PhraseQuery. He briefly considers
recompiling the Lucene jar after changing PhraseQuery to be a thin
subclass/wrapper around SpanNearQuery, but then smacks himself silly when
he realizes how hard it would be to maintain that in the future.
The moral of this story being that even though it's important to have
something that can (and by default *does*) provide a one-to-one mapping of
Query<->XML, this mapping shouldn't be implemented in such a way that it
prevents people from using it in more complex ways.

The idea I've been rolling around in the back of my head for a few weeks
is to have a generic parsing framework, which can be used with registered
"handlers" to parse XML into Queries. Each handler is tied to a specific
XML tag, can pass state information along as child elements are parsed,
and can take action based on state information passed back up after the
child elements are complete.

To put it another way: imagine an API that wraps a SAX parser thin enough
that you can still do crazy stateful stuff, but thick enough that the
output of handlers must be Query objects, and a reusable handler for each
type of Query in the Lucene code base comes with it out of the box.

Now including a generic XML<->Query API to distribute with Lucene becomes
very easy -- it would also allow this API to have a few shortcuts for the
really common query types so that simple queries are still somewhat human
readable. It's also easy for clients that want to add their own types of
queries, or to change the default slop on phrase queries to 5, or map a
"phrase" query to a SpanNearQuery object, or add a new tag that just
changes the analyzer used by child tags, or a tag that changes the handler
registry for dealing with its children, etc....

Hopefully that makes sense to someone besides just me. It's certainly a
lot more complexity than a simple one-to-one mapping, but it seems to me
like the flexibility is worth spending the extra time to design/build it.
Especially if I can convince Yonik's boss to pay him to do all the hard work.
:)

-Hoss
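
The handler-registry idea described in the message above could look roughly like the following Java sketch. Everything in it is an assumption made for illustration and is not taken from the message or from Lucene: the tag names (term, phrase, bool), the attribute names (field, slop), the TagHandler and XmlQueryParser types, and the Q class, which is only a printable stand-in for a real Lucene Query. The sketch covers per-tag handlers and child results being passed back up to the parent; passing state down to children is left out to keep it short.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a Lucene Query; it only carries a printable form. */
final class Q {
    final String repr;
    Q(String repr) { this.repr = repr; }
    @Override public String toString() { return repr; }
}

/** Hypothetical contract: one handler per XML tag, and it must yield a query. */
interface TagHandler {
    Q build(Map<String, String> attrs, String text, List<Q> children);
}

/** Thin SAX wrapper that hands each closed element to its registered handler. */
final class XmlQueryParser extends DefaultHandler {
    /** Per-element scratch space: attributes, text, and already-built children. */
    private static final class Frame {
        final Map<String, String> attrs = new HashMap<>();
        final StringBuilder text = new StringBuilder();
        final List<Q> children = new ArrayList<>();
    }

    private final Map<String, TagHandler> registry = new HashMap<>();
    private final Deque<Frame> stack = new ArrayDeque<>();
    private Q result;

    /** Registering a tag twice replaces the earlier handler (this enables aliasing). */
    XmlQueryParser register(String tag, TagHandler handler) {
        registry.put(tag, handler);
        return this;
    }

    Q parse(String xml) throws Exception {
        stack.clear();
        result = null;
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), this);
        return result;
    }

    @Override public void startElement(String uri, String local, String qName, Attributes a) {
        Frame f = new Frame();
        for (int i = 0; i < a.getLength(); i++) f.attrs.put(a.getQName(i), a.getValue(i));
        stack.push(f);
    }

    @Override public void characters(char[] ch, int start, int len) {
        stack.peek().text.append(ch, start, len);
    }

    @Override public void endElement(String uri, String local, String qName) {
        Frame f = stack.pop();
        TagHandler h = registry.get(qName);
        if (h == null) throw new IllegalArgumentException("no handler for <" + qName + ">");
        Q q = h.build(f.attrs, f.text.toString().trim(), f.children);
        // Pass the finished query up to the parent, or keep it as the final result.
        if (stack.isEmpty()) result = q; else stack.peek().children.add(q);
    }
}

final class Demo {
    /** Default one-to-one style handlers (tag and attribute names are made up). */
    static XmlQueryParser defaults() {
        return new XmlQueryParser()
            .register("term", (a, text, kids) -> new Q(a.get("field") + ":" + text))
            .register("phrase", (a, text, kids) -> new Q("phrase(" + a.get("field")
                    + ", \"" + text + "\", slop=" + a.getOrDefault("slop", "0") + ")"))
            .register("bool", (a, text, kids) -> new Q("bool" + kids));
    }

    /** Server-side alias: "phrase" now builds a span-near style query, default slop 5. */
    static XmlQueryParser aliasPhrase(XmlQueryParser p) {
        return p.register("phrase", (a, text, kids) -> new Q("spanNear(" + a.get("field")
                + ", \"" + text + "\", slop=" + a.getOrDefault("slop", "5") + ")"));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bool><term field=\"content\">java</term>"
                + "<phrase field=\"content\">word1 word2</phrase></bool>";
        System.out.println(defaults().parse(xml));
        System.out.println(aliasPhrase(defaults()).parse(xml));
    }
}
```

The point of the second registration is the aliasing scenario from the message: clients keep sending the same XML, and the servlet owner changes what a tag means (or its default slop) in one place on the server.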