Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
Date: Mon, 5 Dec 2005 18:20:20 -0800 (PST)
From: Chris Hostetter <hossman_lucene@fucit.org>
Sender: hossman@hal.rescomp.berkeley.edu
To: java-dev@lucene.apache.org
Subject: Re: "Advanced" query language
In-Reply-To: <C582271B-1AFB-4134-A472-ECFB841FB288@ehatchersolutions.com>
Message-ID: <Pine.LNX.4.58.0512051625530.6279@hal.rescomp.berkeley.edu>
References: 
 <B17B1CFF4282214AB8BAADDDC207112205119F00@THHS2EXBE1X.hostedservice2.net>
 <43947131.2080007@scalix.com> <200512052118.04225.paul.elschot@xs4all.nl>
 <C582271B-1AFB-4134-A472-ECFB841FB288@ehatchersolutions.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


I'm extremely stoked to see this topic come up, but very sad that I didn't
have time to read any Lucene mail this past weekend.  I'll have to
catch up.

First off...

: Again, we're talking machine-to-machine communication here, not human-
: machine.

: While there have been several different topics brought up on this
: thread, it seems we're diverging from the original idea.  Let's
: consider the most basic use case example here, and I'm making it
: intentionally as concrete as possible:
:
: A Swing client performs searches by communicating with a Lucene
: search server, which is wrapped by a RESTful servlet.  The client
: wants to issue sophisticated queries that are not supported by
: QueryParser.

While I agree that the number one goal of this discussion is and should
be a "string representation of a query" which can represent any valid
query structure in a reliable manner, there are lots of things that could
satisfy that goal, and the fastest or easiest to implement may not be the
best in the context of other use cases or goals that could also be
satisfied without sacrificing the primary goal.


I can think of at least two big use cases that I'm concerned about....

1) Human creation

While it's true that most "users" just want to type words, the Humans I'm
thinking of (and I imagine Paul was thinking of as well) aren't users as much
as they are Developers or QA Testers who understand the internals of the
index, know about all the different types of Lucene queries, and want
to manually hit that REST servlet with a particular query construct.
It's perfectly fine to expect these humans to type XML, but let's not
make it extra hard on them.

This may map one to one with the internal representation...

   <BooleanQuery>
     <BooleanClause type="mayhave">
        <TermQuery boost="2.0">
          <Term><Field>content</Field><Value>java</Value></Term>
        </TermQuery>
     </BooleanClause>
     <BooleanClause type="mayhave">
        <TermQuery>
          <Term><Field>content</Field><Value>lucene</Value></Term>
        </TermQuery>
     </BooleanClause>
     <BooleanClause type="mustnothave">
        <TermQuery>
          <Term><Field>content</Field><Value>coffee</Value></Term>
        </TermQuery>
     </BooleanClause>
  </BooleanQuery>

...but so does this, and I know which one I'd rather have to type if I
were trying to debug something...

  <q t="boolean">
     <c t="mayhave"><q t="term" boost="2.0" f="content">java</q></c>
     <c t="mayhave"><q t="term" f="content">lucene</q></c>
     <c t="mustnothave"><q t="term" f="content">coffee</q></c>
  </q>

Furthermore, the following may not have a one-to-one correspondence with
the underlying implementation of BooleanQuery, but that doesn't make it
a less correct way of identifying a query, and again, I know which one I'd
rather type if I were debugging...

  <q t="boolean">
     <c t="mayhave">
        <q t="term" boost="2.0" f="content">java</q>
        <q t="term" f="content">lucene</q>
     </c>
     <c t="mustnothave"><q t="term" f="content">coffee</q></c>
  </q>


...I guess the gist of my point is: Just because the generator/parser is
expected to be a machine doesn't mean we can't also make it easy for
humans; nor should it mean we ignore Huffman encoding (the most common
constructs should be the shortest to write).
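
To make the equivalence of the two compact forms concrete, here is a
minimal sketch of a reader for them.  It is only an illustration under
stated assumptions: it uses the JDK's DOM parser, it does not touch any
Lucene classes, and instead of building Query objects it renders a
QueryParser-like string.  The class name CompactQueryXml and the rendering
format are made up for this example; only the q/c tags and the t, f and
boost attributes come from the XML above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Lucene.
class CompactQueryXml {

    // Parses the compact XML and renders it as a QueryParser-like string.
    static String render(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("malformed query XML", e);
        }
        return renderQuery(root);
    }

    private static String renderQuery(Element q) {
        String type = q.getAttribute("t");
        if ("term".equals(type)) {
            String s = q.getAttribute("f") + ":" + q.getTextContent().trim();
            return q.hasAttribute("boost") ? s + "^" + q.getAttribute("boost") : s;
        }
        if ("boolean".equals(type)) {
            List<String> clauses = new ArrayList<>();
            for (Element c : children(q, "c")) {
                String prefix = prefixFor(c.getAttribute("t"));
                // Every <q> inside one <c> becomes its own clause, so the
                // grouped form and the one-<c>-per-query form come out the same.
                for (Element sub : children(c, "q")) {
                    String r = renderQuery(sub);
                    boolean nested = "boolean".equals(sub.getAttribute("t"));
                    clauses.add(prefix + (nested ? "(" + r + ")" : r));
                }
            }
            return String.join(" ", clauses);
        }
        throw new IllegalArgumentException("unknown query type: " + type);
    }

    // Only the two clause types used in the examples above are handled.
    private static String prefixFor(String occur) {
        switch (occur) {
            case "mayhave": return "";
            case "mustnothave": return "-";
            default: throw new IllegalArgumentException("unknown clause type: " + occur);
        }
    }

    // Direct child elements with the given tag name, ignoring whitespace text nodes.
    private static List<Element> children(Element parent, String tag) {
        List<Element> out = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n instanceof Element && tag.equals(((Element) n).getTagName())) {
                out.add((Element) n);
            }
        }
        return out;
    }
}
```

Calling CompactQueryXml.render on the one-clause-per-query form and on the
grouped form gives the same string, which is the sense in which both
identify the same query.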


2) Aliasing

Imagine that we were having this discussion a year or two ago (however
long it's been since Span queries were created) and the result was a new
XML query parser that used reflection to build up objects based on the XML
tags.  (Not the silly way where reflection is used every time, the less
silly way where reflection is used the first time, and a Singleton
maintains a Map of XMLTag=>Class mappings for future parsing).  Lucene
Users go out and write their REST servlets that use this, and when their
many clients ask them how to generate a query, they give them examples
like this...

  <BooleanQuery>
    <PhraseQuery boost="2.0">
     <Term field="content">word1</Term>
     <Term field="content">word2</Term>
    </PhraseQuery>
    <TermQuery><Term field="content">word1</Term></TermQuery>
    <TermQuery><Term field="content">word2</Term></TermQuery>
  </BooleanQuery>

...and everybody is happy.  The Lucene user is a little annoyed one day
when he realizes that setting slop="5" generally gives "better" results:
there's nothing he can change on his servlet to make this change
for everyone, so he's got to contact all of his clients and suggest that
they make the change.  Then one day SpanQuery gets written, and our Lucene
User is drooling, but he's faced with the same problem: he has to tell all
of his clients that he recommends using a new query structure, because
there's no easy way to drop in SpanNearQuery instead of PhraseQuery.  He
briefly considers recompiling the Lucene jar after changing PhraseQuery to
be a thin subclass/wrapper around SpanNearQuery, but then smacks himself
silly when he realizes how hard it would be to maintain that in the
future.


The moral of this story being that even though it's important to have
something that can (and by default *does*) provide a one-to-one mapping of
Query<->XML, this mapping shouldn't be implemented in such a way that it
prevents people from using it in more complex ways.


The idea I've been rolling around in the back of my head for a few weeks
is to have a generic parsing framework, which can be used with registered
"handlers" to parse XML into Queries, each handler being tied to a
specific XML tag, and having the ability to pass state information along
as child elements are parsed, and take action based on state information
passed back up after the child elements are complete.  To put it
another way: Imagine an API that wraps a SAX parser thin enough that you
can still do crazy stateful stuff, but thick enough that the output of
handlers must be Query objects, and a reusable handler for each type of
Query in the Lucene code base comes with it out of the box.
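
A rough sketch of what such a framework could look like, using the JDK's
SAX parser.  Everything named here (SketchQuery, Frame, TagHandler,
XmlQueryFramework) is hypothetical and exists only for illustration;
SketchQuery stands in for a Lucene Query so the example has no
dependencies, and a real version would return actual Query objects.  Each
open element gets a Frame on a stack; when the element closes, its handler
is called with the finished Frame and the result is handed up to the
parent.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Stand-in for org.apache.lucene.search.Query (assumption, keeps the sketch dependency-free).
class SketchQuery {
    final String description;
    SketchQuery(String description) { this.description = description; }
}

// State for one open element: attributes, text, and already-built child queries.
class Frame {
    final Map<String, String> attrs = new HashMap<>();
    final StringBuilder text = new StringBuilder();
    final List<SketchQuery> children = new ArrayList<>();
}

// Hypothetical handler contract: called once the element and its children are complete.
interface TagHandler {
    SketchQuery build(Frame frame);
}

class XmlQueryFramework extends DefaultHandler {
    private final Map<String, TagHandler> registry = new HashMap<>();
    private final Deque<Frame> stack = new ArrayDeque<>();
    private SketchQuery result;

    void register(String tag, TagHandler handler) { registry.put(tag, handler); }

    SketchQuery parse(String xml) {
        stack.clear();
        result = null;
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), this);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse query XML", e);
        }
        return result;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!registry.containsKey(qName)) {
            throw new IllegalArgumentException("no handler registered for <" + qName + ">");
        }
        Frame frame = new Frame();
        for (int i = 0; i < atts.getLength(); i++) {
            frame.attrs.put(atts.getQName(i), atts.getValue(i));
        }
        stack.push(frame);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!stack.isEmpty()) stack.peek().text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Frame done = stack.pop();
        SketchQuery built = registry.get(qName).build(done);  // children are complete here
        if (stack.isEmpty()) result = built;                  // root element
        else stack.peek().children.add(built);                // pass the result up to the parent
    }

    // Joins child descriptions with spaces; shared by the example handlers below.
    static String join(Frame f) {
        StringBuilder sb = new StringBuilder();
        for (SketchQuery q : f.children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(q.description);
        }
        return sb.toString();
    }

    // Example registrations for the tag names used in the XML above.
    static XmlQueryFramework withDefaults() {
        XmlQueryFramework fw = new XmlQueryFramework();
        // Simplification: <Term> also yields a SketchQuery so all handlers share one return type.
        fw.register("Term", f -> new SketchQuery(f.attrs.get("field") + ":" + f.text.toString().trim()));
        fw.register("TermQuery", f -> f.children.get(0));
        fw.register("PhraseQuery", f -> new SketchQuery("phrase(" + join(f) + ")"));
        fw.register("BooleanQuery", f -> new SketchQuery(join(f)));
        return fw;
    }
}
```

The point of the registry is that a servlet owner can call register again
for "PhraseQuery" and every client's existing XML starts producing a
different query, without any client changing what it sends.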

Now including a generic XML<->Query API to distribute with Lucene becomes
very easy -- it would also allow this API to have a few shortcuts for the
really common query types so that simple queries are still somewhat human
readable.  It's also easy for clients that want to add their own types of
queries, or to change the default slop on phrase queries to 5, or map a
"phrase" query to a SpanNearQuery object, or add a new tag that just
changes the analyzer used by child tags, or a tag that changes the handler
registry for dealing with its children, etc....
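
As a small illustration of the slop and SpanNearQuery cases, here is a
sketch of just the registry part, with no XML parsing.  The class
HandlerRegistrySketch and its methods are invented for this example, and
the handlers return a description string rather than a real Lucene query.
The default slop of 0 mirrors PhraseQuery's own default; the value 5 is the
one from the story above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical registry, not part of Lucene.
class HandlerRegistrySketch {
    // Tag name -> handler.  A handler gets the tag's attributes and its term texts
    // and returns a description of the query it would build.
    private final Map<String, BiFunction<Map<String, String>, List<String>, String>> handlers =
            new HashMap<>();

    void register(String tag, BiFunction<Map<String, String>, List<String>, String> handler) {
        handlers.put(tag, handler);  // re-registering a tag replaces its handler
    }

    String build(String tag, Map<String, String> attrs, List<String> terms) {
        BiFunction<Map<String, String>, List<String>, String> h = handlers.get(tag);
        if (h == null) throw new IllegalArgumentException("no handler for <" + tag + ">");
        return h.apply(attrs, terms);
    }

    // Describes a PhraseQuery; a slop attribute sent by the client wins over the server default.
    static String phrase(Map<String, String> attrs, List<String> terms, int defaultSlop) {
        return "PhraseQuery(terms=" + terms + ", slop=" + slop(attrs, defaultSlop) + ")";
    }

    // Describes an in-order SpanNearQuery over the same terms.
    static String spanNear(Map<String, String> attrs, List<String> terms, int defaultSlop) {
        return "SpanNearQuery(terms=" + terms + ", slop=" + slop(attrs, defaultSlop)
                + ", inOrder=true)";
    }

    private static int slop(Map<String, String> attrs, int defaultSlop) {
        String s = attrs.get("slop");
        return s == null ? defaultSlop : Integer.parseInt(s);
    }
}
```

Registering "PhraseQuery" with (a, t) -> phrase(a, t, 0) gives the plain
one-to-one behaviour.  Swapping that registration for phrase(a, t, 5), or
for spanNear(a, t, 5), changes what every client gets for the same
<PhraseQuery> XML, which is the aliasing the servlet owner in the story was
missing.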


Hopefully that makes sense to someone besides just me.  It's certainly a
lot more complexity than a simple one-to-one mapping, but it seems to me
like the flexibility is worth spending the extra time to design/build it.

Especially if I can convince Yonik's boss to pay him to do all the hard
work. :)


-Hoss

