lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Victor Hadianto <vict...@nuix.com.au>
Subject Re: '-' character not interpreted correctly in field names (solution)
Date Fri, 11 Jul 2003 00:41:16 GMT
Okay attached is the diff file to allow t-shirt to be interpreted as 
"t-shirt". Queries that start with a "-" character behave as expected, well 
at least as we expected. 

For example: -shirt +pants as -shirt +pants

One thing I need to mention is (I dig this from earlier discussion in this 
list), that Doug Cutting said this (about the similar change someone else 
propose):

<---- cut --->
Lixin Meng wrote:
> Therefore, it would be preferable to treat all hyphen in the same way.
> Either as a delimiter or as part of the word (maybe with a flag at the API).

If we change StandardTokenizer in this way then we risk breaking all the 
applications that currently use it and depend on its current behaviour. 
  So I'm reluctant to make this change.

 From the StandardTokenizer documentation:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does 
not suit your application, please consider copying this source code 
directory to your project and maintaining your own grammar-based tokenizer."

Also, if you construct a tokenizer that you think is more generally 
useful than StandardTokenizer, please contribute it by mailing it to one 
of the Lucene mailing lists.

Thanks,

Doug

<---- cut ---->

So yes this change _may_ break other exisiting applications.

cheers,
victor


On Thu, 10 Jul 2003 08:34 pm, Otis Gospodnetic wrote:
> I think this is a fine change, that others would welcome, too.
> No?
> Does your change work with queries that start with a '-' character?
> For example: -shirt +pants
> (note: no space before '-shirt')
>
> If so, I think we could include this change in QueryParser.jj if you
> send the diff, as I recall others wondering why queries like t-shirt
> get misinterpreted as +t -shirt.
>
> Thanks,
> Otis
>
> --- Victor Hadianto <victorh@nuix.com.au> wrote:
> > Eric and others,
> >
> > I finally found a solution for this problem, although it is really
> > specific to
> > our need.
> >
> > The simplest solution in the end is redefining what a "Term" is
> > about. At the
> > moment, QueryParser will parse the following:
> >
> > t-shirt as
> >
> > +t -shirt
> >
> > Which, in my opinion, is not really acceptable. A more sensible
> > parsing will
> > parse "t-shirt" as "t-shirt". If a user wants to do a query for "t"
> > without
> > the word "shirt" on it then the query should really be:
> >
> > t -shirt
> >  ^ space here.
> >
> > Similarly, a field query such as:
> >
> > model:t-shirt
> >
> > should really be interpreted as "model:t-shirt" not +model:t -shirt.
> > I this it
> > really make more sense to have the requirement of having a space
> > before the
> > "-" to identify a NOT query.
> >
> > Onward to the code change, as I have said earlier it is specific for
> > our
> > application use and thus may not be relevant to most other people.
> > Some of
> > our field name have the "-" sign in it. Thus by changing the
> > TERM_CHAR
> > definition to:
> >
> > <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
> >
> > makes QueryParser compatible with our need.
> >
> >
> > Cheers,
> >
> > Victor
> >

Mime
View raw message