lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Victor Hadianto <vict...@nuix.com.au>
Subject Re: '-' character not interpreted correctly in field names (solution)
Date Thu, 10 Jul 2003 04:06:34 GMT
Eric and others,

I finally found a solution for this problem, although it is really specific to 
our need.

The simplest solution in the end is redefining what a "Term" is about. At the 
moment, QueryParser will parse the following:

t-shirt as

+t -shirt

Which, in my opinion, is not really acceptable. A more sensible parsing will 
parse "t-shirt" as "t-shirt". If a user wants to do a query for "t" without 
the word "shirt" on it then the query should really be:

t -shirt
 ^ space here.

Similarly, a field query such as:

model:t-shirt

should really be interpreted as "model:t-shirt" not +model:t -shirt. I this it 
really make more sense to have the requirement of having a space before the 
"-" to identify a NOT query.

Onward to the code change, as I have said earlier it is specific for our 
application use and thus may not be relevant to most other people. Some of 
our field name have the "-" sign in it. Thus by changing the TERM_CHAR 
definition to:

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

makes QueryParser compatible with our need. 


Cheers,

Victor


On Thu, 10 Jul 2003 10:11 am, Victor Hadianto wrote:
> Yep tried that. Actually there is more to the creation of the field than
> just in this line:
>
> fieldToken=<TERM> <COLON> { field = fieldToken.image; }
>
>
> Because I've created a <FIELDNAME> which is exactly the same with <TERM>
> which
>
> looks like this:
> | <FIELDNAME: <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
>
> and change fieldToken to:
>
> fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
>
> And it doesn't work. Simple query such as to:tom* is parsed as blank query.
>
> I will continue looking at this problem and will post my solution if I get
> it, in the mean time I really do appreciate any help and suggestions.
>
> cheers,
>
> victor
>
> On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> > You left out the ~ character in your _FIELDNAME_START_CHAR production.
> > That character tells the grammar that it should take all the characters
> > except the ones you specified (the complement).
> >
> > Change:
> > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
> >
> > To:
> > | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
> >
> > and it should probably work.
> >
> > Eric
> >
> > -----Original Message-----
> > From: Victor Hadianto [mailto:victorh@nuix.com.au]
> > Sent: Wednesday, July 09, 2003 4:53 AM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> >
> > Hi Erik and others,
> >
> > I'm looking for a similar solution where I need QueryParser not to drop
> > the "-" characters from the field name. Hower outside the field I do want
> > the - sign interpreted as "not" modifier.
> >
> > I'm definitely not an expert in JavaCC and to be honest I only have a
> > limited idea about Erik's suggestion work,
> >
> > Anyway I followed the suggestion and added the following:
> > | <#_WHITESPACE: ( " " | "\t" ) >
> > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
> > | "^",
> >
> >                                "[", "]", "\"", "{", "}", "~", "*", "?" ]
> >
> >                              | <_ESCAPED_CHAR> ) >
> > |
> > | <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR>
) >
> >
> > and again below I added:
> > | <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> > | <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)*  >
> >
> > And I changed:
> >
> >     LOOKAHEAD(2)
> >     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> >
> > to: ...
> >
> >     LOOKAHEAD(2)
> >     fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
> >
> >
> > Well after doing all this mods all the query that involved field names
> > cause problem, for example if I searched for
> >
> > fieldname:hello
> >
> > The query is blank (yes blank, nothing in it)
> >
> > and if the fieldname does contain a dash ("-") for example:
> > field-name:hello
> >
> > They query is: +field -name
> >
> > hello is dropped.
> >
> >
> > Does anyone has any idea? Help and suggestions will be much appreciated.
> > I really need to get this dash working, changing the field name will be
> > my last resort which I won't explore until I really have to.
> >
> >
> > Thanks,
> >
> > Victor
> >
> > On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> > > I think the query parser changes would not be too bad, I've outlined a
> > > couple of relavant lines you should look at so you don't have to try
> > > and comprehend the productions for the entire QueryParser. I do not
> > > think I would like to have to maintain one of those myself though.
> > > Your other unmentioned alternative is to choose field names that match
> > > the <TERM> production of QueryParser.jj without escapes.
> > >
> > > QueryParser.jj line 557:
> > >     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> > >
> > > and earlier...
> > >  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
> > >                           "[", "]", "\"", "{", "}", "~", "*", "?" ] >
> > >
> > > | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
> > > | "^",
> > >
> > >                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
> > >
> > >                        | <_ESCAPED_CHAR> ) >
> > > |
> > > | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
> > >
> > > ...
> > >
> > > <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> > >
> > > So the characters you need to avoid in your field names are the ones
> > > from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[",
> > > "]", "\"", "{", "}", "~", "*", "?" ]
> > >
> > > If you need to modify the parser, you will probably want to add a
> > > FIELDNAME token and other supporting productions that look really
> > > similar to these lines I've copied but modify the complement, ~[...],
> > > at the beginning of _FIELDNAME_START_CHAR (you would add this
> > > production) so it will match the "-" that you are using in your field
> > > names (and fix it to match any other characters you want to use in
> > > field names that it doesn't allow right now).
> > >
> > > Eric
> > >
> > > -----Original Message-----
> > > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > > Sent: Wednesday, May 14, 2003 2:26 PM
> > > To: Lucene Users List
> > > Subject: Re: '-' character not interpreted correctly in field names
> > >
> > > Eric Isakson wrote:
> > > > I just looked at the QueryParser.jj code, your field names
> > > >
> > >  > never get processed by the analyzer. It does look like the  > query
> > >
> > > parser will honor escapes though. I haven't tried  > this, but try a
> > > query like "foo\-bar:foo" and have
> > >
> > > > a look at the QueryParser.jj file for how it handles field
> > > >
> > >  > names when parsing your query.
> > >
> > > Hrm.. that's what I had found too.  So, you're saying that, other than
> > > escaping dashes, I'd have to change QueryParser.. ?
> > >
> > > I'm not too familiar just yet with JavaCC syntax, so reading through
> > > QueryParser is a little tough going.  Thanks Eric,
> > >
> > > jp
> > >
> > > > -----Original Message-----
> > > > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > > > Sent: Monday, May 12, 2003 4:03 PM
> > > > To: Lucene Users List
> > > > Subject: Re: '-' character not interpreted correctly in field names
> > > >
> > > >
> > > > Hi Otis, Terry,
> > > >
> > > >  >>>You can write a custom Analyzer that does not remove dashes
from
> > > > >>>
> > > > >>>tokens, and use it for both indexing and searching.  >>>
 >>>This
> > > >
> > > > is a frequent question and answer on this list.
> > > >
> > > > Sorry for the noise, but I haven't been able to find a solution in
> > > > the mailing list archives, or by writing my own analyzer:
> > > >
> > > > 	public class MyAnalyzer extends Analyzer {
> > > > 	public TokenStream tokenStream(String fieldName, Reader reader) 		{
> > > > 		return new CharTokenizer(reader) {
> > > > 			protected boolean isTokenChar(char c) {
> > > > 				return Character.isLetter(c) || c == '-';
> > > > 			}
> > > > 		};
> > > > 	}
> > > > 	}
> > > >
> > > > I parse a query like this:
> > > >
> > > > 	String queryString = "foo-bar:foo";
> > > > 	String queryResult =
> > > > 		QueryParser.parse(queryString, "body", new MyAnalyzer())
> > > >
> > > > With the output:
> > > > 	body:foo -bar:foo
> > > >
> > > > But I would expect the output:
> > > > 	 foo-bar:foo
> > > >
> > > > If I print out the tokens that MyAnalyzer produces I do get
> > > > "foo-bar" and then "foo".
> > > >
> > > > Any pointers on what I'm doing wrong?
> > > >
> > > > jp
> > > >
> > > >>>>--- Jon Pipitone <jpipitone@mshri.on.ca> wrote:
> > > >>>>>Hi all,
> > > >>>>>
> > > >>>>>>I believe that the tokenizer treats a dash as a token
> > > >>>
> > > >>>separator.
> > > >>>
> > > >>>>>>Hence, the only way, as I recall, to eliminate this
behavior
> > > >>>
> > > >>>is
> > > >>>
> > > >>>>>>to modify QueryParser.jj so it doesn't do this.  However,
> > > >>>
> > > >>>doing
> > > >>>
> > > >>>>>>this can cause some other problems, like hyphenated
words at a
> > > >>>>>>line break and the like.
> > > >>>>>
> > > >>>>>I've recently started using lucene and I'm running into
the same
> > > >>>>>issue with the query parser.  I'd like to use queries that
> > > >>>>>contain
> > > >>>
> > > >>>dashes
> > > >>>
> > > >>>>>in
> > > >>>>>the field name, but as far as I can tell it seems that
the
> > > >>>
> > > >>>current
> > > >>>
> > > >>>>>query
> > > >>>>>grammar treats field names as terms, and so, as Terry notes,
a
> > > >>>
> > > >>>dash
> > > >>>
> > > >>>>>becomes a token seperator.
> > > >>>>>
> > > >>>>>Terry suggests modifying the QueryParser.jj -- I would
suspect by
> > > >>>>>creating a seperate non-terminal for field names.
> > > >>>>>
> > > >>>>>Has anyone done any work on this already?  Is modifying
> > > >>>>>QueryParser.jj the best approach?
> > > >>>>>
> > > >>>>>Thanks,
> > > >>>>>jp
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message