lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: '-' character not interpreted correctly in field names (solution)
Date Thu, 10 Jul 2003 10:34:19 GMT
I think this is a fine change, that others would welcome, too.
No?
Does your change work with queries that start with a '-' character?
For example: -shirt +pants
(note: no space before '-shirt')

If so, I think we could include this change in QueryParser.jj if you
send the diff, as I recall others wondering why queries like t-shirt
get misinterpreted as +t -shirt.

Thanks,
Otis

--- Victor Hadianto <victorh@nuix.com.au> wrote:
> Eric and others,
> 
> I finally found a solution for this problem, although it is really
> specific to 
> our need.
> 
> The simplest solution in the end is redefining what a "Term" is
> about. At the 
> moment, QueryParser will parse the following:
> 
> t-shirt as
> 
> +t -shirt
> 
> Which, in my opinion, is not really acceptable. A more sensible
> parsing will 
> parse "t-shirt" as "t-shirt". If a user wants to do a query for "t"
> without 
> the word "shirt" on it then the query should really be:
> 
> t -shirt
>  ^ space here.
> 
> Similarly, a field query such as:
> 
> model:t-shirt
> 
> should really be interpreted as "model:t-shirt" not +model:t -shirt.
> I this it 
> really make more sense to have the requirement of having a space
> before the 
> "-" to identify a NOT query.
> 
> Onward to the code change, as I have said earlier it is specific for
> our 
> application use and thus may not be relevant to most other people.
> Some of 
> our field name have the "-" sign in it. Thus by changing the
> TERM_CHAR 
> definition to:
> 
> <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
> 
> makes QueryParser compatible with our need. 
> 
> 
> Cheers,
> 
> Victor
> 
> 
> On Thu, 10 Jul 2003 10:11 am, Victor Hadianto wrote:
> > Yep tried that. Actually there is more to the creation of the field
> than
> > just in this line:
> >
> > fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> >
> >
> > Because I've created a <FIELDNAME> which is exactly the same with
> <TERM>
> > which
> >
> > looks like this:
> > | <FIELDNAME: <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> >
> > and change fieldToken to:
> >
> > fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
> >
> > And it doesn't work. Simple query such as to:tom* is parsed as
> blank query.
> >
> > I will continue looking at this problem and will post my solution
> if I get
> > it, in the mean time I really do appreciate any help and
> suggestions.
> >
> > cheers,
> >
> > victor
> >
> > On Thu, 10 Jul 2003 03:24 am, Eric Isakson wrote:
> > > You left out the ~ character in your _FIELDNAME_START_CHAR
> production.
> > > That character tells the grammar that it should take all the
> characters
> > > except the ones you specified (the complement).
> > >
> > > Change:
> > > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(",
> ")", ":",
> > >
> > > To:
> > > | <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(",
> ")", ":",
> > >
> > > and it should probably work.
> > >
> > > Eric
> > >
> > > -----Original Message-----
> > > From: Victor Hadianto [mailto:victorh@nuix.com.au]
> > > Sent: Wednesday, July 09, 2003 4:53 AM
> > > To: Lucene Users List
> > > Subject: Re: '-' character not interpreted correctly in field
> names
> > >
> > >
> > > Hi Erik and others,
> > >
> > > I'm looking for a similar solution where I need QueryParser not
> to drop
> > > the "-" characters from the field name. Hower outside the field I
> do want
> > > the - sign interpreted as "not" modifier.
> > >
> > > I'm definitely not an expert in JavaCC and to be honest I only
> have a
> > > limited idea about Erik's suggestion work,
> > >
> > > Anyway I followed the suggestion and added the following:
> > > | <#_WHITESPACE: ( " " | "\t" ) >
> > > | <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(",
> ")", ":",
> > > | "^",
> > >
> > >                                "[", "]", "\"", "{", "}", "~",
> "*", "?" ]
> > >
> > >                              | <_ESCAPED_CHAR> ) >
> > > |
> > > | <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR>
> ) >
> > >
> > > and again below I added:
> > > | <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> > > | <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)* 
>
> > >
> > > And I changed:
> > >
> > >     LOOKAHEAD(2)
> > >     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
> > >
> > > to: ...
> > >
> > >     LOOKAHEAD(2)
> > >     fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image;
}
> > >
> > >
> > > Well after doing all this mods all the query that involved field
> names
> > > cause problem, for example if I searched for
> > >
> > > fieldname:hello
> > >
> > > The query is blank (yes blank, nothing in it)
> > >
> > > and if the fieldname does contain a dash ("-") for example:
> > > field-name:hello
> > >
> > > They query is: +field -name
> > >
> > > hello is dropped.
> > >
> > >
> > > Does anyone has any idea? Help and suggestions will be much
> appreciated.
> > > I really need to get this dash working, changing the field name
> will be
> > > my last resort which I won't explore until I really have to.
> > >
> > >
> > > Thanks,
> > >
> > > Victor
> > >
> > > On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> > > > I think the query parser changes would not be too bad, I've
> outlined a
> > > > couple of relavant lines you should look at so you don't have
> to try
> > > > and comprehend the productions for the entire QueryParser. I do
> not
> > > > think I would like to have to maintain one of those myself
> though.
> > > > Your other unmentioned alternative is to choose field names
> that match
> > > > the <TERM> production of QueryParser.jj without escapes.
> > > >
> > > > QueryParser.jj line 557:
> > > >     fieldToken=<TERM> <COLON> { field = fieldToken.image;
}
> > > >
> > > > and earlier...
> > > >  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":",
> "^",
> > > >                           "[", "]", "\"", "{", "}", "~", "*",
> "?" ] >
> > > >
> > > > | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")",
> ":",
> > > > | "^",
> > > >
> > > >                            "[", "]", "\"", "{", "}", "~", "*",
> "?" ]
> > > >
> > > >                        | <_ESCAPED_CHAR> ) >
> > > > |
> > > > | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR>
) >
> > > >
> > > > ...
> > > >
> > > > <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
> > > >
> > > > So the characters you need to avoid in your field names are the
> ones
> > > > from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^",
> "[",
> > > > "]", "\"", "{", "}", "~", "*", "?" ]
> > > >
> > > > If you need to modify the parser, you will probably want to add
> a
> > > > FIELDNAME token and other supporting productions that look
> really
> > > > similar to these lines I've copied but modify the complement,
> ~[...],
> > > > at the beginning of _FIELDNAME_START_CHAR (you would add this
> > > > production) so it will match the "-" that you are using in your
> field
> > > > names (and fix it to match any other characters you want to use
> in
> > > > field names that it doesn't allow right now).
> > > >
> > > > Eric
> > > >
> > > > -----Original Message-----
> > > > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > > > Sent: Wednesday, May 14, 2003 2:26 PM
> > > > To: Lucene Users List
> > > > Subject: Re: '-' character not interpreted correctly in field
> names
> > > >
> > > > Eric Isakson wrote:
> > > > > I just looked at the QueryParser.jj code, your field names
> > > > >
> > > >  > never get processed by the analyzer. It does look like the 
> > query
> > > >
> > > > parser will honor escapes though. I haven't tried  > this, but
> try a
> > > > query like "foo\-bar:foo" and have
> > > >
> > > > > a look at the QueryParser.jj file for how it handles field
> > > > >
> > > >  > names when parsing your query.
> > > >
> > > > Hrm.. that's what I had found too.  So, you're saying that,
> other than
> > > > escaping dashes, I'd have to change QueryParser.. ?
> > > >
> > > > I'm not too familiar just yet with JavaCC syntax, so reading
> through
> > > > QueryParser is a little tough going.  Thanks Eric,
> > > >
> > > > jp
> > > >
> > > > > -----Original Message-----
> > > > > From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> > > > > Sent: Monday, May 12, 2003 4:03 PM
> > > > > To: Lucene Users List
> > > > > Subject: Re: '-' character not interpreted correctly in field
> names
> > > > >
> > > > >
> > > > > Hi Otis, Terry,
> > > > >
> > > > >  >>>You can write a custom Analyzer that does not remove
> dashes from
> > > > > >>>
> > > > > >>>tokens, and use it for both indexing and searching. 
>>> 
> >>>This
> > > > >
> > > > > is a frequent question and answer on this list.
> > > > >
> > > > > Sorry for the noise, but I haven't been able to find a
> solution in
> > > > > the mailing list archives, or by writing my own analyzer:
> > > > >
> > > > > 	public class MyAnalyzer extends Analyzer {
> > > > > 	public TokenStream tokenStream(String fieldName, Reader
> reader) 		{
> > > > > 		return new CharTokenizer(reader) {
> > > > > 			protected boolean isTokenChar(char c) {
> > > > > 				return Character.isLetter(c) || c == '-';
> > > > > 			}
> > > > > 		};
> > > > > 	}
> > > > > 	}
> > > > >
> > > > > I parse a query like this:
> > > > >
> > > > > 	String queryString = "foo-bar:foo";
> > > > > 	String queryResult =
> > > > > 		QueryParser.parse(queryString, "body", new MyAnalyzer())
> > > > >
> > > > > With the output:
> > > > > 	body:foo -bar:foo
> > > > >
> > > > > But I would expect the output:
> > > > > 	 foo-bar:foo
> > > > >
> > > > > If I print out the tokens that MyAnalyzer produces I do get
> > > > > "foo-bar" and then "foo".
> > > > >
> > > > > Any pointers on what I'm doing wrong?
> > > > >
> > > > > jp
> > > > >
> > > > >>>>--- Jon Pipitone <jpipitone@mshri.on.ca> wrote:
> > > > >>>>>Hi all,
> > > > >>>>>
> > > > >>>>>>I believe that the tokenizer treats a dash as
a token
> > > > >>>
> > > > >>>separator.
> > > > >>>
> > > > >>>>>>Hence, the only way, as I recall, to eliminate
this
> behavior
> > > > >>>
> > > > >>>is
> > > > >>>
> > > > >>>>>>to modify QueryParser.jj so it doesn't do this.
 However,
> > > > >>>
> > > > >>>doing
> > > > >>>
> > > > >>>>>>this can cause some other problems, like hyphenated
words
> at a
> > > > >>>>>>line break and the like.
> > > > >>>>>
> > > > >>>>>I've recently started using lucene and I'm running
into
> the same
> > > > >>>>>issue with the query parser.  I'd like to use queries
that
> > > > >>>>>contain
> > > > >>>
> > > > >>>dashes
> > > > >>>
> > > > >>>>>in
> > > > >>>>>the field name, but as far as I can tell it seems
that the
> > > > >>>
> > > > >>>current
> > > > >>>
> > > > >>>>>query
> > > > >>>>>grammar treats field names as terms, and so, as Terry
> notes, a
> > > > >>>
> > > > >>>dash
> > > > >>>
> > > > >>>>>becomes a token seperator.
> > > > >>>>>
> > > > >>>>>Terry suggests modifying the QueryParser.jj -- I would
> suspect by
> > > > >>>>>creating a seperate non-terminal for field names.
> > > > >>>>>
> > > > >>>>>Has anyone done any work on this already?  Is modifying
> > > > >>>>>QueryParser.jj the best approach?
> > > > >>>>>
> > > > >>>>>Thanks,
> > > > >>>>>jp
> > >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message