lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Isakson" <Eric.Isak...@sas.com>
Subject RE: '-' character not interpreted correctly in field names
Date Wed, 14 May 2003 18:54:50 GMT
I think the query parser changes would not be too bad, I've outlined a couple of relavant lines
you should look at so you don't have to try and comprehend the productions for the entire
QueryParser. I do not think I would like to have to maintain one of those myself though. Your
other unmentioned alternative is to choose field names that match the <TERM> production
of QueryParser.jj without escapes.

QueryParser.jj line 557:
    fieldToken=<TERM> <COLON> { field = fieldToken.image; }

and earlier...
 <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
                          "[", "]", "\"", "{", "}", "~", "*", "?" ] >
| <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                           "[", "]", "\"", "{", "}", "~", "*", "?" ]
                       | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >

...

<TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >

So the characters you need to avoid in your field names are the ones from _ESCAPED_CHAR, 
[ "\\", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*", "?" ]

If you need to modify the parser, you will probably want to add a FIELDNAME token and other
supporting productions that look really similar to these lines I've copied but modify the
complement, ~[...], at the beginning of _FIELDNAME_START_CHAR (you would add this production)
so it will match the "-" that you are using in your field names (and fix it to match any other
characters you want to use in field names that it doesn't allow right now).

Eric

-----Original Message-----
From: Jon Pipitone [mailto:jpipitone@mshri.on.ca] 
Sent: Wednesday, May 14, 2003 2:26 PM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names




Eric Isakson wrote:
> I just looked at the QueryParser.jj code, your field names
 > never get processed by the analyzer. It does look like the
 > query parser will honor escapes though. I haven't tried
 > this, but try a query like "foo\-bar:foo" and have
> a look at the QueryParser.jj file for how it handles field
 > names when parsing your query.

Hrm.. that's what I had found too.  So, you're saying that, other than 
escaping dashes, I'd have to change QueryParser.. ?

I'm not too familiar just yet with JavaCC syntax, so reading through 
QueryParser is a little tough going.  Thanks Eric,

jp


> -----Original Message-----
> From: Jon Pipitone [mailto:jpipitone@mshri.on.ca]
> Sent: Monday, May 12, 2003 4:03 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
> 
> 
> Hi Otis, Terry,
> 
>  >>>You can write a custom Analyzer that does not remove dashes from  
> >>>tokens, and use it for both indexing and searching.  >>>  >>>This

> is a frequent question and answer on this list.
> 
> Sorry for the noise, but I haven't been able to find a solution in the
> mailing list archives, or by writing my own analyzer:
> 
> 	public class MyAnalyzer extends Analyzer {
> 	public TokenStream tokenStream(String fieldName, Reader reader) 		{
> 		return new CharTokenizer(reader) {
> 			protected boolean isTokenChar(char c) {
> 				return Character.isLetter(c) || c == '-';
> 			}
> 		};
> 	}
> 	}
> 
> I parse a query like this:
> 
> 	String queryString = "foo-bar:foo";
> 	String queryResult =
> 		QueryParser.parse(queryString, "body", new MyAnalyzer())
> 
> With the output:
> 	body:foo -bar:foo
> 
> But I would expect the output:
> 	 foo-bar:foo
> 
> If I print out the tokens that MyAnalyzer produces I do get "foo-bar"
> and then "foo".
> 
> Any pointers on what I'm doing wrong?
> 
> jp
> 
> 
> 
> 
>>>>--- Jon Pipitone <jpipitone@mshri.on.ca> wrote:
>>>>
>>>>
>>>>>Hi all,
>>>>>
>>>>>
>>>>>>I believe that the tokenizer treats a dash as a token
>>>>>
>>>separator.
>>>
>>>
>>>>>>Hence, the only way, as I recall, to eliminate this behavior
>>>>>
>>>is
>>>
>>>
>>>>>>to modify QueryParser.jj so it doesn't do this.  However,
>>>>>
>>>doing
>>>
>>>
>>>>>>this can cause some other problems, like hyphenated words at a
>>>>>>line break and the like.
>>>>>
>>>>>I've recently started using lucene and I'm running into the same
>>>>>issue with the query parser.  I'd like to use queries that contain
>>>>
>>>dashes
>>>
>>>
>>>>>in
>>>>>the field name, but as far as I can tell it seems that the
>>>>
>>>current
>>>
>>>
>>>>>query
>>>>>grammar treats field names as terms, and so, as Terry notes, a
>>>>
>>>dash
>>>
>>>
>>>>>becomes a token seperator.
>>>>>
>>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by
>>>>>creating a seperate non-terminal for field names.
>>>>>
>>>>>Has anyone done any work on this already?  Is modifying
>>>>>QueryParser.jj the best approach?
>>>>>
>>>>>Thanks,
>>>>>jp
>>>>>
>>>>>
>>>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message