lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jon Pipitone <jpipit...@mshri.on.ca>
Subject Re: '-' character not interpreted correctly in field names
Date Wed, 14 May 2003 18:25:36 GMT


Eric Isakson wrote:
> I just looked at the QueryParser.jj code, your field names 
 > never get processed by the analyzer. It does look like the
 > query parser will honor escapes though. I haven't tried
 > this, but try a query like "foo\-bar:foo" and have
> a look at the QueryParser.jj file for how it handles field
 > names when parsing your query.

Hrm.. that's what I had found too.  So, you're saying that, other than 
escaping dashes, I'd have to change QueryParser.. ?

I'm not too familiar just yet with JavaCC syntax, so reading through 
QueryParser is a little tough going.  Thanks Eric,

jp


> -----Original Message-----
> From: Jon Pipitone [mailto:jpipitone@mshri.on.ca] 
> Sent: Monday, May 12, 2003 4:03 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
> 
> 
> Hi Otis, Terry,
> 
>  >>>You can write a custom Analyzer that does not remove dashes from  >>>tokens,
and use it for both indexing and searching.  >>>  >>>This is a frequent
question and answer on this list.
> 
> Sorry for the noise, but I haven't been able to find a solution in the 
> mailing list archives, or by writing my own analyzer:
> 
> 	public class MyAnalyzer extends Analyzer {
> 	public TokenStream tokenStream(String fieldName, Reader reader) 		{
> 		return new CharTokenizer(reader) {
> 			protected boolean isTokenChar(char c) {
> 				return Character.isLetter(c) || c == '-';
> 			}
> 		};
> 	}
> 	}
> 
> I parse a query like this:
> 
> 	String queryString = "foo-bar:foo";
> 	String queryResult =
> 		QueryParser.parse(queryString, "body", new MyAnalyzer())
> 
> With the output:
> 	body:foo -bar:foo
> 
> But I would expect the output:
> 	 foo-bar:foo
> 
> If I print out the tokens that MyAnalyzer produces I do get "foo-bar" 
> and then "foo".
> 
> Any pointers on what I'm doing wrong?
> 
> jp
> 
> 
> 
> 
>>>>--- Jon Pipitone <jpipitone@mshri.on.ca> wrote:
>>>>
>>>>
>>>>>Hi all,
>>>>>
>>>>>
>>>>>>I believe that the tokenizer treats a dash as a token
>>>>>
>>>separator.
>>>
>>>
>>>>>>Hence, the only way, as I recall, to eliminate this behavior
>>>>>
>>>is
>>>
>>>
>>>>>>to modify QueryParser.jj so it doesn't do this.  However,
>>>>>
>>>doing
>>>
>>>
>>>>>>this can cause some other problems, like hyphenated words at a 
>>>>>>line break and the like.
>>>>>
>>>>>I've recently started using lucene and I'm running into the same 
>>>>>issue with the query parser.  I'd like to use queries that contain
>>>>
>>>dashes
>>>
>>>
>>>>>in
>>>>>the field name, but as far as I can tell it seems that the
>>>>
>>>current
>>>
>>>
>>>>>query
>>>>>grammar treats field names as terms, and so, as Terry notes, a
>>>>
>>>dash
>>>
>>>
>>>>>becomes a token seperator.
>>>>>
>>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by 
>>>>>creating a seperate non-terminal for field names.
>>>>>
>>>>>Has anyone done any work on this already?  Is modifying 
>>>>>QueryParser.jj the best approach?
>>>>>
>>>>>Thanks,
>>>>>jp
>>>>>
>>>>>
>>>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message