lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Smith" <ssm...@mainstreamdata.com>
Subject RE: QueryParser in 3.x
Date Fri, 17 Sep 2010 17:34:15 GMT
First, let me say that I didn't think the problem was in QueryParser and I apologize if that's
how it sounded.  QueryParser is a central method to Lucene.  1 of me having problems with
QueryParser, 1000's of others not.  Is the problem more likely in my code or lucene.  We'll
all agree on the answer to that question.

As further proof, I ran the following code.  The first part is from Simon's email (thanks
for that snippet) and the second part is from LIA2.

        // code from Willnauer email
        Analyzer a = new MyAnalyzer(Version.LUCENE_30);
        TokenStream stream = a.reusableTokenStream("body", new StringReader("Europabörsen"));

        TermAttribute attr = stream.addAttribute(TermAttribute.class);
        while(stream.incrementToken())
        {
          System.out.println(attr.term());
        }

        // code from LIA2
        stream = a.tokenStream("body", new StringReader("Europabörsen"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken())
        {
            System.out.print(term.term());
        }


The answer I got back was:
europabörsen
europaborsen

I realized the difference between these two was whether I was getting the reusableTokeStream
or the tokenStream.  In looking at my code, the ASCIIFoldingFilter was not in the filter setup
for the resusableTokenStream().  It was for the tokenStream().  I added it to the reusableTokenStream
and I now get the result I wanted.  The above code snippet generates the word without the
umlaut in both cases.  So, problem solved.

Thanks to Simon for putting on the right track.

Scott


-----Original Message-----
From: Simon Willnauer [mailto:simon.willnauer@googlemail.com] 
Sent: Friday, September 17, 2010 1:03 AM
To: java-user@lucene.apache.org
Subject: Re: QueryParser in 3.x

On Fri, Sep 17, 2010 at 1:06 AM, Scott Smith <ssmith@mainstreamdata.com> wrote:
> I recently upgraded to Lucene 3.0 and am seeing some new behavior that I don't understand.
 Perhaps someone can explain why.
>
>
>
> I have a custom analyzer.  Part of the analyzer uses the AsciiFoldingFilter.  If I
run a word with an umlaut through that analyzer using the AnalyzerDemo code in LIA2, as expected,
I get the same word except that the umlauted letter is now a simple ascii letter (no umlaut).
 That's what I would expect and want.
>
>
>
> If I create a Queryparser using the call "new QueryParser(LUCENE_30, "body", myAnalyzer)
and then call the parse() method passing the same word, I can see that the query parser has
not removed the umlaut.  The string it has is "+body: Europabörsen".
>
This seems to be an issue with your analyzer rather than with the
QueryParser. Since QueryParser didn't really change its behavior in
3.0 except of some default values. Can you provide more info what you
did with your analyzer? Did you try running the term with umlaut chars
through your Analyzer / Tokenstream directly? Something like that:

Analyzer a = new MyAnalyzer();
TokenStream stream = a.reusableTokenStream("body", new
StringReader("Europabörsen"));
TermAttribute attr = stream.addAttribute(TermAttribute.class);
while(stream.incrementToken())
  System.out.println(attr.term());

simon
>
>
> I know I had to make a number of changes to the analyzer and the tokenizer to upgrade
to 3.x.  Is there something very different from the 2.x version that I'm likely missing.
>
>
>
> Anyone have any thoughts?
>
>
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Mime
View raw message