lucene-java-user mailing list archives

From Luis Pureza <pur...@gmail.com>
Subject Re: Lucene QueryParser/Analyzer inconsistency
Date Thu, 19 Jun 2014 09:09:22 GMT
Unfortunately I spoke too soon. While the original example seems to have
been fixed, I'm still getting some unexpected results.

As per your suggestion, I modified the Analyzer to:

    @Override
    protected TokenStreamComponents createComponents(String field, Reader in) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("/", " "); // Transform all forward slashes into whitespace
        Reader mappingFilter = new MappingCharFilter(builder.build(), in);

        Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
        return new TokenStreamComponents(tokenizer);
    }

When I try this:

        QueryParser parser = new QueryParser(Version.LUCENE_48, "f", new MyAnalyzer(Version.LUCENE_48));
        System.err.println(parser.parse(QueryParser.escape("one/two")));

I get

    f:one f:two

as expected.

However, if I change the text to "hello one/two", I get:

    f:hello f:one/two

I can't figure out what's going on. My custom tokenizer seems to work well,
but I'd rather use Lucene's built-ins.

Thank you,

Luis



On Wed, Jun 18, 2014 at 3:38 PM, Luis Pureza <pureza@gmail.com> wrote:

> Thanks, that did work.
>
>
>
> On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky <jack@basetechnology.com>
> wrote:
>
>> Yeah, this is kind of tricky and confusing! Here's what happens:
>>
>> 1. The query parser "parses" the input string into individual source
>> terms, each delimited by white space. The escape is removed in this
>> process, but... no analyzer has been called at this stage.
>>
>> 2. The query parser (generator) calls the analyzer for each source term.
>> Your analyzer is called at this stage, but... the escape is already gone,
>> so... the <backslash><slash> mapping rule is not triggered, leaving the
>> slash recorded in the source term from step 1.
>>
>> You do need the backslash in your original query because a slash
>> introduces a regex query term. It is added by the escape method you call,
>> but the escaping will be gone by the time your analyzer is called.
>>
>> So, just try a simple, unescaped slash in your char mapping table.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Luis Pureza
>> Sent: Tuesday, June 17, 2014 1:43 PM
>> To: java-user@lucene.apache.org
>> Subject: Lucene QueryParser/Analyzer inconsistency
>>
>>
>> Hi,
>>
>> I'm experiencing a puzzling behaviour with the QueryParser and was hoping
>> someone around here can help me.
>>
>> I have a very simple Analyzer that tries to replace forward slashes (/) with
>> spaces. Because QueryParser forces me to escape strings with slashes before
>> parsing, I added a MappingCharFilter to the analyzer that replaces "\/"
>> with a single space. The analyzer is defined as follows:
>>
>> @Override
>> protected TokenStreamComponents createComponents(String field, Reader in) {
>>    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>>    builder.add("\\/", " ");
>>    Reader mappingFilter = new MappingCharFilter(builder.build(), in);
>>
>>    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
>>    return new TokenStreamComponents(tokenizer);
>> }
>>
>> Then I use this analyzer in the QueryParser to parse a string with slashes:
>>
>> String text = QueryParser.escape("one/two");
>> QueryParser parser = new QueryParser(Version.LUCENE_48, "f", new MyAnalyzer(Version.LUCENE_48));
>> System.err.println(parser.parse(text));
>>
>> The expected output would be
>>
>> f:one f:two
>>
>> However, I get:
>>
>> f:one/two
>>
>> The puzzling thing is that when I debug the analyzer, it tokenizes the
>> input string correctly, returning two tokens instead of one.
>>
>> What is going on?
>>
>> Many thanks,
>>
>> Luís Pureza
>>
>> P.S.: I was able to fix this issue temporarily by creating my own tokenizer
>> that tokenizes on whitespace and slashes. However, I still don't understand
>> what's going on.
>>
>>
>
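
For readers following the thread, the two steps Jack describes in the quoted reply can be modelled in plain Java. This is only a rough sketch under stated assumptions: it does not use Lucene at all, and the class and method names (TwoStageParseSketch, splitAndUnescape, analyze, parse) are hypothetical stand-ins for the query parser's term splitting and the analyzer's char mapping. It only illustrates why a mapping rule for "\/" never fires while a rule for "/" does. It does not attempt to reproduce the "hello one/two" result reported at the top of this message.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the two steps; none of these names are Lucene APIs. */
public class TwoStageParseSketch {

    /** Step 1 (hypothetical): split on whitespace and drop backslash escapes. */
    static List<String> splitAndUnescape(String query) {
        List<String> terms = new ArrayList<>();
        for (String raw : query.trim().split("\\s+")) {
            if (raw.isEmpty()) continue;
            StringBuilder term = new StringBuilder();
            for (int i = 0; i < raw.length(); i++) {
                char c = raw.charAt(i);
                if (c == '\\' && i + 1 < raw.length()) {
                    c = raw.charAt(++i); // keep the escaped character, drop the backslash
                }
                term.append(c);
            }
            terms.add(term.toString());
        }
        return terms;
    }

    /** Step 2 (hypothetical): replace `from` with a space, then tokenize on whitespace. */
    static List<String> analyze(String term, String from) {
        List<String> tokens = new ArrayList<>();
        for (String token : term.replace(from, " ").trim().split("\\s+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    /** Runs step 1, then step 2 on each source term, as in the quoted explanation. */
    static List<String> parse(String query, String from) {
        List<String> out = new ArrayList<>();
        for (String term : splitAndUnescape(query)) {
            out.addAll(analyze(term, from));
        }
        return out;
    }

    public static void main(String[] args) {
        String escaped = "one\\/two"; // the escaped form of "one/two", i.e. one\/two
        System.out.println(parse(escaped, "\\/")); // [one/two]  (rule never matches)
        System.out.println(parse(escaped, "/"));   // [one, two]
    }
}
```

In this model the backslash is already gone by the time analyze sees the term, so only the unescaped "/" rule can match, which is the change Jack suggested.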
