lucene-java-user mailing list archives

From Hankyu Kim <gksr...@gmail.com>
Subject Re: Query beginning with special characters
Date Mon, 14 Jan 2013 11:42:14 GMT
I did intend to ignore all the spaces, so that's not the problem.

Here's the tokenization chain in my customAnalyser class, which extends Analyzer:
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        NGramTokenizer src = new NGramTokenizer(matchVersion, reader); // my custom NGramTokenizer
        TokenStream tok = new LowerCaseFilter(matchVersion, src);
        return new TokenStreamComponents(src, tok);
    }

And my NGramTokenizer's incrementToken() method:
    @Override
    public boolean incrementToken() throws IOException
    {
        clearAttributes();
        char[] termBuffer = termAtt.buffer();
        termAtt.setLength(GRAM_SIZE);

        startOffset++;                              // values for the offset attribute
        offsetAtt.setOffset(startOffset, startOffset + GRAM_SIZE - 1);

        do
        {
            termBuffer[0] = termBuffer[1];          // shift characters to the left
            termBuffer[1] = termBuffer[2];

            // Get the next non-whitespace character
            int c = ' ';
            while (Character.isWhitespace(c))
            {
                if (position >= dataLength)         // refill the buffer if position is out of bounds
                {
                    if (charUtils.fill(iobuffer, input))
                    {
                        dataLength = iobuffer.getLength();
                        position = 0;
                    }
                    else                            // EOF
                        return false;
                }

                c = charUtils.codePointAt(iobuffer.getBuffer(), position); // get the next character
                position++;
            }

            Character.toChars(c, termBuffer, GRAM_SIZE - 1);
            // System.out.print("'" + termBuffer[0] + termBuffer[1] + termBuffer[2] + "', ");
            // This is how I got the output in the last email.
        }
        while (Character.getNumericValue(termBuffer[0]) == -1);

        return true;
    }
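For reference, the exit condition of the do-while loop can be exercised in isolation. Below is a minimal plain-Java sketch (not Lucene; the class name TrigramCheck and the inlined sample sentence are mine, taken from the earlier mail in this thread). It lowercases the text, squashes whitespace, emits every trigram, and applies the same Character.getNumericValue(...) == -1 check to each trigram's first character that the loop above uses as its continue condition:

```java
// Standalone sketch of the trigram logic, assuming the sample sentence
// from the earlier mail. Not the actual tokenizer - just the same
// shift-and-test idea applied to a plain String.
public class TrigramCheck {
    public static void main(String[] args) {
        String text = "Sample text with special characters :) and such";
        // Lowercase and drop all whitespace, as the tokenizer chain does.
        String squashed = text.toLowerCase().replaceAll("\\s+", "");
        for (int i = 0; i + 3 <= squashed.length(); i++) {
            String gram = squashed.substring(i, i + 3);
            // Same test as the do-while condition: a first character with
            // no numeric value (e.g. ':' or ')') causes the gram to be
            // overwritten by the next iteration instead of being emitted.
            boolean kept = Character.getNumericValue(gram.charAt(0)) != -1;
            System.out.println("'" + gram + "' kept=" + kept);
        }
    }
}
```

On this input the only grams flagged kept=false are ':)a' and ')an', which are exactly the two terms reported missing from the index.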

2013/1/14 Ian Lea <ian.lea@gmail.com>

> In fact I see you are ignoring all spaces between words.  Maybe that's
> deliberate.  Break it down into the smallest possible complete code
> sample that shows the problem and post that.
>
>
> --
> Ian.
>
>
> On Mon, Jan 14, 2013 at 11:02 AM, Ian Lea <ian.lea@gmail.com> wrote:
> > It won't be IndexWriter or IndexWriterConfig.  What exactly does your
> > analyzer do - what is the full chain of tokenization?  Are you saying
> > that  ':)a' and ')an' are not indexed?  Surely that is correct given
> > your input with a space after the ':)'.  And before as well, so 's:)' is
> > also suspect.
> >
> > --
> > Ian.
> >
> >
> > On Mon, Jan 14, 2013 at 7:42 AM, Hankyu Kim <gksrb92@gmail.com> wrote:
> >> I'm working with Lucene 4.0 and I didn't use Lucene's QueryParser, so
> >> setAllowLeadingWildcard() is irrelevant.
> >> I also realised the issue wasn't with querying; it was indexing which
> >> left out the terms with a leading special character.
> >>
> >> My goal was to do a fuzzy match by creating a trigram index. The idea is
> >> to tokenize the documents into trigrams rather than words, during both
> >> indexing and searching, so Lucene can match part of a word or phrase.
> >>
> >> Say the original text in the document is: "Sample text with special
> >> characters :) and such"
> >> It's tokenized into
> >> 'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw', 'twi',
> >> 'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia', 'ial', 'alc',
> >> 'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte', 'ter', 'ers', 'rs:',
> >> 's:)', ':)a', ')an', 'and', 'nds', 'dsu', 'suc', 'uch'.
> >> The above is the output from my tokenizer, so there's nothing wrong with
> >> creating trigrams. However, when I check the index with lukeall, all the
> >> other trigrams are indexed correctly except for the terms ':)a' and ')an'.
> >> Since the missing terms start with Lucene's special characters, I don't
> >> think it has to do with my custom code.
> >>
> >> I only changed the analyser in IndexFiles.java from the demo to index
> >> the file. Honestly, I can't even locate the exact class in which the
> >> problem is caused. I'm only guessing that IndexWriterConfig or
> >> IndexWriter is discarding the terms with leading special characters.
> >>
> >> I hope the above information helps.
> >>
> >> 2013/1/11 Ian Lea <ian.lea@gmail.com>
> >>
> >>> QueryParser has a setAllowLeadingWildcard() method.  Could that be
> >>> relevant?
> >>>
> >>> What version of lucene?  Can you post some simple examples of what
> >>> does/doesn't work? Post the smallest possible, but complete, code that
> >>> demonstrates the problem?
> >>>
> >>>
> >>> With any question that mentions a custom version of something, that
> >>> custom version has to be the prime suspect for any problems.
> >>>
> >>>
> >>> --
> >>> Ian.
> >>>
> >>>
> >>> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksrb92@gmail.com> wrote:
> >>> > Hi.
> >>> >
> >>> > I've created a custom analyzer that treats special characters just
> >>> > like any other. The index works fine all the time, even when the
> >>> > query includes special characters, except when the special characters
> >>> > come at the beginning of the query.
> >>> >
> >>> > I'm using SpanTermQuery and WildcardQuery, and they both seem to
> >>> > suffer the same issue with queries beginning with special characters.
> >>> > Is it a limitation of Lucene or am I missing something?
> >>> >
> >>> > Thanks
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>
> >>>
>
