lucene-java-user mailing list archives

From: Ian Lea <ian....@gmail.com>
Subject: Re: Query beginning with special characters
Date: Mon, 14 Jan 2013 12:07:21 GMT
No problem.  Glad you found the error.  It's always in the custom code
somewhere.


--
Ian.


On Mon, Jan 14, 2013 at 12:04 PM, Hankyu Kim <gksrb92@gmail.com> wrote:
> I just found the cause of error and you were right about my code being the
> source.
> I used "Character.getNumericValue(termBuffer[0]) == -1" to test whether
> termBuffer[0] was still the null character, but it turns out that
> Character.getNumericValue() also returns -1 for special characters.
>
> Thank you for your help.
>
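A quick standalone illustration of the gotcha, in plain Java; the class name is made up and none of this is from the original code. Character.getNumericValue() returns -1 for any character that has no numeric value, so it cannot distinguish the null character from punctuation; comparing against '\u0000' directly is unambiguous.

    public class NumericValueGotcha {
        public static void main(String[] args) {
            System.out.println(Character.getNumericValue('\u0000')); // -1: no numeric value
            System.out.println(Character.getNumericValue(':'));      // -1 as well
            System.out.println(Character.getNumericValue(')'));      // -1 as well
            System.out.println(Character.getNumericValue('7'));      // 7
            System.out.println(Character.getNumericValue('a'));      // 10: letters do have numeric values

            // An unambiguous test for the null character:
            char c = ':';
            System.out.println(c == '\u0000'); // false
        }
    }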
> 2013/1/14 Hankyu Kim <gksrb92@gmail.com>
>
>> I did intend to ignore all the spaces, so that's not the problem.
>>
>> Here's the tokenization chain in my customAnalyser class, which extends Analyzer:
>>     @Override
>>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>>         NGramTokenizer src = new NGramTokenizer(matchVersion, reader); // my custom NGramTokenizer
>>         TokenStream tok = new LowerCaseFilter(matchVersion, src);
>>         return new TokenStreamComponents(src, tok);
>>     }
>>
>> NGramTokenizer's incrementToken() method:
>>     @Override
>>     public boolean incrementToken() throws IOException
>>     {
>>         clearAttributes();
>>         char[] termBuffer = termAtt.buffer();
>>         termAtt.setLength(GRAM_SIZE);
>>
>>         startOffset++;                        // values for the offset attribute
>>         offsetAtt.setOffset(startOffset, startOffset + GRAM_SIZE - 1);
>>
>>         do
>>         {
>>             termBuffer[0] = termBuffer[1];    // shift characters to the left
>>             termBuffer[1] = termBuffer[2];
>>
>>             // Get next non-whitespace character
>>             int c = ' ';
>>             while (Character.isWhitespace(c))
>>             {
>>                 if (position >= dataLength)   // refill the buffer when position goes out of bounds
>>                 {
>>                     if (charUtils.fill(iobuffer, input))
>>                     {
>>                         dataLength = iobuffer.getLength();
>>                         position = 0;
>>                     }
>>                     else    // EOF
>>                         return false;
>>                 }
>>
>>                 c = charUtils.codePointAt(iobuffer.getBuffer(), position); // get the next character
>>                 position++;
>>             }
>>
>>             Character.toChars(c, termBuffer, GRAM_SIZE - 1);
>>             // System.out.print("'" + termBuffer[0] + termBuffer[1] + termBuffer[2] + "', ");
>>             // (this is how I got the output in the last email)
>>         }
>>         while (Character.getNumericValue(termBuffer[0]) == -1);
>>
>>         return true;
>>     }
>>
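For completeness, a sketch of how such an analyzer would typically be wired into indexing, against the Lucene 4.0 API; the customAnalyser constructor argument and the index path are assumptions. Neither IndexWriter nor IndexWriterConfig ever inspects token text, so whatever the analyzer emits is what gets indexed.

    import java.io.File;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexSetup {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new customAnalyser(Version.LUCENE_40); // assumed constructor
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
            IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), iwc);
            // ... writer.addDocument(...) calls; the writer never filters token text itself
            writer.close();
        }
    }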
>> 2013/1/14 Ian Lea <ian.lea@gmail.com>
>>
>>> In fact I see you are ignoring all spaces between words.  Maybe that's
>>> deliberate.  Break it down into the smallest possible complete code
>>> sample that shows the problem and post that.
>>>
>>>
>>> --
>>> Ian.
>>>
>>>
>>> On Mon, Jan 14, 2013 at 11:02 AM, Ian Lea <ian.lea@gmail.com> wrote:
>>> > It won't be IndexWriter or IndexWriterConfig.  What exactly does your
>>> > analyzer do - what is the full chain of tokenization?  Are you saying
>>> > that  ':)a' and ')an' are not indexed?  Surely that is correct given
>>> > your input with a space after the :).  And before as well, so 's:)' is
>>> > also suspect.
>>> >
>>> > --
>>> > Ian.
>>> >
>>> >
>>> > On Mon, Jan 14, 2013 at 7:42 AM, Hankyu Kim <gksrb92@gmail.com> wrote:
>>> >> I'm working with Lucene 4.0 and I didn't use Lucene's QueryParser, so
>>> >> setAllowLeadingWildcard() is irrelevant.
>>> >> I also realised the issue wasn't with querying; it was indexing which
>>> >> left out the terms with leading special characters.
>>> >>
>>> >> My goal was to do fuzzy matching by creating a trigram index. The idea
>>> >> is to tokenize the documents into trigrams rather than words during
>>> >> indexing and searching, so Lucene can match part of a word or phrase.
>>> >>
>>> >> Say the original text in the document said: "Sample text with special
>>> >> characters :) and such"
>>> >> It's tokenized into
>>> >>  'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw', 'twi',
>>> >> 'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia', 'ial', 'alc',
>>> >> 'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte', 'ter', 'ers', 'rs:',
>>> >> 's:)', ':)a', ')an', 'and', 'nds', 'dsu', 'suc', 'uch'.
>>> >> The above is the output from my tokenizer, so there's nothing wrong with
>>> >> creating trigrams. However, when I check the index with lukeall, all the
>>> >> other trigrams are indexed correctly except for the terms ':)a' and ')an'.
>>> >> Since the missing terms involve Lucene's special characters, I don't
>>> >> think it has anything to do with my custom code.
>>> >>
>>> >> I only changed the analyzer in IndexFiles.java from the demo to index
>>> >> the file. Honestly, I can't even locate the exact class in which the
>>> >> problem is caused. I'm only guessing that IndexWriterConfig or
>>> >> IndexWriter is discarding the terms with leading special characters.
>>> >>
>>> >> I hope the above information helps.
>>> >>
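As a side note, a roughly equivalent trigram analyzer can be sketched on Lucene's stock NGramTokenizer from analyzers-common (Lucene 4.0 API). Unlike the custom tokenizer above, the stock tokenizer does not skip whitespace, so cross-word grams like 'xtw' would come out differently.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.util.Version;

    public class TrigramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            NGramTokenizer src = new NGramTokenizer(reader, 3, 3); // minGram = maxGram = 3
            TokenStream tok = new LowerCaseFilter(Version.LUCENE_40, src);
            return new TokenStreamComponents(src, tok);
        }
    }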
>>> >> 2013/1/11 Ian Lea <ian.lea@gmail.com>
>>> >>
>>> >>> QueryParser has a setAllowLeadingWildcard() method.  Could that be
>>> >>> relevant?
>>> >>>
>>> >>> What version of Lucene?  Can you post some simple examples of what
>>> >>> does/doesn't work?  Post the smallest possible, but complete, code
>>> >>> that demonstrates the problem.
>>> >>>
>>> >>>
>>> >>> With any question that mentions a custom version of something, that
>>> >>> custom version has to be the prime suspect for any problems.
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Ian.
>>> >>>
>>> >>>
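For readers who do use the classic QueryParser, the setting mentioned above looks roughly like this (Lucene 4.0 API; the field name "contents" is illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class LeadingWildcardDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser(Version.LUCENE_40, "contents",
                    new StandardAnalyzer(Version.LUCENE_40));
            parser.setAllowLeadingWildcard(true); // off by default: a leading wildcard scans the whole term dictionary
            Query q = parser.parse("*ample");
            System.out.println(q); // contents:*ample
        }
    }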
>>> >>> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksrb92@gmail.com> wrote:
>>> >>> > Hi.
>>> >>> >
>>> >>> > I've created a custom analyzer that treats special characters just
>>> >>> > like any other. The index works fine all the time, even when the
>>> >>> > query includes special characters, except when the special
>>> >>> > characters come at the beginning of the query.
>>> >>> >
>>> >>> > I'm using SpanTermQuery and WildcardQuery, and they both seem to
>>> >>> > suffer the same issue with queries beginning with special
>>> >>> > characters. Is it a limitation of Lucene, or am I missing something?
>>> >>> >
>>> >>> > Thanks
>>> >>>
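Built programmatically, neither query type restricts the first character of a term; the term text just has to match what the analyzer actually put in the index. A sketch; the field name and terms are illustrative.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class SpecialCharQueries {
        public static void main(String[] args) {
            // Neither constructor rejects leading special characters.
            WildcardQuery wq = new WildcardQuery(new Term("contents", ":)a*"));
            SpanTermQuery sq = new SpanTermQuery(new Term("contents", ":)a"));
            System.out.println(wq + "  /  " + sq);
        }
    }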
>>>
>>
