lucene-java-user mailing list archives

From Hankyu Kim <gksr...@gmail.com>
Subject Re: Query beginning with special characters
Date Mon, 14 Jan 2013 07:42:32 GMT
I'm working with Lucene 4.0 and I'm not using Lucene's QueryParser, so
setAllowLeadingWildcard() is irrelevant.
I also realised the issue wasn't with querying but with indexing, which
left out the terms with a leading special character.

My goal was to do fuzzy matching by creating a trigram index. The idea is to
tokenize the documents into trigrams rather than words, both during indexing
and searching, so Lucene can search for part of a word or phrase.

Say the original text in the document said : "Sample text with special
characters :) and such"
It's tokenized into
 'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw', 'twi',
'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia', 'ial', 'alc',
'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte', 'ter', 'ers', 'rs:',
's:)', ':)a', ')an', 'and', 'nds', 'dsu', 'suc', 'uch'.
The above is the output from my tokenizer, so there's nothing wrong with
creating the trigrams. However, when I check the index with lukeall, all the
other trigrams are indexed correctly except for the terms ':)a' and ')an'.
Since the missing terms are related to Lucene's special characters, I
don't think it's got to do with my custom code.
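
For reference, the tokenization described above can be sketched in plain
Java roughly as follows. This is only an illustration, not the actual
tokenizer: the class name TrigramSketch and the method trigrams are
hypothetical, it does not use any Lucene classes, and it assumes (judging
from the sample output) that the text is lowercased and whitespace is
removed before the three-character window slides over it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch; not the custom Lucene tokenizer itself.
public class TrigramSketch {

    // Assumption: lowercase the text and drop all whitespace, then emit
    // every run of three consecutive characters, special characters included.
    static List<String> trigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the trigram list for the sample sentence.
        System.out.println(
            trigrams("Sample text with special characters :) and such"));
    }
}
```

Comparing a list like this against the terms shown in lukeall is one way
to confirm which trigrams go missing between tokenizing and indexing.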

The only change I made was swapping the analyser in IndexFiles.java from the
demo to index the file. Honestly, I can't even locate the exact class in which
the problem is caused. I'm only guessing that IndexWriterConfig or IndexWriter
is discarding the terms with leading special characters.

I hope the above information helps.

2013/1/11 Ian Lea <ian.lea@gmail.com>

> QueryParser has a setAllowLeadingWildcard() method.  Could that be
> relevant?
>
> What version of lucene?  Can you post some simple examples of what
> does/doesn't work? Post the smallest possible, but complete, code that
> demonstrates the problem?
>
>
> With any question that mentions a custom version of something, that
> custom version has to be the prime suspect for any problems.
>
>
> --
> Ian.
>
>
> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksrb92@gmail.com> wrote:
> > Hi.
> >
> > I've created a custom analyzer that treats special characters just like
> > any other. The index works fine all the time even when the query includes
> > special characters, except when the special characters come at the
> > beginning of the query.
> >
> > I'm using SpanTermQuery and WildcardQuery, and they both seem to suffer
> > the same issue with queries beginning with special characters. Is it a
> > limitation of Lucene or am I missing something?
> >
> > Thanks
