lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Special characters prevent entity being indexed
Date Wed, 19 Nov 2008 13:51:43 GMT
I'm going to have to punt on what Hibernate does/doesn't do since I have no
experience there.

But in general analyzers are very important. StandardAnalyzer, for instance,
tries
to recognize e-mail addresses. So it'll create some very interesting tokens,
some
that are unexpected unless you really know the guts of the tokenizer
underneath.

What I expect is really happening is that some of your tokens are getting
munged
in ways you don't expect and that they're getting indexed, but you're not
finding
the munged version because they're somewhere else in the index. Either that
or
you're getting some sort of error condition that you're throwing away.....

So here's the approach I would take. Make a list of your tokens that you're
having
trouble with and create an index with *only* those tokens, to see whether
they
really get into the index or not.

Make really, really sure you use the same analyzer BOTH at index and search
time. I've seen ISOLatin1AccentFilter() recommended, but I confess that I
don't know the underlying details.

Perhaps you could create your own filter/analyzer based on the analyzer of
your
choice, override the next() method and spit out the tokens that are produced
to examine the results closely.

Anyway, general advice that probably doesn't address your issue, I know. But
the best I have this morning. I'll appeal to people who understand Hibernate
now <G>.

Best
Erick

On Wed, Nov 19, 2008 at 6:18 AM, Pekka Nykyri <pnykyri@cs.joensuu.fi> wrote:

>
> Thanks for the quick answer!
>
> I haven't specified the analyzer so it should be the StandardAnalyzer. I
> forgot to mention that I'm using Lucene via Hibernate seach where I can
> easily define the fields in the hibernate POJO-classes. But as far as I know
> this shouldn't change things that much because I can use the core Lucene.
>
> And I've used Luke already and the indexed special characters are
> represented as "&#161;"(¡) and "&#191;" (¿) in the index.
>
> But the analyzer should have nothing to do with the problem currently
> because the problem is that, those entities that start with "¿" don't get
> indexed at all. And some of those starting with "¡" get indexed and some
> don't. Currently 29 entities don't get indexed at all (8900 in total).
>
> I don't need to be able to search those special characters. I just need
> those entities getting indexed. The other information in those entities is
> more important and it's the names (starting with those special characters)
> that seems to make those entities not getting indexed.
>
> Could I fix this using some analyzer during indexing? Actually I tried
> using custom analyzer with "ISOLatin1AccentFilter()" but it didn't change
> anything. In hibernate search the analyzer is spesified in a property file
> or in the POJO-classes but I didn't seem to get it to work. The text went to
> the index exactly the same way (when I see it with Luke) like before and the
> same entities were still missing.
>
> Good solution for me would be that those special character would get
> deleted alltogether from the index so maybe then they wouldn't cause any
> trouble. Like "¡Fantástico!- blaaba" would be perfectly okay looking like
> "Fantastico- blaaba".
>
> Thanks again in advance,
> pn
>
>
> On Tue, 18 Nov 2008, Erick Erickson wrote:
>
>  What analyzer are you using at index and search time? Typical problems
>> include:
>> using an analyzer that doesn't understand accented chars (StandardAnalyzer
>> for instance)
>> using a different anlyzer during search and index.
>>
>> Search the user list for "accent" and you'll find this kind of problem
>> discussed,
>> and if that doesn't help we need to know what analyzers you are using and
>> what behavior you really want. Typically, for instance, *requiring* a user
>> to
>> type the upside-down exclamation point to get a match on this field would
>> be considered incorrect.
>>
>> Also, you'd be helped a lot be getting a copy of Luke and examining your
>> index
>> to see exactly what's been indexed, it'll reveal a lot.
>>
>> Best
>> Erick
>>
>> On Tue, Nov 18, 2008 at 10:05 AM, Pekka Nykyri <pnykyri@cs.joensuu.fi
>> >wrote:
>>
>>  Hi!
>>>
>>> I'm having problems with entities including special characters (Spanish
>>> language) not getting indexed.
>>>
>>> I haven't been able to find the the reason why some entities get indexed
>>> while some don't.
>>>
>>> I have 3 fields that (currently) hold the same value. The value for the
>>> fields is example "¡Fantástico!- blaaba". Then when I change ONE of the
>>> three values to "¡Fantástico! - blaaba", the entity gets indexed. So
>>> chanching only one field makes it to index.
>>>
>>> But the bigger problem with this is, that I have almost (other fields are
>>> almost similar and I don't think they cause the problem) similar entity,
>>> with exactly the same three "¡Fantástico!- blaaba" -fields and it gets
>>> indexed normally. Even though the "critical" fields are exactly the same.
>>>
>>> And also all entities where three fields start with "upside down ?"-mark
>>> doesn't get indexed.
>>>
>>> I'm really confused with the problem because I don't seem to be able to
>>> find any logic some entities not being indexed even though they are
>>> similar
>>> to some other. And changing only one value of the three makes it index.
>>>
>>> Sorry for a really messy message but I just can't explain it more clearly
>>> now.
>>>
>>> Thanks in advance,
>>> pn
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message