lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pekka Nykyri <pnyk...@cs.joensuu.fi>
Subject Re: Special characters prevent entity being indexed
Date Wed, 19 Nov 2008 11:18:34 GMT

Thanks for the quick answer!

I haven't specified the analyzer so it should be the StandardAnalyzer. I 
forgot to mention that I'm using Lucene via Hibernate seach where I can 
easily define the fields in the hibernate POJO-classes. But as far as I 
know this shouldn't change things that much because I can use the core 
Lucene.

And I've used Luke already and the indexed special characters are 
represented as "&#161;"(¡) and "&#191;" (¿) in the index.

But the analyzer should have nothing to do with the problem currently 
because the problem is that, those entities that start with "¿" don't get 
indexed at all. And some of those starting with "¡" get indexed and some 
don't. Currently 29 entities don't get indexed at all (8900 in total).

I don't need to be able to search those special characters. I just need 
those entities getting indexed. The other information in those entities is 
more important and it's the names (starting with those special characters) 
that seems to make those entities not getting indexed.

Could I fix this using some analyzer during indexing? Actually I tried 
using custom analyzer with "ISOLatin1AccentFilter()" but it didn't change 
anything. In hibernate search the analyzer is spesified in a property file 
or in the POJO-classes but I didn't seem to get it to work. The text went 
to the index exactly the same way (when I see it with Luke) like before 
and the same entities were still missing.

Good solution for me would be that those special character would get 
deleted alltogether from the index so maybe then they wouldn't cause any 
trouble. Like "¡Fantástico!- blaaba" would be perfectly okay looking like 
"Fantastico- blaaba".

Thanks again in advance,
pn

On Tue, 18 Nov 2008, Erick Erickson wrote:

> What analyzer are you using at index and search time? Typical problems
> include:
> using an analyzer that doesn't understand accented chars (StandardAnalyzer
> for instance)
> using a different anlyzer during search and index.
>
> Search the user list for "accent" and you'll find this kind of problem
> discussed,
> and if that doesn't help we need to know what analyzers you are using and
> what behavior you really want. Typically, for instance, *requiring* a user
> to
> type the upside-down exclamation point to get a match on this field would
> be considered incorrect.
>
> Also, you'd be helped a lot be getting a copy of Luke and examining your
> index
> to see exactly what's been indexed, it'll reveal a lot.
>
> Best
> Erick
>
> On Tue, Nov 18, 2008 at 10:05 AM, Pekka Nykyri <pnykyri@cs.joensuu.fi>wrote:
>
>> Hi!
>>
>> I'm having problems with entities including special characters (Spanish
>> language) not getting indexed.
>>
>> I haven't been able to find the the reason why some entities get indexed
>> while some don't.
>>
>> I have 3 fields that (currently) hold the same value. The value for the
>> fields is example "¡Fantástico!- blaaba". Then when I change ONE of the
>> three values to "¡Fantástico! - blaaba", the entity gets indexed. So
>> chanching only one field makes it to index.
>>
>> But the bigger problem with this is, that I have almost (other fields are
>> almost similar and I don't think they cause the problem) similar entity,
>> with exactly the same three "¡Fantástico!- blaaba" -fields and it gets
>> indexed normally. Even though the "critical" fields are exactly the same.
>>
>> And also all entities where three fields start with "upside down ?"-mark
>> doesn't get indexed.
>>
>> I'm really confused with the problem because I don't seem to be able to
>> find any logic some entities not being indexed even though they are similar
>> to some other. And changing only one value of the three makes it index.
>>
>> Sorry for a really messy message but I just can't explain it more clearly
>> now.
>>
>> Thanks in advance,
>> pn
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>

Mime
View raw message