lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From luther blisset <sabri.br...@gmail.com>
Subject Re: Removing diacritics with ISOLatin1AccentFilter
Date Fri, 24 Jul 2009 11:38:06 GMT

I'm trying to index all the words without accent.
I do the same when I'm querying, I remove the accent and lower case the
search term.
Why should I pass the string through the analyzer?
or what is wrong if don't pass it through the analyzer?
and what are the benefits?
I'm just a newbie with Lucene..
Thanks a lot for your reply :]




Simon Willnauer wrote:
> 
> On Fri, Jul 24, 2009 at 11:41 AM, luther blisset<sabri.bruni@gmail.com>
> wrote:
>>
>> Hi folks,
>> I just upgrading Hibernate Search library of my app and so I had to
>> upgrade
>> Lucene too and pass from 2.2 to 2.4 version.
>> In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't
>> figure
>> how it works.
>> I use a TwoWayFieldBridge to index the data and this is my set method:
>>
>> public void set(String s, Object o, Document document, Field.Store store,
>> Field.Index index, Float aFloat){
>>
>>        //MyObject has a field name
>>        MyObject objectToIndex;
>>
>>        //casting from Object to MyObject
>>        try{
>>            objectToIndex = MyObject.class.cast(o);
>>        }catch(ClassCastException cEx ){}
>>
>>
>>
>>        if (objectToIndex.getName() != null) {
>>
>>            ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
>> StandardTokenizer(new StringReader(objectToIndex.getName())));
>>            filter.removeAccents(objectToIndex.getName().toCharArray(),
>> objectToIndex.getName().length());
>>            Field name = new Field( "name",
>> String.valueOf(objectToIndex.getName()).toLowerCase() , Field.Store.YES,
>> Field.Index.UN_TOKENIZED );
>>
>>            document.add(name);
>>        }
>> }
>>
> I do not really understand what you are trying to do. do you just
> wanna remove the accents from the string and index it without passing
> it through an analyzer?! (Field.Index.UN_TOKENIZED will not pass the
> field value to an analyzer).
> do you wanna index this without an analyzer?!
> 
> If you pass an array to ISOLantin1AccentFilter#removeAccents() the
> processed chars will be written to an private internal char array
> inside the ISOLantin1AccentFilter. You can not use the removeAccents
> method just removing the accents. what you could do as a dirty
> workaround is the following:
> String foo = "HÄllo HÄllo HÄllo HÄllo HÄllo";
>   ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
>       new Tokenizer(new StringReader(foo)){
>         private boolean isRead = false;
>         public Token next(final Token reusableToken) throws IOException {
>           if(isRead){
>             return null;
>           }
>           BufferedReader reader = new BufferedReader(this.input);
>           StringBuilder builder = new StringBuilder();
> 
>          char[] buffer = new char[1024];
>          int read = -1;
>          while((read = reader.read(buffer)) > 0){
>            builder.append(buffer, 0, read);
>          }
>          reusableToken.setTermText(builder.toString());
>          isRead = true;
>          return reusableToken;
>         }
>   });
>     Token t = filter.next();
>     String foo_without_accents = t.term();
>     System.out.println(foo_without_accents);
> yields: HAllo HAllo HAllo HAllo HAllo
> 
> 
> simon
>>
>> but it doesn't work. And if pass an accented word for the property
>> objectToIndex.getName(), it remains with accent :(
>> I think there is something wrong in my code when I create the new
>> instance
>> of ISOLatin1AccentFilter  but I can' t get it works properly.
>> Could someone help me?
>> thanks a lot
>> --
>> View this message in context:
>> http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24641618.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24643036.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message