lucene-dev mailing list archives

From slagrau...@ccr.fr
Subject Re: special character with lucene
Date Mon, 28 Feb 2005 15:55:28 GMT
                                                                           
Philipp_Breuss@sonydadc.com
28/02/2005 16:36

To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
cc:
Please reply to: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Re: special character with lucene

Usually the text is in one specific language: English, German, Spanish,
French, ...
However, I don't have a runtime identifier that tells me which language it
is. I could only pick a few words and decide from there. Is that a good
idea?

Is there a tool in Lucene that helps decide which language a specific
text is in?

      There's a patch contributed by JF Halleux, but I haven't tried it yet:
      http://issues.apache.org/bugzilla/show_bug.cgi?id=26763
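The "pick a few words and decide" idea can be sketched with nothing but the JDK. This is only an illustration, not the patch linked above and not part of Lucene: the class name `LanguageGuesser` and the tiny stop-word lists are assumptions made up for this example, and a real detector would need much larger word lists or character n-gram statistics.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a Lucene class): guesses a language by counting
// how many tokens of the text appear in a small per-language stop-word list.
class LanguageGuesser {
    // Illustrative word lists only; far too small for real-world use.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "and", "of", "is", "to", "in"));
        STOP_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht"));
        STOP_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "des"));
        STOP_WORDS.put("es", Set.of("el", "los", "las", "y", "es", "del"));
    }

    /** Returns the language code with the most stop-word hits, or "unknown". */
    static String guess(String text) {
        // Split on every run of characters that are not Unicode letters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie the earlier language in the map wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("Der Apfel ist nicht rot und die Birne ist gelb"));
        System.out.println(guess("The apple is red and the pear is in the basket"));
    }
}
```

The guessed code could then be used to choose the analyzer per document. Short texts such as names will often contain no stop words at all, in which case this sketch returns "unknown" and a fallback is needed.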


In a simple test I noticed that StandardAnalyzer removes special
characters like ä, ö, ... If I leave the characters the way they are, I
no longer find, e.g., the German word "Äpfel". So it looks like there are
only two solutions:

a)      - decide which language it is by choosing a few words from the
text
        - use the language-specific analyzer; where do I find the Spanish
and French analyzers?


      The FrenchAnalyzer is in the Sandbox; I don't know about the Spanish
one.


b)      - replace each special character (ä, ö, ...) with some code (&#239;,
...). There is no stemming then.


      The analyzer will translate everything for you.
      However, I think that you'll have to use the same analyzer to search,
so you'll only be able to search in one language at a time
      (which doesn't seem so bad after all).
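A variant of option b) that avoids HTML codes is to fold accented characters to their base letters, applied identically to the indexed text and to each query. The sketch below uses only `java.text.Normalizer` from the JDK. The class `AccentFolder` is a hypothetical name for this example and is not something Lucene provides under that name. Note the assumptions: folding makes "Äpfel" and "apfel" indistinguishable, it does not apply the German "ä" to "ae" convention, and characters without a canonical decomposition (such as "ß") pass through unchanged.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper (not a Lucene class): strips diacritics so that the
// same folded form is produced at index time and at query time.
class AccentFolder {
    static String fold(String text) {
        // NFD decomposes "\u00C4" (A with diaeresis) into "A" + U+0308.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Remove all combining marks, then lowercase the result.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexed = fold("\u00C4pfel"); // the word "Äpfel" as stored in the index
        String query = fold("apfel");        // what a user might type
        System.out.println(indexed + " matches " + query + ": " + indexed.equals(query));
    }
}
```

As the reply above says for analyzers, the same transformation has to be used for indexing and for searching, otherwise the folded terms in the index will not match the unfolded query terms.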

Any help is appreciated,
Greetings,
Philipp






Erik Hatcher <erik@ehatchersolutions.com>
28.02.2005 16:17
Please reply to: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
cc:
Subject: Re: special character with lucene








On Feb 28, 2005, at 10:01 AM, Philipp_Breuss@sonydadc.com wrote:
> Hello,
> I would like to build a search engine that handles several different
> languages,
> e.g. Spanish names, French names, ...

Will your text be a mix of languages within a single field?  Or would
each document (or field) be a single language?

> - Using a different analyzer for each language would be one solution.

You will most likely have to use a different analyzer for each
language, though that depends on the answers to the above.

> - But how about replacing each special character (umlauts etc.: ä, ö,
> ...)
> with its HTML entity before indexing and doing the same with
> each search query before searching?

An HTML entity is more than one character.  The simplest is to leave
the characters as-is, in Unicode.
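To see why a multi-character entity is a problem, consider what a tokenizer that splits on non-letters does with it. The sketch below is a deliberately simplified stand-in written for this example (the class `EntityTokenDemo` is hypothetical); it is not Lucene's StandardAnalyzer, whose exact tokenization rules may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified letter tokenizer for illustration only; not a Lucene class.
class EntityTokenDemo {
    /** Splits text into lowercase runs of letters, dropping everything else. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Entity-encoded form: '&' and ';' break the word into two tokens.
        System.out.println(tokens("&Auml;pfel"));
        // Unicode form ("\u00C4" is A with diaeresis): stays a single token.
        System.out.println(tokens("\u00C4pfel"));
    }
}
```

With this tokenizer the entity-encoded word becomes the two unrelated tokens "auml" and "pfel", while the Unicode form remains one token, which is the practical argument for leaving the characters as-is.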

> This seems to me the simplest approach to handling this issue?
>
> What are the drawbacks? No Stem search? Other considerations?

Stemming is language-specific, which factors into your analyzer(s)
choices.

                 Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org







