lucene-dev mailing list archives

From lucene-...@jakarta.apache.org
Subject [Jakarta Lucene Wiki] New: IndexingOtherLanguages
Date Thu, 08 Jul 2004 13:27:24 GMT
   Date: 2004-07-08T06:27:24
   Editor: 128.230.38.21 <>
   Wiki: Jakarta Lucene Wiki
   Page: IndexingOtherLanguages
   URL: http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages

   no comment

New Page:

= How To Index non-English Languages using Lucene =

Lucene is a Java-based, Unicode-compatible library for integrating search into applications.
 [[BR]]

With a little extra effort, it is quite easy to index and search non-English documents
(and even to search non-English documents using English!).

This document does not go into the details of how to set up Lucene to index and search (using
readers, etc.).  Those are best covered in other pages such as IntroductionToLucene and other
HowTo tutorials, as well as in the many excellent articles available online.  It is also assumed
that the reader understands how the Lucene Analyzer works (if not, see IntroductionToLucene and
AnalysisParalysis).

There are several key items you will need to consider when indexing:

 1. Know the encoding of the documents you wish to index.  Java assumes the platform's default
encoding when reading in files unless you tell it otherwise.  To create a Reader that supports
reading in other encodings, see
[http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStreamReader.html InputStreamReader].
I find it easiest to convert all of my files to UTF-8 before indexing, and then I read them in
by doing:[[BR]]
    `Reader reader = new InputStreamReader(new FileInputStream("path to file"), "UTF-8");`
Note:  The demo supplied with Lucene does not support UTF-8 out of the box.  You will have
to modify it.

 2. Identify the Analyzer you will use, or write your own if none exists.  There are many great
analyzers available that will index a wide variety of languages.  See the
[http://jakarta.apache.org/lucene/docs/lucene-sandbox/ Sandbox] for some.  Otherwise, look
around the web.  If you are writing your own, consider donating it to the Lucene Sandbox so that
others can benefit from your brilliance.  See item 3 below for what is needed in a custom
analyzer.
     'Put example of writing an Analyzer here'

 3. The key to proper analysis is to identify what you want your final tokens to be.  Do you
want them tokenized, stemmed, lowercased, with all stop words removed, etc.?  With non-English
languages, many people have a hard time finding tokenizers and stemmers for the language they
are interested in.  There are many great sites out there that provide solutions to these
problems; one just needs to look.  Often, a simple Google search for something like
"arabic tokenizer" will do the trick.  Other times, you may need to dig into some academic
papers to find a description of the problem.  Another great resource is the Lucene User mailing
list archives.  Chances are you aren't the first one to tackle the language.
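
For item 1 above, the convert-to-UTF-8 step can be sketched with nothing but the standard Java
I/O classes.  This is only a rough illustration, not part of Lucene: the class name
`EncodingSketch` and its methods are made up for this example, and it assumes you already know
the source encoding of each file (here passed in as a charset name such as "ISO-8859-1").

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper class for illustration; not part of Lucene.
class EncodingSketch {

    // Opens a file with an explicit charset instead of the platform default.
    static Reader open(String path, String charsetName) throws IOException {
        return new InputStreamReader(new FileInputStream(path), charsetName);
    }

    // Re-encodes a file from a known source charset to UTF-8.
    static void convertToUtf8(String inPath, String sourceCharset, String outPath)
            throws IOException {
        try (Reader in = open(inPath, sourceCharset);
             Writer out = new OutputStreamWriter(new FileOutputStream(outPath), "UTF-8")) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // Reads a whole file into a String using the given charset.
    static String readAll(String path, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = open(path, charsetName)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

After conversion, the Reader returned by `EncodingSketch.open(path, "UTF-8")` is the same kind
of Reader shown in item 1.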

Once you have your Analyzer set up and your documents indexed, take a look at the index using
[http://www.getopt.org/luke/ Luke] to check that the tokens are what you expect.

Searching works just as in the English case.  Make sure you analyze your queries with the same
analyzer you used for indexing.
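
To see why the same analysis has to be applied on both sides, here is a small sketch that does
not use the Lucene API at all.  It is a plain-Java stand-in for an analyzer (split on
non-letters, lowercase, drop stop words); the class name `AnalysisSketch`, the `analyze` method
and the tiny German stop-word list are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer; this is NOT the Lucene Analyzer API.
class AnalysisSketch {

    // Illustrative stop-word list only; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und"));

    // Splits on runs of non-letter characters, lowercases, and drops stop words.
    static List<String> analyze(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\P{L}+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            String token = raw.toLowerCase(locale);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Die Bücher und das Haus", written with escapes to avoid source-encoding issues.
        List<String> indexed = analyze("Die B\u00fccher und das Haus", Locale.GERMAN);
        List<String> query = analyze("B\u00dcCHER", Locale.GERMAN);

        // The analyzed query matches the indexed tokens.
        System.out.println(indexed.containsAll(query));       // true
        // The raw, unanalyzed query term does not.
        System.out.println(indexed.contains("B\u00dcCHER"));  // false
    }
}
```

The same holds with a real Analyzer: if the query goes through a different token pipeline than
the documents did, terms that look identical to a reader may simply never match.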




