lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Indexing accented characters, then searching by any form
Date Mon, 11 Feb 2008 15:56:52 GMT
I'm inferring that you need the original text for display purposes or
some such, but want to search a "canonical" form. So the following may be
totally irrelevant if my inference is wrong.

Indexed and stored are two very distinct things in Lucene. If you create
a field that is both stored and indexed, the indexed part goes through
the analyzer and the stored part does not. So you get the best of both
worlds: your search goes against the analyzed terms, but if you fetch
the field, it comes back in its original form.
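
To make the distinction concrete, here is a toy model in plain Java. It
is only a sketch of the idea, not Lucene's API or implementation: the
class name, the analyze() helper and the single-term search() are all
invented for illustration. The point is just that the analyzed terms and
the stored text live in two separate places.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: NOT Lucene's API. All names here are hypothetical.
public class StoredVsIndexedSketch {
    // "Stored" side: the original text, untouched, addressed by document number.
    private final List<String> stored = new ArrayList<>();
    // "Indexed" side: analyzed term -> numbers of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in for an analyzer: split on non-alphanumerics and lower-case.
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Store the text verbatim and index its analyzed terms.
    public int add(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Single-term search: the query goes through the same analysis as the text.
    public Set<Integer> search(String queryText) {
        List<String> terms = analyze(queryText);
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        return postings.getOrDefault(terms.get(0), new TreeSet<>());
    }

    // Fetching returns the stored original, never the analyzed terms.
    public String fetch(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        int id = index.add("This is some Mixed Case Text");
        System.out.println(index.search("mixed").contains(id));
        System.out.println(index.fetch(id));
    }
}
```

Searching for "mixed" finds the document because both the text and the
query were lower-cased on the indexed side, while fetch() still hands
back the mixed-case original.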

I didn't understand this until after creating the first product using
Lucene, so one of our production applications has some fields stored but
not indexed and the *same* data indexed but not stored. Siiiiggghhh.

Try this code to get comfortable with the idea. It uses case as a
stand-in for accents, but you can easily adapt it to try your accented
cases.

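    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class StoredVsIndexed {
        public static void main(String[] args) throws Exception {
            try {
                // Index one document whose field is both stored and tokenized.
                IndexWriter iw = new IndexWriter("C:/test", new StandardAnalyzer());
                Document doc = new Document();
                doc.add(new Field("blivet", "This is some Mixed Case Text",
                        Field.Store.YES, Field.Index.TOKENIZED));
                iw.addDocument(doc);
                iw.close();

                IndexSearcher search = new IndexSearcher("C:/test");
                QueryParser qp = new QueryParser("blivet", new StandardAnalyzer());
                // Only matches if StandardAnalyzer lower-cased the input.
                Query q = qp.parse("mixed");
                Hits hits = search.search(q);
                System.out.println("Count = " + hits.length());
                // Outputs the mixed-case stored field.
                System.out.println(search.getIndexReader().document(0).get("blivet"));
                search.close();
            } catch (Exception e) {
                System.err.println("Caught Exception");
                System.err.flush();
                e.printStackTrace();
            }
        }
    }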
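
To adapt the experiment to accents, the analysis step has to fold
accented characters to their base letters for both the indexed text and
the query. The sketch below shows only that folding step in plain Java,
using java.text.Normalizer; it is not a Lucene analyzer, and the
AccentFoldSketch class and fold() method are names made up for this
example. In Lucene itself the folding would presumably happen inside the
analyzer, for instance via the ISOLatin1AccentFilter mentioned in the
question.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: folds accents and case.
public class AccentFoldSketch {
    // Decompose each accented character into base letter plus combining
    // marks (NFD), drop the marks, then lower-case the result.
    public static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source encoding.
        String original = "usu\u00e1rio";   // a with acute accent
        String mistyped = "usu\u00e3rio";   // a with tilde
        String plain = "usuario";

        // All three spellings fold to the same searchable term.
        System.out.println(fold(original));
        System.out.println(fold(mistyped));
        System.out.println(fold(plain));
    }
}
```

If the same folding is applied at index time and at query time, any of
the three spellings typed by the user matches, while the stored field
still returns the accented original.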


Best
Erick

On Feb 11, 2008 10:00 AM, Cesar Ronchese <ronchese@hotmail.com> wrote:

>
> Hello, guys.
>
> I've been searching Google for a way to make Lucene perform
> accent-insensitive searches.
>
> All I could find is the ISOLatin1AccentFilter class which, as far as I
> can tell, just removes the accents from characters so that the text is
> stored in its unaccented form.
>
> What I would like to know is: is there a way to store the content in its
> original accented form and still run an accent-insensitive query with
> Lucene? How?
>
> For example:
> Indexed word: usuário
> Terms typed by the user, to find the word above: usuário or usuario or
> usuãrio, etc.
>
> Thanks in advance.
> Cesar
