lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Thomas Arni <>
Subject Re: Question about indexing (BrazilianAnalyzer)
Date Wed, 04 Jun 2008 07:38:27 GMT

First of all please please always make sure that you use
exactly the same Analyser during indexing and searching.

I am not confident with the BrazilianAnalyzer, but I saw
in the source code that it does not use a ISOLatin1AccentFilter,
which replaces the accented characters (ç -> c).

Probably the problem is with this accents..
You can check this if you adapt the method tokenStream()
in the BrazilianAnalzyer by including the ISOLatin1AccentFilter
in the filter chain.


Vinicius Carvalho said the following on 03/06/08 20:51:
> Hello there! I'm indexing documents using the BrazilianAnalyzer, and I've
> noticed that many words are not being indexed. I store and index the entire
> doc (I'm doing this in order to present the fragments on the results, don't
> know if its the best way, mostly on large docs, any ideas?). Well using luke
> to check the index I open the stored doc, and its contents contains 17
> occurrences of the word "herança" for instance. But, there's no term for
> this word or it stemm version: "heranc", so searching for this word would
> not return a result for this document.
> I'm pretty sure I'm missing something on the indexing process:
> try {
>             doc.add(new
> Field("contents",docText,Field.Store.YES,Field.Index.TOKENIZED,Field.TermVector.YES));
>             IndexWriter writer = new
> IndexWriter("/java/lucene/portal/cms",new BrazilianAnalyzer()); // gotta
> improve this latter
>             writer.addDocument(doc);
>             writer.close();
>         }
> So, why would these word (and others) not being indexed?
> Regards

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message