From: "Lucas F. A. Teixeira"
Date: Wed, 04 Jun 2008 09:58:58 -0300
To: java-user@lucene.apache.org
Subject: Re: Question about indexing (BrazilianAnalyzer)
Message-ID: <48469192.3020605@accurate.com.br>
In-Reply-To: <9d2777b60806031251h8f6e20fn15f2adab1e02abea@mail.gmail.com>

Are you using ISOLatin1AccentFilter?

[]s,

Lucas Frare A. Teixeira
lucas.teixeira@accurate.com.br
Tel: +55 11 3660.1622 - R3018

Vinicius Carvalho wrote:
> Hello there! I'm indexing documents using the BrazilianAnalyzer, and I've
> noticed that many words are not being indexed. I store and index the
> entire doc (I'm doing this in order to present the fragments in the
> results; I don't know if it's the best way, especially for large docs.
> Any ideas?). Using Luke to check the index, I open the stored doc, and
> its contents contain 17 occurrences of the word "herança", for instance.
> But there's no term for this word or its stemmed version, "heranc", so
> searching for this word would not return a result for this document.
>
> I'm pretty sure I'm missing something in the indexing process:
>
> try {
>     doc.add(new Field("contents", docText, Field.Store.YES,
>             Field.Index.TOKENIZED, Field.TermVector.YES));
>     IndexWriter writer = new IndexWriter("/java/lucene/portal/cms",
>             new BrazilianAnalyzer()); // gotta improve this later
>     writer.addDocument(doc);
>     writer.close();
> }
>
> So, why would these words (and others) not be indexed?
>
> Regards
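
For context on the question about ISOLatin1AccentFilter: that filter replaces accented ISO Latin 1 characters with their unaccented equivalents, so a token such as "herança" would be reduced to "heranca" before any further processing. The sketch below is not Lucene code and does not reproduce the filter's implementation. It only illustrates the same kind of folding with the JDK's java.text.Normalizer, and the names AccentFoldSketch and fold are hypothetical. Whether accent handling is actually the cause of the missing terms in this thread is not established.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: illustrates the kind of accent
// folding that a filter such as ISOLatin1AccentFilter applies to tokens.
final class AccentFoldSketch {

    // Decompose accented characters (NFD), then drop the combining marks,
    // so that "herança" becomes "heranca".
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String stored = "heran\u00e7a"; // "herança", as it appears in the stored document
        String typed = "heranca";       // the same word typed without the cedilla
        System.out.println(fold(stored));
        System.out.println(fold(stored).equals(fold(typed)));
    }
}
```

If folding of this kind is applied at index time, the same folding has to be applied to query terms as well, otherwise the indexed and searched forms of a word will not match.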