From: Bianca Pereira
Date: Thu, 7 Aug 2014 13:46:45 +0100
Subject: Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency
To: java-user
Reply-To: java-user@lucene.apache.org
Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1133a5ae70c10a0500097cfb

--001a1133a5ae70c10a0500097cfb
Content-Type: text/plain; charset=UTF-8

Hi Jack,

Thank you very much. I just switched to the StandardAnalyzer and it is
working as I would like.

But there is something I still cannot understand. If I use the same
analyzer for indexing and for searching, the same term should be parsed
the same way at both stages, shouldn't it? That is why I still don't
understand why the EnglishAnalyzer was not working. Any idea on that?

Best Regards,
Bianca

2014-08-07 12:40 GMT+01:00 Jack Krupansky:

> Generally, the standard analyzer will be a better choice, unless you have
> some special need.
>
> A language-specific analyzer will include stemming. The English analyzer
> includes the Porter stemmer.
>
> Generally, you need to apply a compatible analyzer to query terms to match
> the index, or you need to manually filter your query terms. Sounds like
> maybe a term got stemmed.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Bianca Pereira
> Sent: Thursday, August 7, 2014 7:28 AM
> To: java-user@lucene.apache.org
> Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency
>
> Hi,
>
> I am new to the list and I have been working on a problem for some time
> already. I would like to know if someone has any idea of how I can solve
> it.
>
> Given a term, I want to get the term frequency in a Lucene document.
> When I use the WhitespaceAnalyzer my code works properly, but when I use the
> EnglishAnalyzer it returns 0 as the frequency for any term.
>
> To match a term whether it appears as "term" or "term," in the text,
> the EnglishAnalyzer is the best one to use (I suppose).
>
> Any help is more than welcome.
>
> Best Regards,
> Bianca
>
> ----------------------------
> Here is my code:
>
> TO INDEX
>
> public class LuceneDescriptionIndexer implements Closeable {
>
>     private IndexWriter descWriter;
>
>     public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer analyzer)
>             throws IOException {
>         openIndex(luceneDirectory, analyzer);
>     }
>
>     private void openIndex(Directory directory, Analyzer analyzer) throws IOException {
>         IndexWriterConfig descIwc =
>                 new IndexWriterConfig(LuceneConfig.INDEX_VERSION, analyzer);
>         descWriter = new IndexWriter(directory, descIwc);
>     }
>
>     public void indexDocument(String id, String text) throws IOException {
>         // The id is indexed as a single, unanalyzed token.
>         IndexableField idField = new StringField("id", id, Field.Store.YES);
>
>         // The description is analyzed, stored, and indexed with term frequencies.
>         FieldType fieldType = new FieldType();
>         fieldType.setStoreTermVectors(true);
>         fieldType.setStoreTermVectorPositions(true);
>         fieldType.setIndexed(true);
>         fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
>         fieldType.setStored(true);
>
>         Document doc = new Document();
>         doc.add(idField);
>         doc.add(new Field("description", text, fieldType));
>
>         descWriter.addDocument(doc);
>     }
>
>     @Override
>     public void close() throws IOException {
>         descWriter.commit();
>         descWriter.close();
>     }
> }
>
> TO QUERY
>
> public class LuceneTermStatistics implements TermKBStatistics {
>
>     private IndexReader luceneIndexReader;
>     private Analyzer analyzer;
>     private IndexSearcher searcher;
>
>     public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {
>         this.luceneIndexReader = reader;
>         this.analyzer = analyzer;
>         this.searcher = new IndexSearcher(reader);
>     }
>
>     /**
>      * Create an instance of LuceneTermStatistics from the Config options.
>      */
>     public static LuceneTermStatistics configureInstance(String indexPath, Analyzer analyzer)
>             throws IOException {
>         FSDirectory index = FSDirectory.open(new File(indexPath));
>         DirectoryReader indexReader = DirectoryReader.open(index);
>         return new LuceneTermStatistics(indexReader, analyzer);
>     }
>
>     @Override
>     public int getTermFrequency(String term, String id) throws Exception {
>         int docId = getDocId(id);
>         // Get the vector with the frequency for the term in all documents.
>         // The term is passed as raw bytes; the analyzer field is not applied to it.
>         DocsEnum de = MultiFields.getTermDocsEnum(
>                 luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
>                 "description", new BytesRef(term));
>         // Get the frequency for the document of interest
>         if (de != null) {
>             int docNo;
>             while ((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
>                 if (docNo == docId)
>                     return de.freq();
>             }
>         }
>         return 0;
>     }
>
>     private int getDocId(String id) throws IOException {
>         BooleanQuery idQuery = new BooleanQuery();
>         idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);
>
>         TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>         searcher.search(idQuery, collector);
>         TopDocs topDocs = collector.topDocs();
>         if (topDocs.totalHits == 0)
>             return -1;
>         return topDocs.scoreDocs[0].doc;
>     }
> }

--001a1133a5ae70c10a0500097cfb--
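
On the open question of why the EnglishAnalyzer returned 0: in the quoted getTermFrequency, the term goes straight into new BytesRef(term), so the analyzer held by LuceneTermStatistics is never applied to the lookup term. That fits Jack's point about applying a compatible analyzer to query terms. The sketch below illustrates the mismatch with plain Java collections only. Everything in it is hypothetical: toyNormalize is a stand-in that lowercases, trims punctuation, and strips a trailing "s". It is not the Porter stemmer and uses no Lucene API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class AnalyzerMismatchSketch {

    // Hypothetical stand-in for an analyzer's per-token work. It lowercases,
    // trims leading/trailing punctuation, and strips a trailing "s".
    // This is NOT the Porter stemmer, only an illustration of normalization.
    static String toyNormalize(String token) {
        String t = token.toLowerCase(Locale.ROOT).replaceAll("^\\W+|\\W+$", "");
        if (t.length() > 3 && t.endsWith("s")) {
            t = t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Builds a term -> frequency map for one document, using normalized terms,
    // which is roughly what an analyzed index field ends up containing.
    static Map<String, Integer> indexText(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            String term = toyNormalize(token);
            if (!term.isEmpty()) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Raw lookup: the query string is used exactly as typed.
    static int rawFrequency(Map<String, Integer> index, String term) {
        return index.getOrDefault(term, 0);
    }

    // Analyzed lookup: the query string is normalized the same way as the index.
    static int analyzedFrequency(Map<String, Integer> index, String term) {
        return index.getOrDefault(toyNormalize(term), 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> index = indexText("Analyzers matter. The analyzers, really!");
        System.out.println(rawFrequency(index, "Analyzers"));      // prints 0
        System.out.println(analyzedFrequency(index, "Analyzers")); // prints 2
    }
}
```

With an analyzer that leaves tokens untouched, such as the WhitespaceAnalyzer, the raw lookup happens to match what was indexed, which would explain why that configuration worked. In Lucene itself the likely fix is to run the term through the same Analyzer (for example via its tokenStream method) and look up the resulting token instead of the raw string.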