lucenenet-user mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: Getting Per Document Frequencies in Apache Lucenenet 4.8.0.0
Date Fri, 05 May 2017 10:36:55 GMT
You don't need any of the code after the TopDocs topDocs line. Just read
topDocs.TotalHits
and make sure your query is a PhraseQuery; that is how you will know all
hits are the phrases you were searching for.
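
A minimal sketch of that suggestion, assuming Lucene.NET 4.8 and a two-word
phrase. The helper name CountPhraseDocs and its field/term parameters are
hypothetical and not from the original code; the terms must be passed in the
form the analyzer indexed them (for example lowercased).

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

static class PhraseHitCount
{
    // Hypothetical helper: returns the number of documents that match
    // the phrase "<first> <second>" in the given field.
    public static int CountPhraseDocs(IndexReader reader, string field,
                                      string first, string second)
    {
        // Build an exact phrase query (default slop of 0).
        var phraseQuery = new PhraseQuery();
        phraseQuery.Add(new Term(field, first));
        phraseQuery.Add(new Term(field, second));

        var searcher = new IndexSearcher(reader);

        // Only the total is needed, so ask for a single top document.
        TopDocs topDocs = searcher.Search(phraseQuery, 1);
        return topDocs.TotalHits;
    }
}
```

Note that TotalHits counts matching documents; it does not by itself report
how many times the phrase occurs inside each document.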

--

Itamar Syn-Hershko
Freelance Developer & Consultant
Elasticsearch Partner
Microsoft MVP | Lucene.NET PMC
http://code972.com | @synhershko <https://twitter.com/synhershko>
http://BigDataBoutique.co.il/

On Wed, Apr 5, 2017 at 12:52 AM, William Young <wyoung@streetdiligence.com>
wrote:

> I'm using this version of Lucene.NET: https://github.com/apache/lucenenet
>
> I'm trying to get the number of phrase matches per document using a
> PhraseQuery and an ExactPhraseScorer like so:
>
> // Some phraseQuery defined here
>
> using (IndexReader indexReader =
>     DirectoryReader.Open(IndexerJob.LuceneDirectory))
> {
>     IndexSearcher indexSearcher = new IndexSearcher(indexReader);
>
>     TopDocs topDocs = indexSearcher.Search(masterQuery, _MAXSEARCHRESULTS);
>     var weight = phraseQuery.CreateWeight(indexSearcher);
>
>     var scorers = indexReader.Leaves.Select(o => weight.Scorer(o,
>         o.AtomicReader.LiveDocs)).Where(o => o != null);
>     foreach (var scorer in scorers)
>     {
>         while (scorer.NextDoc() != DocIdSetIterator.NO_MORE_DOCS)
>         {
>             int doc = scorer.DocID();
>             int freq = scorer.Freq();
>             Console.WriteLine("Document {0} contains {1} matches", doc, freq);
>         }
>     }
> }
>
> But when I call scorer.NextDoc(), it always returns
> DocIdSetIterator.NO_MORE_DOCS, so the code in the while loop is never
> executed. I tried this with a TermQuery instead of a PhraseQuery, and it
> works fine. So the problem appears to be in the implementation of
> PhraseQuery and the ExactPhraseScorer.
>
> I looked at the source code, and there seems to be a function in
> ExactPhraseScorer:
>
> private int PhraseFreq() { ... }
>
> It appears to be responsible for calculating the counts per document. The
> int[] fields Counts and Gens are also involved, but I don't understand
> what they do well enough to diagnose the problem.
>
> Any ideas?
>
> William
>
