lucene-java-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: Lucene: how to get frequency of Boolean query
Date Mon, 27 Dec 2010 02:58:42 GMT
Do you mean getting the tf of the hit documents while searching?
That is a difficult problem, because only TermScorer holds a TermDocs and uses
the tf in its score() function.
Even there, we cannot know whether a document will end up in the results,
because TopScoreDocCollector keeps the hits in a priority queue:

  public void collect(int doc) throws IOException {
      float score = scorer.score();

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = (ScoreDoc) pq.updateTop();
    }
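
To make the point about the priority queue concrete, here is a minimal sketch in plain Java. It is not Lucene code: TopNWithFreq, Hit, and the freq argument to collect() are hypothetical. A real Collector.collect(int doc) receives only the doc id, so passing a tf in would need the scorer changes discussed in this message. The sketch only shows that a document accepted at collect() time can be evicted later, so any tf recorded for it has to travel with the queue entry.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a top-N collector; this is not Lucene code. */
class TopNWithFreq {

    /** One queue entry: doc id, score, and the tf seen when the doc was collected. */
    static final class Hit {
        final int doc;
        final float score;
        final int freq;

        Hit(int doc, float score, int freq) {
            this.doc = doc;
            this.score = score;
            this.freq = freq;
        }
    }

    private final int n;

    // Min-heap on score: the head is always the weakest hit currently kept.
    private final PriorityQueue<Hit> pq =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    TopNWithFreq(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    /** freq is an assumed extra argument; a real Collector only sees the doc id. */
    void collect(int doc, float score, int freq) {
        if (pq.size() < n) {
            pq.add(new Hit(doc, score, freq));
        } else if (score > pq.peek().score) {
            pq.poll(); // evict the weakest hit; its recorded freq is dropped with it
            pq.add(new Hit(doc, score, freq));
        }
        // Otherwise the doc is not competitive, like the early return in collect() above.
    }

    /** Maps each surviving doc id to the tf recorded for it. */
    Map<Integer, Integer> keptFreqs() {
        Map<Integer, Integer> kept = new HashMap<>();
        for (Hit h : pq) {
            kept.put(h.doc, h.freq);
        }
        return kept;
    }

    public static void main(String[] args) {
        TopNWithFreq top = new TopNWithFreq(2);
        top.collect(1, 0.5f, 3);
        top.collect(2, 0.9f, 1);
        top.collect(3, 0.7f, 4); // evicts doc 1, which had been accepted earlier
        System.out.println(top.keptFreqs().containsKey(1)); // false
    }
}
```

Doc 1 is accepted first and only dropped when doc 3 arrives, which is why a tf captured during scoring cannot be reported until collection has finished.
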
Because the scorer in BooleanScorer2 is a ConjunctionScorer (or a
disjunction scorer), it cannot get the tf.
If you want to do this, you will have to modify a lot of code.
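
Short of modifying the scorers, the per-hit lookup in the quoted code can be made much cheaper. Each List.indexOf call there is a linear scan, repeated several times per hit, so with 80k hits the loop is roughly quadratic; that is probably a large part of the 160 seconds. Note also that termDocs.read() fills at most as many entries as the arrays hold, so arrays sized to docs.totalHits may not cover a term's whole posting list. A minimal sketch of the cheaper lookup, in plain Java with posting lists modelled as sorted arrays (HitTermFrequencies and its methods are hypothetical names, not Lucene API):

```java
import java.util.Arrays;

/**
 * Sketch only: each posting list is modelled as two parallel arrays
 * (doc ids sorted ascending, plus the tf for each doc), standing in
 * for what a TermDocs would iterate over.
 */
class HitTermFrequencies {

    /** Returns the tf of the term in docId, or 0 if the term does not occur there. */
    static int freqOf(int[] postingDocs, int[] postingFreqs, int docId) {
        // Binary search is valid only because postingDocs is sorted ascending.
        int pos = Arrays.binarySearch(postingDocs, docId);
        return pos >= 0 ? postingFreqs[pos] : 0;
    }

    /**
     * For each hit, sums the frequencies of all query terms.
     * docsPerTerm[t] and freqsPerTerm[t] describe the posting list of term t.
     */
    static int[] totalFreqs(int[] hitDocs, int[][] docsPerTerm, int[][] freqsPerTerm) {
        int[] totals = new int[hitDocs.length];
        for (int i = 0; i < hitDocs.length; i++) {
            for (int t = 0; t < docsPerTerm.length; t++) {
                totals[i] += freqOf(docsPerTerm[t], freqsPerTerm[t], hitDocs[i]);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        int[][] docs  = { {1, 4, 7, 9}, {2, 4, 9} }; // posting lists for "sql", "server"
        int[][] freqs = { {3, 1, 2, 5}, {1, 2, 4} };
        int[] hits = {9, 4};                         // hits arrive in score order
        System.out.println(Arrays.toString(totalFreqs(hits, docs, freqs))); // [9, 3]
    }
}
```

Against a real index, the same idea would be to sort the hits by doc id and advance each TermDocs with skipTo() instead of copying the postings into lists; that variant is not tested here.
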
2010/12/25 Ranjit Kumar <Ranjit.Kumar@otssolutions.com>

>  Hi,
>
>     Merry Christmas!!
>
>
>
> In the case of a Boolean query like 'sql AND server':
>
> I am using the parser to get the correct documents containing both sql and
> server. Inside the for loop in the code below I get the correct document IDs,
> and to get the frequency I need to sum the frequencies of 'sql' and 'server'
> individually with the help of termDocs.read().
>
> I am searching through millions of documents, so calculating the frequency
> takes about 160 seconds for 80k documents.
>
> Is there any way to get the frequency of a Boolean query directly, without
> this manipulation? It takes a lot of time. For single-term and phrase
> queries, I got the frequency for 80k documents within 10 seconds.
>
>
>
> QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field, analyzer);
> parser.setDefaultOperator(QueryParser.AND_OPERATOR);
> Query query = parser.parse("sql AND server");
> TopDocs docs = searcher.search(query, null, n);
>
> TermDocs termDocs = reader.termDocs();
> termDocs.seek(new Term(field, query.toString(field).split(" ")[0].toLowerCase()));
> int[] ints = new int[docs.totalHits];
> int[] ints1 = new int[docs.totalHits];
> termDocs.read(ints, ints1);
> List lsts = Arrays.asList(ArrayUtils.toObject(ints));
> List lsts1 = Arrays.asList(ArrayUtils.toObject(ints1));
>
> termDocs.seek(new Term(field, query.toString(field).split(" ")[1].toLowerCase()));
> int[] inta = new int[docs.totalHits];
> int[] inta1 = new int[docs.totalHits];
> termDocs.read(inta, inta1);
> List lsta = Arrays.asList(ArrayUtils.toObject(inta));
> List lsta1 = Arrays.asList(ArrayUtils.toObject(inta1));
>
> int totalFreq = 0;
> int docId = -1;
> String a = null;
> int contId = 0;
> String path = null;
> for (int i = 0; i < docs2.scoreDocs.length; i++) {
>     docId = docs2.scoreDocs[i].doc;
>     path = reader.document(docId).get("path");
>     a = path.substring(path.lastIndexOf("\\") + 1, path.lastIndexOf("."));
>     try {
>         if (a.indexOf("e") > -1) {
>             contId = Integer.parseInt(a.substring(0, a.length() - 1));
>         } else {
>             contId = Integer.parseInt(a);
>         }
>     } catch (Exception e) {
>         // e.printStackTrace();
>     }
>
>     if ((lsts.indexOf(docId) > -1 && lsta.indexOf(docId) > -1) || (lsta.indexOf(docId) > -1) && lsts.indexOf(docId) > -1) {
>         totalFreq = (Integer) lsts1.get(lsts.indexOf(docId)) + (Integer) lsta1.get(lsta.indexOf(docId));
>     } else if (lsts.indexOf(docId) > -1) {
>         totalFreq = (Integer) lsts1.get(lsts.indexOf(docId));
>     } else if (lsta.indexOf(docId) > -1) {
>         totalFreq = (Integer) lsta1.get(lsta.indexOf(docId));
>     }
>
>     w.write(contId + "\t" + ID + "\t" + totalFreq + "\t" + reader.document(docId).get("path") + "\n");
> }
>
>
>
>
>
> Thanks & Regards,
>
> Ranjit Kumar
>
> Associate Software Engineer
