lucene-java-user mailing list archives

From Doron Cohen <DOR...@il.ibm.com>
Subject Re: Extracting data from Lucene index files
Date Thu, 21 Dec 2006 06:25:51 GMT
Using term vectors means passing over the terms too many times, i.e.:
- loop on terms
- - loop on docs of a term
- - - loop on terms of a doc

Would something like this be better:
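    // Assumes tenum was obtained from reader.terms(). Both enumerations
    // start unpositioned, so next() must be called before term() or doc().
    TermEnum tenum = reader.terms();
    while (tenum.next()) {
      System.out.println(tenum.term() + " appears in " + tenum.docFreq() + " docs!");
      TermDocs td = reader.termDocs(tenum.term());
      while (td.next()) {
        System.out.println("  In doc id: " + td.doc() + " it appears: " + td.freq() + " times");
      }
      td.close();
    }
    tenum.close();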
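
To see why the two-level loop is enough, here is a rough, self-contained sketch that does not use Lucene at all. It models the index as a plain map from term to (doc id, frequency) postings; the class and method names (PostingsSketch, buildIndex, report) are made up for illustration and the whitespace tokenizer is an assumption. Each posting already carries the frequency of that term in that doc, so each (term, doc) pair is visited exactly once and no third loop over a doc's terms is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these names are hypothetical and are not Lucene classes.
class PostingsSketch {
    // Builds term -> (docId -> frequency of the term in that doc), both sorted.
    static Map<String, TreeMap<Integer, Integer>> buildIndex(String[] docs) {
        Map<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // Assumption: lowercase, whitespace-separated tokens.
            for (String token : docs[docId].toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    // One header line per term, then one line per (term, doc) posting.
    static List<String> report(Map<String, TreeMap<Integer, Integer>> index) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeMap<Integer, Integer>> e : index.entrySet()) {
            lines.add(e.getKey() + " appears in " + e.getValue().size() + " docs");
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                lines.add("  doc " + p.getKey() + ": " + p.getValue() + " times");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] docs = { "a rose is a rose", "a daisy" };
        report(buildIndex(docs)).forEach(System.out::println);
    }
}
```

The outer loop plays the role of the term enumeration and the inner loop the role of the per-term doc enumeration; the frequency is read straight off the posting.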


Also, you can skip faster to a certain doc (id) or certain term using the
skipTo() methods.
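
As a rough illustration of what skipping buys you, the sketch below is a toy cursor over one term's sorted doc ids. It is not Lucene code: DocIdCursor is a hypothetical class, and the semantics it imitates (move to the first entry beyond the current one whose doc id is greater than or equal to the target, return false if there is none) are my reading of TermDocs.skipTo(int). The binary search is only a stand-in for whatever the index does internally.

```java
import java.util.Arrays;

// Hypothetical toy cursor over one term's sorted doc ids; not a Lucene class.
class DocIdCursor {
    private final int[] docIds; // ascending, no duplicates
    private int pos = -1;       // unpositioned until next() or skipTo() succeeds

    DocIdCursor(int[] sortedDocIds) {
        this.docIds = sortedDocIds;
    }

    // Advance by one entry; false once the list is exhausted.
    boolean next() {
        if (pos + 1 >= docIds.length) {
            pos = docIds.length;
            return false;
        }
        pos++;
        return true;
    }

    // Jump forward to the first doc id >= target that lies beyond the
    // current position, using binary search instead of repeated next().
    boolean skipTo(int target) {
        int from = pos + 1;
        if (from >= docIds.length) {
            pos = docIds.length;
            return false;
        }
        int i = Arrays.binarySearch(docIds, from, docIds.length, target);
        pos = (i >= 0) ? i : -i - 1; // insertion point when target is absent
        return pos < docIds.length;
    }

    // Only valid after next() or skipTo() has returned true.
    int doc() {
        return docIds[pos];
    }

    public static void main(String[] args) {
        DocIdCursor cursor = new DocIdCursor(new int[] { 2, 5, 9, 14 });
        if (cursor.skipTo(6)) {
            System.out.println("landed on doc " + cursor.doc());
        }
    }
}
```

With a cursor like this, checking whether a term occurs in one particular doc becomes a single skipTo(docId) followed by a comparison of doc() with docId, rather than a walk over every posting.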

Doron

Venkateshprasanna <prasannahmv@yahoo.co.in> wrote on 19/12/2006 19:20:52:
>
> > Take a look at TermDocs and TermEnum.
>
> I need to get the frequency of each word in each of the documents I have
> indexed.
>
> This is what I could do with TermEnums and TermDocs. For each Term from
> TermEnum, I have instantiated a TermDocs and for each doc, I am trying to
> get the frequency of the Term.
>
>     IndexReader ir = IndexReader.open("index file");
>     TermEnum terms = ir.terms();
>     while(terms.next()) {
>         TermDocs docs = ir.termDocs(terms.term());
>
>         while(docs.next()) {
>             TermFreqVector tfv = ir.getTermFreqVector(docs.doc(),"contents");
>             String indexTerms[] = tfv.getTerms();
>             int indexFreqs[] = tfv.getTermFrequencies();
>
>             for(int i = 0; i<indexTerms.length; i++) {
>                System.out.println(indexTerms[i]+" "+indexFreqs[i]);
>             }
>          }
>      }
>
> But there is no way of getting the frequency of only 'that' term in 'that'
> document. I have to get the entire vector. This puts the loop in jeopardy.
> How can I overcome this?
>
> --
> View this message in context: http://www.nabble.com/Extracting-data-from-Lucene-index-files-tf2813318.html#a7984092
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

