lucene-java-user mailing list archives

Subject RE: TermFreqVector Beginner Question
Date Wed, 28 Jul 2004 21:38:58 GMT
Are you certain that you are storing the field "contents" in your documents,
not just tokenizing it?

If you use the overloaded method that takes a Reader, the field is tokenized
and indexed but its content is not stored, so you lose the content.
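
A related point about the snippet quoted below: calling reader.toString() on a
java.io.Reader does not read the file. It returns the default object
description (class name plus hash code). If the goal is to pass the file text
to the String overload of Field.Text that the original post uses, the text has
to be read out of the Reader first. Here is a minimal plain-Java sketch of
that step. The class name ReaderContents and the helper readAll are
hypothetical names made up for this illustration; they are not part of Lucene.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of Lucene.
class ReaderContents {

    /** Drains the reader and returns everything it produced as one String. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to declare IOException.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the FileInputStream-backed reader.
        Reader reader = new BufferedReader(new StringReader("some file text"));
        // Object.toString(): a class name and hash code, not the text.
        System.out.println(reader.toString());
        // The actual characters held by the reader.
        System.out.println(readAll(reader));
    }
}
```

The String returned by readAll could then be handed to the String-valued
overload in place of reader.toString(). Reading a whole file into memory is
only reasonable for small test documents.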

-----Original Message-----
From: Grant Ingersoll
Sent: Wednesday, July 28, 2004 5:35 PM
Subject: Re: TermFreqVector Beginner Question

Can you post the whole section of related code?  Sounds like you are doing
things right.  

In the Lucene source code, there is a file called, take
a look at that and see how your stuff compares.  I ran the test against the
HEAD and it worked.

>>> 07/28/04 04:51PM >>>


I am new to Lucene and thus far I am very impressed.  Thanks to all who have
worked on this project!

I am working on a project where I want to do the following:

1.) Index a bunch of documents.
2.) Pluck out one of the documents by Lucene document number
3.) Get a term frequency for that document
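
To make step 3 concrete: a term frequency for one document is essentially a
table mapping each term to the number of times it occurs. The following is a
plain-Java sketch of that idea only. It does not use Lucene, and the class
name, method name and whitespace tokenization are assumptions made for
illustration; a real analyzer tokenizes differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a per-document term frequency table.
// This is NOT how Lucene builds or stores term vectors.
class TermFrequencySketch {

    /** Counts how often each lowercased, whitespace-separated token occurs. */
    static Map<String, Integer> termFrequencies(String text) {
        // TreeMap keeps the terms in sorted order.
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token for leading whitespace
            }
            // Insert 1 for a new term, otherwise add 1 to the existing count.
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("the quick fox jumps over the lazy dog"));
    }
}
```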

After some digging and playing I came across this method...

   IndexReader.getTermFreqVector(int docNumber, String field)

This is exactly what I want.  So I ran the IndexFiles demo program with some
test documents and started poking at the index with an IndexReader. But when


I get NULL back.  After a little more digging I find that for a TermVector to
exist the Field has to have the TermVector flag set.  So I changed some code
in the demo FileDocument.Document method to:

    FileInputStream is = new FileInputStream(f);
    Reader reader = new BufferedReader(new InputStreamReader(is));
    doc.add(Field.Text("contents", reader.toString(),true));

with the "true" parameter causing the new Field to turn on the TermVector
flag, right?  So then I reindex and get the same results - getTermFreqVector
returns NULL.  So I inspect the field list of the Document from the index:

   Document d = ir.document(td.doc());
   System.out.println("  Path: "+d.get("path"));
   for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 

and I discover that there is now NO "contents" Field.  If I change the "true"
in Field.Text to false, I get a "contents" Field but no TermVector.  To date I
haven't been able to figure out how to get a TermFreqVector at all.

What am I missing?

I have looked at the documents - all the tutorials I have found just cover

I have read the news group postings related to "TermVectors" and
"TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff is
great".  So how do they know?  Where can I learn about this?  Are there any
complete user tutorials/references that cover TermVector features?

Oh, I am using the 1.4 Lucene release in case it matters.

Thanks in advance,

Matt Galloway
Tulsa, Oklahoma

(BTW, I also tried Field.UnStored with the same results.)

This mail sent through IMP: 

----- End forwarded message -----

To unsubscribe, e-mail: 
For additional commands, e-mail: 
