lucene-java-user mailing list archives

From wal...@Cyveillance.com
Subject RE: TermFreqVector Beginner Question
Date Wed, 28 Jul 2004 21:38:58 GMT
Are you certain that you are storing the field "contents" in your documents,
not just tokenizing it?

If you use the overloaded method that takes a Reader, the text is tokenized
and indexed but the content itself is not stored.
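
One thing to check in the quoted code below: reader.toString() does not read
the file.  BufferedReader inherits toString() from Object, so the call returns
the class name plus a hash code, and that is what would be handed to the
field.  A minimal plain-Java sketch of reading a Reader into a String first;
the class name ReaderContents and the helper readAll are hypothetical, and no
Lucene call is made here (the Field.Text(String, String, boolean) form is
taken from the quoted message, not verified in this sketch):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class; not part of Lucene or its demo code.
class ReaderContents {

    /** Reads every character from the reader and returns it as one String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new BufferedReader(new StringReader("hello term vectors"));

        // Inherited from Object: class name plus hash code, not the text.
        System.out.println(reader.toString());

        // Reading the stream yields the actual text.
        System.out.println(readAll(reader));
    }
}
```

The String returned by readAll is what one would pass as the field value in
place of reader.toString(), assuming the file is small enough to hold in
memory.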

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@syr.edu]
Sent: Wednesday, July 28, 2004 5:35 PM
To: lucene-user@jakarta.apache.org
Subject: Re: TermFreqVector Beginner Question


Can you post the whole section of related code?  Sounds like you are doing
things right.  

In the Lucene source code there is a file called TestTermVectors.java; take
a look at that and see how your code compares.  I ran the test against the
HEAD and it worked.

>>> matt@thebasement.com 07/28/04 04:51PM >>>

Howdy,

I am new to Lucene and thus far I am very impressed.  Thanks to all who have
worked on this project!

I am working on a project where I want to do the following:

1.) Index a bunch of documents.
2.) Pluck out one of the documents by Lucene document number
3.) Get a term frequency for that document
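
For reference, the per-document term frequency in step 3 amounts to a map from
each term to the number of times it occurs.  A minimal plain-Java sketch,
assuming lowercasing and whitespace tokenization (Lucene's analyzers tokenize
differently, and this uses no Lucene API; the class and method names are
hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not how Lucene builds term vectors.
class TermFrequencySketch {

    /** Counts how often each lowercased, whitespace-separated token occurs. */
    static Map<String, Integer> termFrequencies(String text) {
        // TreeMap keeps the terms in sorted order.
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace produces one empty token
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // prints {cat=2, other=1, saw=1, the=2}
        System.out.println(termFrequencies("the cat saw the other cat"));
    }
}
```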

After some digging and playing I came across this method...

   IndexReader.getTermFreqVector(int docNumber, String field)

This is exactly what I want.  So I ran the IndexFiles demo program with some
test documents and started poking at the index with an IndexReader.  But when
I called

   IndexReader.getTermFreqVector(someDocNumber,"contents")

I get null back.  After a little more digging I found that for a TermVector
to exist, the Field has to have the TermVector flag set.  So I changed some
lines in the demo FileDocument.Document method to:

    FileInputStream is = new FileInputStream(f);
    Reader reader = new BufferedReader(new InputStreamReader(is));
    doc.add(Field.Text("contents", reader.toString(),true));

with the "true" parameter causing the new Field to turn on the
storeTermVector flag, right?  So then I reindexed and got the same results:
getTermFreqVector returns null.  So I inspected the field list of the
Document from the index:

   Document d = ir.document(td.doc());
   System.out.println("  Path: "+d.get("path"));
   for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 
   {
      System.out.println(((Field)e.nextElement()).toString());
   }

and I discovered that there is now NO "contents" Field.  If I change the
parameter in Field.Text to false, I get a "contents" Field but no TermVector.
To date I haven't been able to figure out how to get a TermFreqVector at all.

What am I missing?

I have looked at the documentation; all the tutorials I have found just cover
the basics.

I have read the newsgroup postings related to "TermVectors" and
"TermFreqVectors", and everybody says things like "the new 1.4 Vector stuff
is great".  So how do they know?  Where can I learn about this?  Are there
any more complete user tutorials/references that cover TermVector features?

Oh, I am using the 1.4 Lucene release in case it matters.

Thanks in advance,

Matt Galloway
Tulsa, Oklahoma


(BTW, I also tried Field.UnStored with the same results.)



-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/ 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org 
For additional commands, e-mail: lucene-user-help@jakarta.apache.org 


