lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Galloway <m...@thebasement.com>
Subject Re: TermFreqVector Beginner Question
Date Thu, 29 Jul 2004 15:31:23 GMT
Well, as one would expect most of the problems were me.  Here is what I
learned... (please comment on the accuracy of these statements).

1.) Setting storeTermVertor to true does nothing if store is false, i.e. 
   you must store the contents of a filed in order to retrieve TermVectors 
   for it later.  This may seem obvious to everyone else, but to a new user 
   this is anything but obvious as it is not documented anywhere that I have
   seen.  I think it would be very helpful to include this tidbit in the 
   Field class JavaDoc if not in the FAQ or some other place.

   I also think it would be helpful to prevent the user from combinations 
   of store and storeTermVector that don't make sense, namely store = false 
   and storeTermVector = true.  Maybe an exception or something.

2.) The following methods...

   Field.Text(String name, Reader value, boolean storeTermVector) 
   Field.UnStored(String name, String value, boolean storeTermVector) 

   DO NOT store the contents of the field and (based on my assumption in 
   point 1 and through observation) consequently DO NOT store TermVectors
   despite the value of their storeTermVector value.  If this is accurate,
   why do these methods exist?  This is very misleading to the new user.

3.) I am also new to Java so if you look at my earlier sample code you
   will see that I used "reader.toString()" where reader is a buffered file
   reader.  This of course is not the desired effect.  I have since rewritten
   the code to reflect the a string that contains the content of the file
   instead of some vector address thing.  This doesn't affect Lucene or
   term vectors, just my ego.

Once you understand that stor=true is ALSO a prerequisite for TermVectors (in
addition to storeTermVector=true) then everything works great.

Thanks for the help,

Matt Galloway


Quoting Grant Ingersoll <gsingers@syr.edu>:

> Can you post the whole section of related code?  Sounds like you are doing
> things right.  
> 
> In the Lucene source code, there is a file called TestTermVectors.java, take
> a look at that and see how your stuff compares.  I ran the test against the
> HEAD and it worked.
> 
> >>> matt@thebasement.com 07/28/04 04:51PM >>>
> 
> Howdy,
> 
> I am new to Lucene and thus far I am very impressed.  Thanks to all who
> have
> worked on this project!
> 
> I am working on a project where I want to do the following:
> 
> 1.) Index a bunch of document.
> 2.) Pluck out one of the doucments by Lucene document number
> 3.) Get a term frequency for that document
> 
> After some digging and playing I came across this method...
> 
>    IndexReader.getTermFreqVector(int docNumber, String field)
> 
> This is exactly what I want.  So I ran the IndexFiles demo program with
> some
> test documents and started poking at the index with an IndexReader. But when
> I
> called
> 
>    IndexReader.getTermFreqVector(someDocNumber,"contents")
> 
> I get NULL back.  After a little more digging I find that for a TermVector
> to
> exist the Field has to have the TermVector flag set.  So I changes some
> lines
> in the demo FileDocument.Document method to:
> 
>     FileInputStream is = new FileInputStream(f);
>     Reader reader = new BufferedReader(new InputStreamReader(is));
>     doc.add(Field.Text("contents", reader.toString(),true));
> 
> with the "true" parameter causing the new Field to turn on the
> storeTermVector
> flag, right? So then I reindex and get the same results - getTermFreqVector
> returns NULL.  So I inspect the field list of the Document from the index:
> 
>    Document d = ir.document(td.doc());
>    System.out.println("  Path: "+d.get("path"));
>    for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 
>    {
>       System.out.println(((Field)e.nextElement()).toString());
>    }
> 
> and I discover that there is now NO "contents" Field.  If I change the
> paramter
> in Field.Text to false, I get a "contents" Field but no TermVector.  To date
> I
> haven't been able to figure out how to get a TermFreqVector at all.
> 
> What am I missing?
> 
> I have looked at the documents - all the tutorials I have found just cover
> the
> basics.
> 
> I have read the news group postings related to "TermVectors" and
> "TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff
> is
> great".  So how do they know?  Where can I learn about this?  Are there any
> more
> complete user tutorials/references that cover TermVector features?
> 
> Oh, I am using the 1.4 Lucene release in case it matters.
> 
> Thanks in advance,
> 
> Matt Galloway
> Tulsa, Oklahoma
> 
> 
> (BTW, I also tired Field.UnStored with the same results.)

-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message