lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From app...@dsl.pipex.com
Subject Re: Using a TermFreqVector to get counts of all words in a document
Date Thu, 21 Oct 2010 11:47:45 GMT
Would you have an example of this or be able to point me in the direction of an example at
all?

Quoting Grant Ingersoll <gsingers@apache.org>:

> 
> On Oct 20, 2010, at 4:40 PM, Martin O'Shea wrote:
> 
> >
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/201010.mbox/%3c128
> > 7065863.4cb711077458e@netmail.pipex.net%3e will give you a better idea of
> > what I'm moving towards.
> > 
> > It's all a bit grey at the moment so further investigation is inevitable.
> > 
> > I expect that a combination of MySQL database storage and Lucene indexing
> is
> > going to be the end result.
> 
> I'd likely take the TermVectorMapper approach, but otherwise, yeah, I think
> you are on the right track.
> 
> 
> > 
> > 
> > 
> > -----Original Message-----
> > From: Grant Ingersoll [mailto:gsingers@apache.org] 
> > Sent: 20 Oct 2010 21 20
> > To: java-user@lucene.apache.org
> > Subject: Re: Using a TermFreqVector to get counts of all words in a
> document
> > 
> > 
> > On Oct 20, 2010, at 2:53 PM, Martin O'Shea wrote:
> > 
> >> Uwe
> >> 
> >> Thanks - I figured that bit out. I'm a Lucene 'newbie'.
> >> 
> >> What I would like to know though is if it is practical to search a single
> >> document of one field simply by doing this:
> >> 
> >> IndexReader trd = IndexReader.open(index);
> >>       TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
> >>       String[] terms = tfv.getTerms();
> >>       int[] freqs = tfv.getTermFrequencies();
> >>       for (int i = 0; i < tfv.getTerms().length; i++) {
> >>           System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
> >>       }
> >>       trd.close();
> >> 
> >> where docId is set to 0.
> >> 
> >> The code works but can this be improved upon at all?
> >> 
> >> My situation is where I don't want to calculate the number of documents
> > with
> >> a particular string. Rather I want to get counts of individual words in a
> >> field in a document. So I can concatenate the strings before passing it
> to
> >> Lucene.
> > 
> > Can you describe the bigger problem you are trying to solve?  This looks
> > like a classic XY problem: http://people.apache.org/~hossman/#xyproblem
> > 
> > What you are doing above will work OK for what you describe (up to the
> > "passing it to Lucene" part), but you probably should explore the use of
> the
> > TermVectorMapper which provides a callback mechanism (similar to a SAX
> > parser) that will allow you to build your data structures on the fly
> instead
> > of having to serialize them into two parallel arrays and then loop over
> > those arrays to create some other structure.
> > 
> > 
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 


-- 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message