lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Using a TermFreqVector to get counts of all words in a document
Date Fri, 22 Oct 2010 14:09:04 GMT
http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/
has an example of implementing a TermVectorMapper.  There are also several implementations
included in the Lucene codebase.

All it really does is give you a callback as it is reading the term vector data from the
Directory, and then you can massage the data as you see fit.
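
To illustrate the callback idea only, here is a minimal, self-contained sketch. It does not use Lucene: TermCallback, CountingCallback and readTerms are hypothetical stand-ins invented for this example, and the real TermVectorMapper in Lucene may hand the callback more than just a term and its frequency. The point is that the structure you want (here a term-to-count map) is filled in as each term arrives, rather than being rebuilt afterwards from two parallel arrays.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermCountCallbackSketch {

    /** Hypothetical stand-in for a term-vector callback; NOT Lucene's TermVectorMapper. */
    interface TermCallback {
        void map(String term, int frequency);
    }

    /** Builds a term -> count map on the fly, one callback invocation per term. */
    static class CountingCallback implements TermCallback {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void map(String term, int frequency) {
            // Sum frequencies in case the same term is reported more than once.
            counts.merge(term, frequency, Integer::sum);
        }
    }

    /**
     * Simulates a reader walking stored term data and invoking the callback
     * once per term. In Lucene this role would be played by the IndexReader.
     */
    static void readTerms(String[] terms, int[] freqs, TermCallback callback) {
        for (int i = 0; i < terms.length; i++) {
            callback.map(terms[i], freqs[i]);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one field of one document.
        String[] terms = {"lucene", "term", "vector"};
        int[] freqs = {2, 1, 3};

        CountingCallback callback = new CountingCallback();
        readTerms(terms, freqs, callback);
        System.out.println(callback.counts);
    }
}
```

With the real API the shape would be similar: you would subclass TermVectorMapper and pass an instance to the reader, which then drives the callbacks. Check the Javadoc for your Lucene version for the exact method signatures.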

On Oct 21, 2010, at 7:47 AM, appy74@dsl.pipex.com wrote:

> Would you have an example of this or be able to point me in the direction of an example at all?
> 
> Quoting Grant Ingersoll <gsingers@apache.org>:
> 
>> 
>> On Oct 20, 2010, at 4:40 PM, Martin O'Shea wrote:
>> 
>>> 
>>> http://mail-archives.apache.org/mod_mbox/lucene-java-user/201010.mbox/%3c1287065863.4cb711077458e@netmail.pipex.net%3e
>>> will give you a better idea of what I'm moving towards.
>>> 
>>> It's all a bit grey at the moment so further investigation is inevitable.
>>> 
>>> I expect that a combination of MySQL database storage and Lucene indexing
>>> is going to be the end result.
>> 
>> I'd likely take the TermVectorMapper approach, but otherwise, yeah, I think
>> you are on the right track.
>> 
>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org] 
>>> Sent: 20 Oct 2010 21:20
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Using a TermFreqVector to get counts of all words in a document
>>> 
>>> 
>>> On Oct 20, 2010, at 2:53 PM, Martin O'Shea wrote:
>>> 
>>>> Uwe
>>>> 
>>>> Thanks - I figured that bit out. I'm a Lucene 'newbie'.
>>>> 
>>>> What I would like to know though is if it is practical to search a single
>>>> document of one field simply by doing this:
>>>> 
>>>> IndexReader trd = IndexReader.open(index);
>>>> TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
>>>> String[] terms = tfv.getTerms();
>>>> int[] freqs = tfv.getTermFrequencies();
>>>> for (int i = 0; i < terms.length; i++) {
>>>>     System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
>>>> }
>>>> trd.close();
>>>> 
>>>> where docId is set to 0.
>>>> 
>>>> The code works but can this be improved upon at all?
>>>> 
>>>> My situation is where I don't want to calculate the number of documents
>>>> with a particular string. Rather I want to get counts of individual words in a
>>>> field in a document. So I can concatenate the strings before passing it
>>>> to Lucene.
>>> 
>>> Can you describe the bigger problem you are trying to solve?  This looks
>>> like a classic XY problem: http://people.apache.org/~hossman/#xyproblem
>>> 
>>> What you are doing above will work OK for what you describe (up to the
>>> "passing it to Lucene" part), but you probably should explore the use of
>>> the TermVectorMapper which provides a callback mechanism (similar to a SAX
>>> parser) that will allow you to build your data structures on the fly
>>> instead of having to serialize them into two parallel arrays and then loop over
>>> those arrays to create some other structure.
>>> 
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

