lucene-java-user mailing list archives

From Benjamin Pasero <bpas...@rssowl.org>
Subject Re: Retrieving TermVectors from a Field over the full index?
Date Sun, 10 Jun 2007 17:44:48 GMT


Erick Erickson wrote:
> Um, to return all counts of all terms in a field, what other option
> *is* there except to walk the whole thing?
>
> Have you looked at TermEnum, TermDocs, and TermFreqVector?
> For that matter, TermPositionVector might also be of some use.
>
> It would be easier to provide some help if you
> 1> mentioned what you'd tried already
> 2> mentioned what's inadequate about what you've tried.
Sorry for not being clear about what I am trying to achieve. I am storing
documents in my index that are made up of 5 Fields. One of the Fields
contains keywords that describe the document. Now I need a fast
way of retrieving these keywords, together with their frequencies, from
the index.

My current solution is to use IndexReader#terms() to walk over all
terms and count the ones that appear in the keyword-Field.
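
To make that concrete, here is a minimal sketch of the approach. It does
not use Lucene at all: a sorted map stands in for the index's term
dictionary, and the names FullWalk, FieldTerm, sampleIndex and
countByFullWalk are made up for this illustration, as are the sample
fields and counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class FullWalk {
    // Hypothetical stand-in for a term: a (field, text) pair,
    // ordered by field first and then by text.
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(FieldTerm other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }
    }

    // Toy "index": each term mapped to the number of documents containing it.
    static TreeMap<FieldTerm, Integer> sampleIndex() {
        TreeMap<FieldTerm, Integer> index = new TreeMap<>();
        index.put(new FieldTerm("body", "index"), 9);
        index.put(new FieldTerm("body", "lucene"), 7);
        index.put(new FieldTerm("keywords", "bar"), 5);
        index.put(new FieldTerm("keywords", "foo"), 5);
        index.put(new FieldTerm("title", "hello"), 2);
        return index;
    }

    // Visits every term of every field and keeps only those of the wanted field.
    static Map<String, Integer> countByFullWalk(TreeMap<FieldTerm, Integer> index,
                                                String field) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<FieldTerm, Integer> e : index.entrySet()) {
            if (e.getKey().field.equals(field)) {
                counts.put(e.getKey().text, e.getValue());
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByFullWalk(sampleIndex(), "keywords"));
    }
}
```

The loop touches every entry, so its cost grows with the total number of
terms in the index rather than with the number of keywords.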

As you can imagine, this does not scale well. The content of the keywords
field is usually quite small, but the other fields may store thousands
of terms, and the walk visits all of them.

What I am asking for is a way to walk the terms of just the keyword field,
so that I do not have to walk all terms in all fields.
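
As a sketch of the behavior I am after, again on a toy sorted map rather
than on Lucene: if terms are ordered by field first, one can jump to the
first term of the wanted field and stop as soon as the field changes. The
names FieldWalk, FieldTerm, sampleIndex and countBySeek are hypothetical,
and the assumption that terms are ordered by field belongs to the toy
model. Whether Lucene offers an equivalent seek is exactly my question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class FieldWalk {
    // Hypothetical stand-in for a term: a (field, text) pair,
    // ordered by field first and then by text.
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(FieldTerm other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }
    }

    // Toy "index": each term mapped to the number of documents containing it.
    static TreeMap<FieldTerm, Integer> sampleIndex() {
        TreeMap<FieldTerm, Integer> index = new TreeMap<>();
        index.put(new FieldTerm("body", "index"), 9);
        index.put(new FieldTerm("body", "lucene"), 7);
        index.put(new FieldTerm("keywords", "bar"), 5);
        index.put(new FieldTerm("keywords", "foo"), 5);
        index.put(new FieldTerm("title", "hello"), 2);
        return index;
    }

    // Seeks to the first term of the field, then reads until the field changes.
    static Map<String, Integer> countBySeek(TreeMap<FieldTerm, Integer> index,
                                            String field) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // The empty string sorts before any other text, so this key is
        // less than or equal to every term of the field.
        FieldTerm start = new FieldTerm(field, "");
        for (Map.Entry<FieldTerm, Integer> e : index.tailMap(start, true).entrySet()) {
            if (!e.getKey().field.equals(field)) {
                break; // past the last term of the wanted field
            }
            counts.put(e.getKey().text, e.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countBySeek(sampleIndex(), "keywords"));
    }
}
```

Here the loop only touches the terms of the keyword field plus one term of
the following field, no matter how many terms the other fields hold.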

Even better, of course, would be an API that returns a TermVector for the
keyword field across the whole index. But I guess TermVectors are only
supported at the per-document level and not at the index level?

Regards,
Ben
>
> Best
> Erick
>
> On 6/9/07, Benjamin Pasero <bpasero@rssowl.org> wrote:
>>
>> Hi,
>>
>> I wonder if this is possible:
>>
>> Return all Terms of a Field in the Index together with the number of
>> occurrences in all documents.
>>
>> E.g. given 10 Documents with the Field "author" in the index, 5 of them
>> having the value "foo" and 5 "bar", I would like to build a map with:
>>
>> [foo] -> 5
>> [bar] -> 5
>>
>> I looked at what Luke is doing to show the top terms of a given field in
>> the index, and it seems to iterate over all terms (using
>> IndexReader#terms()). Isn't that quite inefficient? I would at least
>> expect a method IndexReader#terms(String field) to limit the terms to
>> the desired field.
>>
>> Thanks for helping,
>> Ben
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


