lucene-java-user mailing list archives

From Benjamin Pasero <bpas...@rssowl.org>
Subject Re: Retrieving TermVectors from a Field over the full index?
Date Mon, 11 Jun 2007 21:32:32 GMT

Ah, I see. It was not obvious from the code that it behaves like that :)

By the way, my use case is simply to display the most frequent keywords
in a UI (you could call it a tag cloud).
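
For the tag-cloud step itself, here is a minimal sketch of picking the most
frequent keywords once the keyword -> docFreq map has been filled. The class
name TagCloud, the method topKeywords, and the tie-breaking rule (alphabetical)
are assumptions made for illustration; nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TagCloud {
    /**
     * Returns up to 'limit' entries of the given keyword -> document-frequency
     * map, most frequent first. Ties are broken alphabetically so the result
     * is deterministic (an assumption; any stable rule would do).
     */
    static List<Map.Entry<String, Integer>> topKeywords(Map<String, Integer> docFreqs, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqs.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue()); // descending frequency
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        int end = Math.max(0, Math.min(limit, entries.size()));
        return new ArrayList<>(entries.subList(0, end));
    }
}
```

Sorting the whole map is fine as long as the keyword field stays small; for a
very large vocabulary a bounded priority queue would avoid the full sort.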

Regards,
Ben
> No, the code I posted is not doing nearly as much work. Try it.
>
>        TermEnum te = this.reader.terms();
>        // skips terms NOT in the keyword field
>        te.skipTo(new Term("keyword", ""));
>        while (te.next()) {
>            Term term = te.term();
>            if (! term.field().equals("keyword")) {
>                // stops looking at terms as soon as the keywords field is exhausted
>                break;
>            }
>
> Your code examines all the terms in the index. Mine only looks at terms in
> the keywords field, which is what you said you wanted. That, combined with
> docFreq, might be what you need.
>
> Erick
>
>
>
> On 6/11/07, Benjamin Pasero <bpasero@rssowl.org> wrote:
>>
>>
>>
>> This is what I coded up till now:
>>
>> TermEnum terms = reader.terms();
>> while (terms.next()) {
>>   String field = terms.term().field();
>>   if ("keywords".equals(field))
>>     keywords.put(terms.term().text(), terms.docFreq());
>> }
>>
>> Your solution is doing more or less the same, right?
>>
>> Ben
>> > Maybe I'm missing the boat, but I don't understand why TermEnum doesn't
>> > work for you.
>> > Try something like...
>> >
>> >       TermEnum te = this.reader.terms();
>> >        te.skipTo(new Term("keyword", ""));
>> >        while (te.next()) {
>> >            Term term = te.term();
>> >            if (! term.field().equals("keyword")) {
>> >                break;
>> >            }
>> >            System.out.println(term.text());
>> >        }
>> >
>> >
>> >
>> > On 6/10/07, Benjamin Pasero <bpasero@rssowl.org> wrote:
>> >>
>> >>
>> >>
>> >> Erick Erickson wrote:
>> >> > Um, to return all counts of all terms in a field, what other option
>> >> > *is* there except to walk the whole thing?
>> >> >
>> >> > Have you looked at TermEnum, TermDocs, and TermFreqVector?
>> >> > For that matter, TermPositionVector might also be of some use.
>> >> >
>> >> > It would be easier to provide some help if you
>> >> > 1> mentioned what you'd tried already
>> >> > 2> mentioned what's inadequate about what you've tried.
>> >> Sorry for not being clear about what I am trying to achieve. I am storing
>> >> documents in my index that are made of 5 Fields. One of the Fields
>> >> contains keywords that describe the document. Now, I need a fast
>> >> way of retrieving these keywords together with their frequency from
>> >> the index.
>> >>
>> >> My current solution is to use IndexReader#terms() to walk over all
>> >> terms and count the ones that appear in the keyword-Field.
>> >>
>> >> As you can imagine, this does not scale well. The content in the keywords
>> >> field is usually quite small; however, the other fields may store
>> >> up to thousands of terms.
>> >>
>> >> What I am asking for is a way to walk all the terms of just the
>> >> keyword-field in order to avoid having to walk all terms in all fields.
>> >>
>> >> Of course, even better would be some API that would return a TermVector
>> >> from the keyword-field. But I guess TermVectors are only supported on a
>> >> per-document level and not on the index level?
>> >>
>> >> Regards,
>> >> Ben
>> >> >
>> >> > Best
>> >> > Erick
>> >> >
>> >> > On 6/9/07, Benjamin Pasero <bpasero@rssowl.org> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I wonder if this is possible:
>> >> >>
>> >> >> Return all Terms of a Field in the Index together with the number of
>> >> >> occurrences in all documents.
>> >> >>
>> >> >> E.g. have 10 Documents with the Field "author" in the index, 5 of them
>> >> >> having the value "foo" and 5 "bar". I would like to build a map with:
>> >> >>
>> >> >> [foo] -> 5
>> >> >> [bar] -> 5
>> >> >>
>> >> >> I looked at what Luke is doing to show the top terms of a given field in
>> >> >> the index and it seems to iterate over all terms (using
>> >> >> IndexReader#terms()). Isn't that quite inefficient? I would at least
>> >> >> expect a method IndexReader#terms(String field) to limit the terms to
>> >> >> the desired field.
>> >> >>
>> >> >> Thanks for helping,
>> >> >> Ben
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>
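
The field-restricted walk discussed in the quoted messages above relies on the
term dictionary being sorted by field first and by term text second, so all
terms of one field sit next to each other. The sketch below models that idea
with a plain TreeMap instead of a real index; TermKey, FieldTermWalk and
docFreqsForField are hypothetical names for illustration, not Lucene classes.
It seeks to the first entry at or after (field, ""), includes that entry, and
stops as soon as the field changes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;

final class FieldTermWalk {
    /** Hypothetical stand-in for an index term: ordered by field, then by text. */
    static final class TermKey implements Comparable<TermKey> {
        final String field;
        final String text;

        TermKey(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(TermKey other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /** Collects text -> docFreq for one field without touching other fields' terms. */
    static Map<String, Integer> docFreqsForField(NavigableMap<TermKey, Integer> dictionary,
                                                 String field) {
        Map<String, Integer> result = new LinkedHashMap<>();
        // Seek: view of the dictionary starting at the first key >= (field, ""), inclusive.
        NavigableMap<TermKey, Integer> tail = dictionary.tailMap(new TermKey(field, ""), true);
        for (Map.Entry<TermKey, Integer> entry : tail.entrySet()) {
            if (!entry.getKey().field.equals(field)) {
                break; // the field is exhausted, so stop instead of scanning the rest
            }
            result.put(entry.getKey().text, entry.getValue());
        }
        return result;
    }
}
```

One caveat, based on my reading of the Lucene 2.x API and worth checking against
your version: skipTo() already leaves the enumeration positioned on the first
matching term, so a following while (te.next()) loop may step past that term.
Opening the enumeration with IndexReader#terms(Term) and using a do/while loop
should include it.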


