lucene-java-user mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: Retrieving TermVectors from a Field over the full index?
Date Mon, 11 Jun 2007 08:50:41 GMT

I see the issue.  If your "keywords" field has a tiny set of terms
(say 1000) vs the rest of your fields (say 1 million), then a linear
scan over all terms is a very slow way to reach the terms for the
field "keywords".

It seems like what's missing is an efficient way for
TermDocs/TermInfosReader to seek to the first term of a given field.
They can already seek to a specific term; with some small changes to
look only at the field name part of a Term, TermInfosReader could
presumably be made to seek to fields as well.
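
To make the idea concrete, here is a minimal sketch of seeking by field in a sorted term dictionary. It does not use any Lucene classes: a TreeMap stands in for the term dictionary, with keys ordered by field name and then by term text. The class FieldSeekSketch, the key encoding with a NUL separator, and the method termsForField are all assumptions invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class FieldSeekSketch {
    // Hypothetical separator: NUL sorts below ordinary characters, so all
    // terms of one field are contiguous and ordered by text.
    static final char SEP = '\u0000';

    static String key(String field, String text) {
        return field + SEP + text;
    }

    // Seek to the first term of `field` via the sorted map, then walk
    // forward until the field changes. Returns term text -> docFreq.
    static Map<String, Integer> termsForField(NavigableMap<String, Integer> dict,
                                              String field) {
        String prefix = field + SEP;
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : dict.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the field; nothing further can match
            }
            result.put(e.getKey().substring(prefix.length()), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> dict = new TreeMap<>();
        dict.put(key("body", "lucene"), 40);
        dict.put(key("keywords", "bar"), 5);
        dict.put(key("keywords", "foo"), 5);
        dict.put(key("title", "index"), 12);
        System.out.println(termsForField(dict, "keywords"));
    }
}
```

The point of the sketch is that the walk starts at the field's first term and stops at its last, so the cost depends on the size of "keywords" rather than on the size of the other fields.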

But in the meantime maybe one workaround is to insert a false keyword
that you know will always sort before all of your real keywords (eg
something like ".start"), and then seek to that term before running
your code below?
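
A rough sketch of the sentinel idea, again modelled with a plain sorted set rather than Lucene classes. The names SentinelSeekSketch and keywordsAfterSentinel and the NUL-separated key encoding are hypothetical; ".start" is the example marker from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class SentinelSeekSketch {
    // Marker keyword; '.' sorts before ASCII digits and letters.
    static final String SENTINEL = ".start";

    static String key(String field, String text) {
        return field + '\u0000' + text;
    }

    // Seek to the exact sentinel term, then collect the terms after it
    // until the field changes.
    static List<String> keywordsAfterSentinel(NavigableSet<String> dict, String field) {
        String start = key(field, SENTINEL);
        String prefix = field + '\u0000';
        List<String> out = new ArrayList<>();
        // tailSet(start, false) begins just after the sentinel, so the
        // marker itself never shows up in the results.
        for (String k : dict.tailSet(start, false)) {
            if (!k.startsWith(prefix)) {
                break;
            }
            out.add(k.substring(prefix.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>();
        dict.add(key("body", "zebra"));
        dict.add(key("keywords", SENTINEL));
        dict.add(key("keywords", "bar"));
        dict.add(key("keywords", "foo"));
        dict.add(key("title", "index"));
        System.out.println(keywordsAfterSentinel(dict, "keywords"));
    }
}
```

Note that this only works if no real keyword sorts before the sentinel; a keyword starting with a character below '.' (for example '-') would be skipped.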

Alternatively you could name your keywords field such that it's always
the first field, eg ".keywords" or something?

Mike

"Benjamin Pasero" <bpasero@rssowl.org> wrote:
> 
> 
> This is what I coded up till now:
> 
> // Walk every term in the index and keep only those in "keywords".
> TermEnum terms = reader.terms();
> while (terms.next()) {
>   Term term = terms.term();
>   if ("keywords".equals(term.field()))
>     keywords.put(term.text(), terms.docFreq());
> }
> terms.close();
> 
> Your solution is doing more or less the same, right?
> 
> Ben
> > Maybe I'm missing the boat, but I don't understand why TermEnum doesn't
> > work for you.
> > Try something like...
> >
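> >       TermEnum te = this.reader.terms();
> >       // skipTo() leaves the enum positioned ON the first term >= target,
> >       // so read term() before calling next() again.
> >       if (te.skipTo(new Term("keyword", ""))) {
> >           do {
> >               Term term = te.term();
> >               if (! term.field().equals("keyword")) {
> >                   break;
> >               }
> >               System.out.println(term.text());
> >           } while (te.next());
> >       }
> >       te.close();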
> >
> >
> >
> > On 6/10/07, Benjamin Pasero <bpasero@rssowl.org> wrote:
> >>
> >>
> >>
> >> Erick Erickson wrote:
> >> > Um, to return all counts of all terms in a field, what other option
> >> > *is* there except to walk the whole thing?
> >> >
> >> > Have you looked at TermEnum, TermDocs, and TermFreqVector?
> >> > For that matter, TermPositionVector might also be of some use.
> >> >
> >> > It would be easier to provide some help if you
> >> > 1> mentioned what you'd tried already
> >> > 2> mentioned what's inadequate about what you've tried.
> >> Sorry for not being clear about what I am trying to achieve. I am
> >> storing documents in my index that are made of 5 Fields. One of the
> >> Fields contains keywords that describe the document. Now, I need a
> >> fast way of retrieving these keywords together with their frequency
> >> from the index.
> >>
> >> My current solution is to use IndexReader#terms() to walk over all
> >> terms and count the ones that appear in the keyword-Field.
> >>
> >> As you can imagine, this does not scale well. The content in the
> >> keywords field is usually quite small; however, the other fields may
> >> store up to thousands of terms.
> >>
> >> What I am asking for is a way to walk all the terms of just the
> >> keyword-field in order to avoid having to walk all terms in all fields.
> >>
> >> Of course, even better would be some API that would return a TermVector
> >> from the keyword-field. But I guess TermVectors are only supported on a
> >> per-Document level and not on the index level?
> >>
> >> Regards,
> >> Ben
> >> >
> >> > Best
> >> > Erick
> >> >
> >> > On 6/9/07, Benjamin Pasero <bpasero@rssowl.org> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I wonder if this is possible:
> >> >>
> >> >> Return all Terms of a Field in the Index together with the number of
> >> >> occurrences in all documents.
> >> >>
> >> >> E.g. have 10 Documents with the Field "author" in the index, 5 of them
> >> >> having the value "foo" and 5 "bar". I would like to build a map with:
> >> >>
> >> >> [foo] -> 5
> >> >> [bar] -> 5
> >> >>
> >> >> I looked at what Luke is doing to show the top terms of a given field
> >> >> in the index and it seems to iterate over all terms (using
> >> >> IndexReader#terms()). Isn't that quite inefficient? I would at least
> >> >> expect a method IndexReader#terms(String field) to limit the terms
> >> >> to the desired field.
> >> >>
> >> >> Thanks for helping,
> >> >> Ben
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >>
> >> >>
> >> >
