lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Obtaining the (indexed) terms in a field in a particular document
Date Tue, 20 Mar 2007 20:44:45 GMT
Well, depending upon your storage requirements, it's actually
much easier than that. Assuming you're adding
this field (or a duplicate) as UN_TOKENIZED (in this case, no
need to store), you can just spin
through all the terms for that field with TermDocs/TermEnum.
The trick is to have your term start with a value of "". I.e.
new Term(field, "") to enumerate them all. See TermDocs.seek.

This is without TermVectors at all, which'll save you some space.

Erick

On 3/20/07, Donna L Gresh <gresh@us.ibm.com> wrote:
>
> Thanks, I see what you are saying.
>
> Seems that if I create the field at index time with term vectors stored,
> then I can iterate through the documents and get both the unique
> identifier and the terms, right? My original question was imprecise in
> that I'm going to want to get all the terms for *all* the documents (one
> document at a time) so I can just iterate through all the documents using
>
>                 for (int i=0; i<indexReaderR.numDocs(); i++) {
>                         TermFreqVector tfv =
> indexReaderR.getTermFreqVector(i,"my text field name");
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> gresh@us.ibm.com
>
>
>
>
> "Erick Erickson" <erickerickson@gmail.com>
> 03/20/2007 03:08 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Obtaining the (indexed) terms in a field in a particular document
>
>
>
>
>
>
> Sorry, but you have to have the Lucene document ID, which you
> can get either as part of a Hits or HitCollector or...
> or by using TermDocs/TermEnum on your unique id (my_id in
> your example).
>
> Erick
>
> On 3/20/07, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> > You can do a document.get(field), *assuming* you have stored the data
> > (Field.Store.YES) at index time, although you may not get
> > stop words.
> >
> > On 3/20/07, Donna L Gresh <gresh@us.ibm.com> wrote:
> > >
> > > My apologies if this is a simple question--
> > >
> > > How can I get all the (stemmed and stop words removed, etc.) terms in
> a
> > > particular field of a particular document?
> > >
> > > Suppose my documents each consist of two fields, one with the name
> > > "my_id"
> > > and a unique identifier, and the other being some text string
> consisting
> > > of a number of words.
> > > I'd like to get all the terms in the text string given the unique
> > > identifier.
> > >
> > > (My basic reason is to do a sort of document similarity between the
> text
> > >
> > > string and some other text string, doing a boolean query with
> > > a number of SHOULD clauses, if this makes sense; I'm welcome to
> > > suggestions of better ways to do this)
> > >
> > > Donna L. Gresh
> > >
> >
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message