lucene-java-user mailing list archives

From Ian Vink <ianv...@gmail.com>
Subject Re: IndexReader.Terms - internals
Date Mon, 11 May 2009 19:30:19 GMT
Thanks guys,
Here's what I built:

http://BahaiResearch.com

It allows any language speaker to read about another person's religion in
any language. Helps promote unity in diversity. It's open source.

Ian



On Mon, May 11, 2009 at 1:39 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> No, there is no other way to do this. If you are worried that the TermEnum
> takes too much RAM because it returns all terms, including those of other
> fields, you can be sure that no memory is wasted: unlike a normal Java
> iterator over a collection, the term enum does not load all terms up front.
> It is iterated on disk and terms are loaded from there as you go (this is
> why it throws IOException).
>
> The reason behind this behaviour is simple:
> IR.terms(term) returns all terms >= the given term (see javadoc), not all
> terms starting with a specific field. Terms are ordered by fieldname and
> then text. Because of this it looks like the TermEnum would only return
> terms of this field. One special case is:
> If the field name does not exist in the Index, IR.terms(term) would also be
> positioned on the first term >= the given one, but as the field does not
> exist, it would be the first term of the alphabetically next field name.
>
> So in general you stop iterating when no more terms are available or the
> field name of the current term != the requested field. Almost all internal
> algorithms inside Lucene (PrefixQuery, RangeQuery, ...) work this way!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: David Causse [mailto:dcausse@spotter.com]
> > Sent: Monday, May 11, 2009 6:21 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: IndexReader.Terms - internals
> >
> > Hi,
> > We noticed this behaviour too, so we do it like this:
> >
> > Map<Term, Integer> result = new HashMap<Term, Integer>();
> > TermEnum all;
> > if(matcher.fullScan()) {
> >         all = reader.terms(new Term(field));
> > } else {
> >         all = reader.terms(new Term(field, matcher.prefix()));
> > }
> > if(all == null) return result;
> > try {
> >         do {
> >                 Term t = all.term();
> >                 // stop at end of enum or at the first term of another field
> >                 // (equals() does not depend on field being interned)
> >                 if(t == null || !t.field().equals(field)) break;
> >                 // stop once the current term no longer has the prefix
> >                 if(!matcher.fullScan()
> >                                 && !t.text().startsWith(matcher.prefix())) break;
> >                 if(matcher.match(t.text()))
> >                         result.put(t, all.docFreq());
> >         } while(all.next());
> > } finally {
> >         all.close();
> > }
> > return result;
> >
> > matcher is an application-level object designed to match complex words.
> > We loop over the TermEnum until we consider that we have reached the end
> > of the interesting terms.
> > To summarize: you stop the loop when
> > 1. there is no more data in TermEnum
> > 2. the field is not the same (don't forget to intern String field if it
> > comes from outside)
> > 3. you reached non-matching Terms by checking a prefix.
> >
> > If there is a better way to do this, I'd be glad to hear of it.
> >
> > David.
> >
> > Ian Vink wrote:
> > >             IndexReader rdr = IndexReader.Open(myFolder);
> > >             TermEnum terms = rdr.Terms((new Term(myTermName, "")));
> > >
> > > (from .NET land, but it's all the same)
> > >
> > > This code works great, I can loop thru the terms nicely, but after it
> > > returns all the myTermName terms, it goes into all other terms.
> > >
> > > Is there a way to limit rdr.Terms to return only those whose field is
> > > myTermName?
> > >
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
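
The stop rule quoted above (seek to the first term >= the requested one, then
stop when the field changes or the prefix no longer matches) can be tried
without an index. The following is only a minimal sketch under stated
assumptions: FieldScanSketch, its Term class, scan() and sampleDict() are
hypothetical names made up for illustration, and a java.util.TreeSet stands
in for Lucene's on-disk term dictionary. It is not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary; NOT Lucene's API.
class FieldScanSketch {

    // A (field, text) pair, ordered by field name and then by text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }
    }

    // Returns the texts of all terms in `field` that start with `prefix`.
    // An empty prefix means "every term of the field".
    static List<String> scan(TreeSet<Term> dict, String field, String prefix) {
        List<String> out = new ArrayList<>();
        // tailSet(seek, true) plays the role of IR.terms(term):
        // it yields every term >= the seek term, across all fields.
        for (Term t : dict.tailSet(new Term(field, prefix), true)) {
            if (!t.field.equals(field)) break;      // reached the next field
            if (!t.text.startsWith(prefix)) break;  // ran past the prefix range
            out.add(t.text);
        }
        return out;
    }

    // Small made-up dictionary with three fields.
    static TreeSet<Term> sampleDict() {
        TreeSet<Term> dict = new TreeSet<>();
        dict.add(new Term("author", "ian"));
        dict.add(new Term("author", "uwe"));
        dict.add(new Term("body", "lucene"));
        dict.add(new Term("body", "term"));
        dict.add(new Term("body", "terms"));
        dict.add(new Term("title", "index"));
        return dict;
    }

    public static void main(String[] args) {
        TreeSet<Term> dict = sampleDict();
        System.out.println(scan(dict, "body", ""));     // [lucene, term, terms]
        System.out.println(scan(dict, "body", "term")); // [term, terms]
        // "content" does not exist: the seek lands on title:index,
        // so the field check stops the loop immediately.
        System.out.println(scan(dict, "content", ""));  // []
    }
}
```

The last call shows the special case of a missing field: the seek lands on
the first term of the alphabetically next field, and the field check ends
the loop before anything is collected.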
