lucene-java-user mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Term numbering and range filtering
Date Mon, 10 Nov 2008 18:54:23 GMT
Tim,

I didn't follow all the details, so this may be somewhat off,
but did you consider using TermVectors?

Regards,
Paul Elschot


On Monday 10 November 2008 19:18:38, Tim Sturge wrote:
> Yes, that is a significant issue. What I'm coming to realize is that
> either I will end up with something like
>
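> class MultiFilter {
>    String field;                  // the single-term field being filtered
>    private int[] termInDoc;       // termInDoc[doc] = number of that doc's term
>    Map<Term, Integer> termToInt;  // term -> term number
>    ...
> }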
>
> which can be built entirely on the current Lucene APIs but has
> significantly more overhead (the termToInt mapping in particular, and
> the need to construct the mapping and array on startup).
>
>
> Or I can go deep into the guts and add a data file per-segment with a
> format something like
>
> int version
> int numFields
> (int fieldNum, long offset) ^ numFields
> (int termForDoc) ^ (maxDocs * numFields)
>
> and add something like boolean storeMultiFilter; to FieldInfo, and
> something like STORE_MULTIFILTER = 0x40; to FieldInfos. I'd need to
> add an int termNum to the .tis file as well.
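
The per-segment layout quoted above can be tried out in plain Java. This
is only a sketch under stated assumptions: the class MultiFilterFileSketch,
the VERSION value, and the reading of "offset" as an absolute byte offset
from the start of the file are all hypothetical, ints are written
big-endian as java.io.DataOutputStream does, and maxDocs is assumed to be
known from the segment rather than stored in the file. Nothing here uses
or modifies Lucene's own file formats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical writer/reader for the layout in the message; not Lucene code. */
final class MultiFilterFileSketch {
    static final int VERSION = 1; // assumed value, the message does not give one

    /** termForDoc[f][doc] is the term number of doc in the f-th field. */
    static byte[] write(int[] fieldNums, int[][] termForDoc) {
        int numFields = fieldNums.length;
        int maxDocs = numFields == 0 ? 0 : termForDoc[0].length;
        // Header: version + numFields + one (int, long) pair per field.
        long headerBytes = 4 + 4 + 12L * numFields;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(VERSION);
            out.writeInt(numFields);
            for (int f = 0; f < numFields; f++) {
                out.writeInt(fieldNums[f]);
                out.writeLong(headerBytes + 4L * maxDocs * f); // start of this field's ints
            }
            for (int f = 0; f < numFields; f++) {
                for (int doc = 0; doc < maxDocs; doc++) {
                    out.writeInt(termForDoc[f][doc]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back the maxDocs term numbers stored for fieldNum. */
    static int[] readField(byte[] file, int fieldNum, int maxDocs) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            in.readInt(); // version, ignored in this sketch
            int numFields = in.readInt();
            long offset = -1;
            for (int f = 0; f < numFields; f++) {
                int num = in.readInt();
                long off = in.readLong();
                if (num == fieldNum) {
                    offset = off;
                }
            }
            if (offset < 0) {
                throw new IllegalArgumentException("field not present: " + fieldNum);
            }
            DataInputStream data = new DataInputStream(
                    new ByteArrayInputStream(file, (int) offset, 4 * maxDocs));
            int[] terms = new int[maxDocs];
            for (int doc = 0; doc < maxDocs; doc++) {
                terms[doc] = data.readInt();
            }
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Storing an offset per field means a reader can jump straight to one
field's block of maxDocs ints without scanning the others.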
>
> This is clearly a lot more work than the first solution, but it is a
> lot nicer to deal with as well. Is this interesting to anyone other
> than me?
>
> Tim
>
> On 11/9/08 12:23 PM, "Michael McCandless" <lucene@mikemccandless.com> wrote:
> > Conceivably, TermInfosReader could track the sequence number of
> > each term.
> >
> > A seek/skipTo would know which sequence number it just jumped to,
> > because the index is regular (every 128 terms by default), and then
> > each next() call could increment that.  Then retrieving this number
> > would be as costly as calling e.g. IndexReader.docFreq(Term) is now.
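
The bookkeeping described above can be illustrated with a toy term
dictionary. This is a minimal plain-Java sketch, assuming a sorted
in-memory term array plus a sparse index holding every interval-th term;
TermSequenceSketch and seek are hypothetical names and not Lucene's
TermInfosReader. A seek lands on an index entry whose sequence number
follows from the regular spacing, and every step forward increments it.

```java
import java.util.Arrays;

/** Toy stand-in for a term dictionary with a regular sparse index; not Lucene code. */
final class TermSequenceSketch {
    private final String[] terms;      // all terms, sorted
    private final String[] indexTerms; // every interval-th term, like a sparse term index
    private final int interval;

    TermSequenceSketch(String[] sortedTerms, int interval) {
        this.terms = sortedTerms;
        this.interval = interval;
        int n = (sortedTerms.length + interval - 1) / interval;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * interval];
        }
    }

    /** Returns the sequence number of the first term >= target, or -1 if there is none. */
    int seek(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        // Last index entry <= target (or the first entry if target sorts before all terms).
        int block = pos >= 0 ? pos : Math.max(0, -pos - 2);
        // The index is regular, so the entry's sequence number is known without counting.
        int seq = block * interval;
        // Each step forward (a next() call in the description above) increments the number.
        while (seq < terms.length && terms[seq].compareTo(target) < 0) {
            seq++;
        }
        return seq < terms.length ? seq : -1;
    }
}
```

With the terms a, c, e, g, i and an interval of 2, seek("f") jumps to
index entry "e" (sequence number 2) and steps once to "g", returning 3.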
> >
> > But I'm not sure how a multi-segment index would work, i.e. how
> > would MultiSegmentReader compute this for its terms?  Or maybe
> > you'd just do this per-segment?
> >
> > Mike
> >
> > Tim Sturge wrote:
> >> Hi,
> >>
> >> I'm wondering if there is any easy technique to number the terms in
> >> an index. (By number I mean map a sequence of terms to a contiguous
> >> range of integers and map terms to these numbers efficiently.)
> >>
> >> Looking at the Term class and the .tis/.tii index format, it appears
> >> that the terms are stored in an ordered and prefix-compressed format,
> >> but while there are pointers from a term to the .frq and .prx files,
> >> neither is really suitable as a sequence number.
> >>
> >> The reason I have this question is that I am writing a multi-filter
> >> for single-term fields. My index contains many fields for which each
> >> document contains a single term (e.g. date, zipcode, country), and I
> >> need to perform range queries or set matches over these fields, many
> >> of which are very inclusive (they match >10% of the total documents).
> >>
> >> A cached RangeFilter works well when there are a small number of
> >> potential options (e.g. for countries), but when there are many
> >> options (consider a date range or a set of zipcodes) there are too
> >> many potential choices to cache each possibility, and it is too
> >> inefficient to build a filter on the fly for each query (as you have
> >> to visit 10% of documents to build the filter despite the query
> >> itself matching 0.1%).
> >>
> >> Therefore I was considering building an int[reader.maxDoc()] array
> >> for each field and putting into it the term number for each document.
> >> This relies on the fact that each document contains only a single
> >> term for this field, but with it I should be able to quickly
> >> construct a "multi-filter" (that is, something that iterates the
> >> array and checks that the term is in the range or set).
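
The array-scan idea in the paragraph above can be sketched in plain Java.
Assumptions are labeled: the constructor takes one field value per
document as a stand-in for data that would really come from the index,
term numbers are simply positions in the sorted distinct terms, and
TermNumberFilterSketch and range are hypothetical names. No Lucene API is
used, so this shows only the shape of the technique.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Plain-Java sketch of a per-document term-number array; not Lucene code. */
final class TermNumberFilterSketch {
    private final String[] sortedTerms; // term number = position in this array
    private final int[] termInDoc;      // termInDoc[doc] = number of the doc's single term

    /** docValues[doc] is the single term of the field in that document (assumed input). */
    TermNumberFilterSketch(String[] docValues) {
        sortedTerms = Arrays.stream(docValues).distinct().sorted().toArray(String[]::new);
        termInDoc = new int[docValues.length];
        for (int doc = 0; doc < docValues.length; doc++) {
            termInDoc[doc] = Arrays.binarySearch(sortedTerms, docValues[doc]);
        }
    }

    /** Returns the documents whose term lies in [lower, upper], both inclusive. */
    BitSet range(String lower, String upper) {
        int lo = Arrays.binarySearch(sortedTerms, lower);
        if (lo < 0) {
            lo = -lo - 1; // number of the first term >= lower
        }
        int hi = Arrays.binarySearch(sortedTerms, upper);
        if (hi < 0) {
            hi = -hi - 2; // number of the last term <= upper
        }
        BitSet bits = new BitSet(termInDoc.length);
        // One pass over the array with two int comparisons per document.
        for (int doc = 0; doc < termInDoc.length; doc++) {
            int t = termInDoc[doc];
            if (t >= lo && t <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

Because the term numbers follow term order, a range over terms becomes a
range over ints. A set match could reuse the same scan by testing each
term number against a BitSet of accepted numbers.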
> >>
> >> Right now it looks like I can do some very ugly surgery and perhaps
> >> use the offset to the .prx file even though it is not contiguous. But
> >> I'm hoping there is a better technique that I'm just not seeing right
> >> now.
> >>
> >> Thanks,
> >>
> >> Tim
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>


