lucene-java-user mailing list archives

From Marcos Juarez Lopez <mjua...@gmail.com>
Subject Re: Custom FieldInfo.IndexOptions
Date Fri, 20 Sep 2013 15:59:43 GMT
Thanks for your quick response, Mike.  I'll be sure to pay more attention
to amount vs. quantity in the future :)

Just one clarification.  I didn't mention that we are actually using phrase
and proximity queries, which I believe use the position information.  If
that's the case, is there a way to specify something like DOCS_AND_POSITIONS
as an IndexOptions value?  Based on the enum in FieldInfo, it seems you have
to include frequencies if you want positional information.  Does positional
information require frequencies somehow, or is it just an option that isn't
currently supported?
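
To make the question concrete, here is a minimal stand-alone sketch of the
enum as it appears in Lucene 4.x.  It is a model, not the Lucene class: the
four value names mirror FieldInfo.IndexOptions, while IndexOptionsSketch,
hasFreqs, hasPositions and cheapestFor are hypothetical names made up for
illustration.  It only shows that the levels are cumulative, so the lowest
level that carries positions also carries frequencies.

```java
public class IndexOptionsSketch {

    // Stand-alone mirror of the four values of Lucene 4.x
    // FieldInfo.IndexOptions, in the same declaration order.
    // This is NOT the Lucene class itself.
    enum IndexOptions {
        DOCS_ONLY,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    // Each level includes everything the previous level stores,
    // so a capability check is an ordinal comparison.
    static boolean hasFreqs(IndexOptions o) {
        return o.compareTo(IndexOptions.DOCS_AND_FREQS) >= 0;
    }

    static boolean hasPositions(IndexOptions o) {
        return o.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    }

    // Hypothetical helper: lowest level that satisfies the requirements.
    static IndexOptions cheapestFor(boolean needFreqs, boolean needPositions) {
        for (IndexOptions o : IndexOptions.values()) {
            boolean freqsOk = !needFreqs || hasFreqs(o);
            boolean positionsOk = !needPositions || hasPositions(o);
            if (freqsOk && positionsOk) {
                return o;
            }
        }
        // Unreachable: the last level satisfies every combination.
        throw new IllegalStateException("no matching level");
    }

    public static void main(String[] args) {
        // Phrase/proximity queries need positions; scoring is not needed.
        // There is no positions-without-freqs level to pick.
        System.out.println(cheapestFor(false, true));
        // prints "DOCS_AND_FREQS_AND_POSITIONS"
    }
}
```

With the real Lucene 4.x classes, the matching call on a FieldType would be
setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS).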

Thanks,

Marcos Juarez


On Fri, Sep 20, 2013 at 5:02 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Sep 19, 2013 at 7:18 PM, Marcos Juarez Lopez <mjuarez@gmail.com>
> wrote:
> > Hi,
> >
> > I'm trying to optimize an index we have, and one thing that has come up
> > recently is that we're not really using term frequencies, and we don't
> > need any scoring.  We noticed that the term frequencies (.doc files) are
> > a significant chunk of the total index size, and we'd like to reduce
> > those, or eliminate them, if at all possible.
>
> You should index with DOCS_ONLY; you will still have .doc files, but
> they will be smaller since they won't store frequencies.  Also, you
> won't have .pos files anymore ... (unless other fields are still
> indexed "normally").
>
> You should also omit norms (no more / smaller .nrm files).
>
> > We don't do any sort of ranking or scoring, and so I believe we wouldn't
> > need to store, or to use, any term frequencies (please correct me if I'm
> > wrong on this assumption).  The way our indexes work, we want to always
> > return all matching documents, regardless of the amount of documents
> > returned.
>
> My silly pet peeve: it really should be "number of documents" not
> "amount of documents".  You can have an amount of uncountable things
> like water and happiness, but things that can be counted are "numbers
> of ...".
>
> > I've been looking at several things, specifically the
> > FieldInfo.IndexOptions and creating a custom FieldType that implements
> > IndexableFieldType, so that it would not store any of the TermVector
> > info.  However, I want to make sure I'm on the right path, before I
> > start changing our app.
>
> That's exactly the right approach.  You can fork an existing FieldType
> and tweak it, e.g.:
>
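>   // Lucene 4.x: IndexOptions is nested inside FieldInfo
>   import org.apache.lucene.document.FieldType;
>   import org.apache.lucene.document.TextField;
>   import org.apache.lucene.index.FieldInfo.IndexOptions;
>   FieldType myType = new FieldType(TextField.TYPE_NOT_STORED);
>   myType.setIndexOptions(IndexOptions.DOCS_ONLY);
>   myType.setOmitNorms(true);
>   myType.freeze();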
>
> Do that once up front in your app, then, per doc, be sure to use myType,
> e.g.:
>
>   doc.add(new Field("body", contents, myType));
>
> Mike McCandless
>
> http://blog.mikemccandless.com
