lucene-java-user mailing list archives

From Graham Sugden <gras...@gmail.com>
Subject Multiple fields derived from same source text?
Date Thu, 18 Aug 2011 16:23:42 GMT
Hi,

I am just beginning to implement text indexing for an application I am
building and am not quite sure of a few things. The documents indexed will
be in various languages, ranging mostly from short notes to ~20 page
articles (with the occasional book-length text). So my plan is to have
separate indexes for each language, each of which would contain a number of
fields created from the same text analyzed in a number of ways. So for an
English document I might have the fields

         stem, suffix, token

generated from the same text with respectively

        an EnglishAnalyzer(), a custom analyzer with a ReverseStringFilter(),
        and StandardAnalyzer().

As doing things this way seems to mean having the text go through the Standard
and Stopword filters 3 times, once for each field, I am wondering if there is
a way to do something like this that could avoid that duplicate processing*,
whether with custom analyzers, an implementation of PerFieldAnalyzerWrapper,
or even out of the box (I'm very new to Lucene). Maybe a
way to store the result of the analysis for the "token" field**, to be
reused as the starting point for the analysis for the "stem" and "suffix"
fields (which would then just need the application of a stemming filter and
the ReverseStringFilter respectively)?
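
To make the idea concrete, the data flow described above could be sketched
roughly like this in plain Java. This is only an illustration of the concept
and uses no Lucene classes: SharedAnalysisSketch, Token, baseTokens, toyStem
and the tiny stopword list are all hypothetical names, and the "stemmer"
merely strips a trailing "s". The point is that the shared tokenizing and
stopword step runs once, and each derived field keeps the original offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only: none of these types are Lucene classes.
class SharedAnalysisSketch {

    // A term plus the character offsets of the source text it came from.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Tiny stopword list, assumed purely for the example.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Shared step, run once per text: split into words, lowercase, drop stopwords.
    static List<Token> baseTokens(String text) {
        List<Token> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(term)) {
                out.add(new Token(term, m.start(), m.end()));
            }
        }
        return out;
    }

    // Toy stand-in for a real stemmer: strips one trailing "s" from longer words.
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    // Derives the "token", "stem" and "suffix" fields from one shared token list.
    static Map<String, List<Token>> analyze(String text) {
        List<Token> base = baseTokens(text); // the only pass over the raw text
        List<Token> stem = new ArrayList<>();
        List<Token> suffix = new ArrayList<>();
        for (Token t : base) {
            // Each derived token reuses the offsets of the base token.
            stem.add(new Token(toyStem(t.term), t.start, t.end));
            suffix.add(new Token(reverse(t.term), t.start, t.end));
        }
        Map<String, List<Token>> fields = new LinkedHashMap<>();
        fields.put("token", base);
        fields.put("stem", stem);
        fields.put("suffix", suffix);
        return fields;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<Token>> e : analyze("The cats of Rome").entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey()).append(":");
            for (Token t : e.getValue()) {
                line.append(' ').append(t.term)
                    .append('[').append(t.start).append(',').append(t.end).append(']');
            }
            System.out.println(line);
        }
    }
}
```

For the input "The cats of Rome", the "token" field holds "cats" and "rome",
the "stem" field "cat" and "rome", and the "suffix" field "stac" and "emor",
all carrying the offsets of the original words. Whether Lucene can share the
intermediate result between fields in this way is exactly the open question.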

Note that I am keen to avoid any pre-analysis processing of the text, as I
would like to keep the offsets etc. in line with the sources (stored
externally) for hit highlighting when I eventually get that far!

Any help/advice greatly appreciated.

Thanks and kind regards,

graham

* In languages requiring removal of diacritics for some fields and not
others, etc., there will, I guess, be more duplication.
** Would this be achievable with reusableTokenStream()? With my Google
skills I haven't been able to get any clear idea of how to go about using
it.
