lucene-java-user mailing list archives

From Ryan McKinley <>
Subject Re: Tagging documents as they are indexed -- Is FST a reasonable approach?
Date Wed, 04 Jan 2012 00:04:00 GMT
On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir <> wrote:
> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley <> wrote:
>> Just brainstorming, it seems like an FST could be a good/efficient way
>> to match documents.  My plan would be to:
>> 1. Use an Analyzer to create a TokenStream for each place name.  From
>> the TokenStream create an FST<docid> -- this would have to pick some
>> impossible character for the token separator.
>> 2. While indexing, create a TokenStream from the input text.  For each
>> token, try to follow the Arc to a match.  If there is a match, add it
>> to the document.
>> Does this approach seem reasonable?
>> Is there some standard way to do this that I am missing?
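The two steps above could be prototyped without Lucene at all -- a minimal sketch, using a token-level trie as a stand-in for the FST<docid> (class and method names are made up for illustration; inputs are assumed pre-tokenized by the same Analyzer):

```java
import java.util.*;

// Simplified sketch of the matching idea: build a trie over each place
// name's token sequence, then at each input position follow arcs as far
// as they go, keeping the longest complete match.
public class PlaceMatcher {
    // Each node maps a token to the next node; docId >= 0 marks the end
    // of a complete place name (the FST's output value in the real plan).
    static class Node {
        Map<String, Node> next = new HashMap<>();
        int docId = -1;
    }

    private final Node root = new Node();

    // Step 1: add one place name (as analyzed tokens) with its doc id.
    public void add(List<String> tokens, int docId) {
        Node n = root;
        for (String t : tokens) {
            n = n.next.computeIfAbsent(t, k -> new Node());
        }
        n.docId = docId;
    }

    // Step 2: walk the input tokens; at each start position follow the
    // arcs and record the longest match found, if any.
    public List<Integer> match(List<String> tokens) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Node n = root;
            int best = -1;
            for (int j = i; j < tokens.size(); j++) {
                n = n.next.get(tokens.get(j));
                if (n == null) break;
                if (n.docId >= 0) best = n.docId;
            }
            if (best >= 0) hits.add(best);
        }
        return hits;
    }

    public static void main(String[] args) {
        PlaceMatcher m = new PlaceMatcher();
        m.add(Arrays.asList("new", "york"), 1);
        m.add(Arrays.asList("new", "york", "city"), 2);
        // longest match wins: prints [2]
        System.out.println(m.match(Arrays.asList("visit", "new", "york", "city", "soon")));
    }
}
```

A trie wastes memory compared to Lucene's FST (no suffix sharing), but the arc-following logic is the same shape.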
> I'm not really sure this will fit well inside a tokenstream at all, as
> it seems more like the kind of thing you would do before analysis,

For sure -- any pointers on how to best do this?

It seems natural to use the existing Lucene analysis infrastructure to:
 - normalize latin characters
 - lowercase
 - remove stopwords
 - break camelCase
 - etc.

Is there something else I should be looking at that is better suited to this?
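As a rough stand-in for that filter chain, those steps might look like the following plain-Java sketch (the stopword list is made up; a real chain would use Lucene's ASCIIFoldingFilter, LowerCaseFilter, StopFilter, and a word-delimiter filter instead):

```java
import java.text.Normalizer;
import java.util.*;
import java.util.stream.*;

// Hypothetical normalization pipeline mirroring the steps listed above.
public class NormalizeSketch {
    static final Set<String> STOPWORDS = Set.of("the", "of", "a");

    public static List<String> tokens(String text) {
        // break camelCase before lowercasing (NewYork -> New York)
        String split = text.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
        // normalize latin characters: NFD-decompose, then strip combining marks
        String folded = Normalizer.normalize(split, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        // lowercase, tokenize on non-word chars, drop stopwords
        return Arrays.stream(folded.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("São Paulo"));         // [sao, paulo]
        System.out.println(tokens("the NewYork area"));  // [new, york, area]
    }
}
```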

> and at analysis you would be worried about how you are going to index the
> text for search, what you are going to do with the location (separate
> field or whatever), etc.
> apart from that - as far as whether or not to use an FST, it seems ok
> to me, especially if the data used for geocoding is pretty static.
> if you want to prototype using an FST inside a tokenstream to do this,
> just convert your geocoding data into a synonyms file (mapping to the
> location), use SynonymFilter, and you are done.
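Concretely, the geocoding data converted to the Solr-style synonyms format might look something like this (place names and the `location_*` payload tokens are made up for illustration):

```
# hypothetical synonyms.txt: each surface form maps to an
# artificial location token that gets injected into the stream
new york, new york city, nyc => location_nyc
san francisco, sf => location_sf
```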

excellent - i will look there.


