lucene-java-user mailing list archives

From Robert Muir
Subject Re: Tagging documents as they are indexed -- Is FST a reasonable approach?
Date Wed, 04 Jan 2012 00:26:57 GMT
On Tue, Jan 3, 2012 at 7:04 PM, Ryan McKinley wrote:
> On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir wrote:
>> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote:
>>> Just brainstorming, it seems like an FST could be a good/efficient way
>>> to match documents.  My plan would be to:
>>> 1. Use an Analyzer to create a TokenStream for each place name.  From
>>> the TokenStream create an FST<docid> -- this would have to pick some
>>> impossible character for the token separator.
>>> 2. While indexing, create a TokenStream from the input text.  For each
>>> token, try to follow the Arc to a match.  If there is a match, add it
>>> to the document.
>>> Does this approach seem reasonable?
>>> Is there some standard way to do this that I am missing?
>> I'm not really sure this will fit well inside a tokenstream at all, as
>> it seems more like the kind of thing you would do before analysis,
> For sure -- any pointers on how to best do this?
> It seems like using the existing lucene infrastructure to:
>  - normalize latin characters
>  - lowercase
>  - remove stopwords
>  - break camelcase
>  - etc
> is there something else I should be looking at that is better suited to do this?

Hmm, just inferring from your text, it sounds like you are also trying
to do some fuzzy matching/place name standardization.

So I would tend to look more towards record-linkage techniques as a
start, but I think this is going to depend a lot on what your app
requires and the domain specifics of place names.
e.g. if I have some valid tiny city sdfsdfdsfs, RI, then maybe it's ok
to geocode it to RI, since the state is small anyway.
And washington, DC is ok for DC.
But if it's just jacksonville, I know this city exists in both Florida
and North Carolina at least, so maybe it's bogus to geocode it to
anything, etc, etc.
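
To make the three cases above concrete, here is a rough sketch in plain
Java (no Lucene classes). Everything in it is an assumption for
illustration only: the PlaceResolver class, its add/resolve methods, and
the tiny in-memory gazetteer are hypothetical, and real record linkage
would need fuzzy matching on top of the simple trim/lowercase
normalization used here. The idea is just: prefer a city-level match,
fall back to the state when one is given, and refuse to guess when a
bare name is ambiguous.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: a toy gazetteer lookup.
class PlaceResolver {
    // normalized city name -> set of state codes where that name occurs
    private final Map<String, Set<String>> statesByCity = new HashMap<>();

    // Assumption: trim + lowercase stands in for a real analysis chain.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    void add(String city, String state) {
        statesByCity.computeIfAbsent(normalize(city), k -> new TreeSet<>())
                    .add(state.trim().toUpperCase(Locale.ROOT));
    }

    // stateHint may be null when the input text names only a city.
    Optional<String> resolve(String city, String stateHint) {
        String key = normalize(city);
        Set<String> states =
            statesByCity.getOrDefault(key, Collections.emptySet());
        if (stateHint != null) {
            String st = stateHint.trim().toUpperCase(Locale.ROOT);
            // Known city in that state: city-level match.
            // Unknown city: fall back to the state itself.
            return Optional.of(states.contains(st) ? key + ", " + st : st);
        }
        if (states.size() == 1) {
            // Bare name that is unique in the gazetteer.
            return Optional.of(key + ", " + states.iterator().next());
        }
        // Unknown or ambiguous bare name: do not geocode.
        return Optional.empty();
    }

    public static void main(String[] args) {
        PlaceResolver r = new PlaceResolver();
        r.add("Jacksonville", "FL");
        r.add("Jacksonville", "NC");
        r.add("Washington", "DC");
        System.out.println(r.resolve("sdfsdfdsfs", "RI"));
        System.out.println(r.resolve("Washington", "DC"));
        System.out.println(r.resolve("jacksonville", null));
    }
}
```

Whether the state-level fallback is acceptable is exactly the
app-specific call mentioned above: it may be fine for RI and much less
useful for a large state.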

