lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Tagging documents as they are indexed -- Is FST a reasonable approach?
Date Tue, 03 Jan 2012 21:44:18 GMT
On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley <ryantxu@gmail.com> wrote:
>
> Just brainstorming, it seems like an FST could be a good/efficient way
> to match documents.  My plan would be to:
>
> 1. Use an Analyzer to create a TokenStream for each place name.  From
> the TokenStream create an FST<docid> -- this would have to pick some
> impossible character for the token separator.
> 2. While indexing, create a TokenStream from the input text.  For each
> token, try to follow the Arc to a match.  If there is a match, add it
> to the document.
>
> Does this approach seem reasonable?
> Is there some standard way to do this that I am missing?
>

I'm not really sure this will fit well inside a TokenStream at all. It
seems more like the kind of thing you would do before analysis; at
analysis time you would be worried about how you are going to index the
text for search, what you are going to do with the location (separate
field or whatever), etc.

Apart from that, as for whether or not to use an FST, it seems OK
to me, especially if the data used for geocoding is pretty static.
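
To make the matching step from the quoted plan concrete, here is a
minimal sketch that walks a token-level trie and keeps the longest place
name starting at each position. The trie is only a simplified stand-in
for an FST (it shares prefixes but not suffixes), it does not use any
Lucene API, and the class name PlaceTagger, its methods, and the integer
place IDs are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: token-level trie standing in for an FST of place names. */
class PlaceTagger {
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        Integer placeId; // non-null when a complete place name ends at this node
    }

    private final Node root = new Node();

    /** Registers an already-analyzed place name (one entry per token) under an ID. */
    void add(List<String> tokens, int placeId) {
        Node node = root;
        for (String token : tokens) {
            node = node.next.computeIfAbsent(token, k -> new Node());
        }
        node.placeId = placeId;
    }

    /** Scans left to right and returns the ID of the longest match at each position. */
    List<Integer> tag(List<String> tokens) {
        List<Integer> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            Integer best = null;
            int bestEnd = i;
            // Follow one arc per token until no arc exists for the next token.
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) {
                    break;
                }
                if (node.placeId != null) {
                    best = node.placeId;
                    bestEnd = j + 1;
                }
            }
            if (best != null) {
                found.add(best);
                i = bestEnd; // skip past the matched name
            } else {
                i++;
            }
        }
        return found;
    }
}
```

Because names are stored as token sequences, no special separator
character is needed in this sketch; a real FST keyed on characters or
bytes would need one, as the quoted plan notes.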

If you want to prototype using an FST inside a TokenStream to do this,
just convert your geocoding data into a synonyms file (mapping to the
location), use SynonymFilter, and you are done.
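
As a rough illustration, such a file in the Solr synonyms format could
look like the fragment below. The place names and the place_N
identifiers are made up for this example; the right-hand side would be
whatever location token you want added to the document.

```text
# hypothetical entries: place name variants => location identifier
new york city, nyc => place_1
san francisco, sf => place_2
paris => place_3
```

Each line maps one or more (possibly multi-word) place names on the left
to the replacement token on the right.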

-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

