lucene-java-user mailing list archives

From Ryan McKinley <>
Subject Tagging documents as they are indexed -- Is FST a reasonable approach?
Date Tue, 03 Jan 2012 21:30:22 GMT
Happy new year!

I'm working on a simple way to geocode documents as they are indexed.
I'm hoping to use existing Lucene infrastructure to do this as much as
possible.  My plan is to build an index of known place names then look
for matches in incoming text.  When there is a match, some extra
fields will get added to the index.

The known place list will include things like:
 * The People's Republic of China
 * Rome
 * New York

I want to match documents where one of these phrases (normalized for
capitalization/punctuation/etc) appears in the document.  It looks
like MemoryIndex was made to do something like this: Create a
MemoryIndex for each item you want to match, then run the document
against each possible value and see if it matches.  Without testing
this approach, it seems kinda crazy if we have ~100K+ place names.  I
am also concerned about how this would work with long phrases and
things that may match with "The Peoples Republic of *".
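
To make the cost concern concrete, here is a rough plain-Java sketch of the
"check the document against every place name" loop.  It does not use
MemoryIndex at all; NaivePlaceMatcher, normalize and match are made-up names
for illustration, and the normalization rules (lowercase, drop apostrophes,
treat other punctuation as a space) are just an assumption about what the
Analyzer would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: a plain-Java stand-in for the
// "run the document against each possible value" loop.
class NaivePlaceMatcher {

    // Assumed normalization: lowercase, drop apostrophes, and collapse any
    // other run of non-alphanumeric characters into a single space.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replaceAll("['\u2019]", "")
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
    }

    // Returns the place names whose normalized form appears as a whole
    // phrase in the normalized document.  One scan of the document per name.
    static List<String> match(List<String> placeNames, String document) {
        String doc = " " + normalize(document) + " ";
        List<String> hits = new ArrayList<>();
        for (String name : placeNames) {
            String phrase = normalize(name);
            if (phrase.isEmpty()) {
                continue; // nothing left to match after normalization
            }
            // Padding with spaces keeps "Rome" from matching inside "Romeo".
            if (doc.contains(" " + phrase + " ")) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

Every document is scanned once per place name here, so the work grows
linearly with the size of the list, which is the part that worries me at
~100K+ names.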

Just brainstorming, it seems like an FST could be a good/efficient way
to match documents.  My plan would be to:

1. Use an Analyzer to create a TokenStream for each place name.  From
the TokenStream create an FST<docid> -- this would have to pick some
impossible character for the token separator.
2. While indexing, create a TokenStream from the input text.  For each
token, try to follow the Arc to a match.  If there is a match, add it
to the document.
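
A minimal sketch of the two steps above, with a HashMap-based token trie
standing in for the FST so that it runs without Lucene.  PlaceTagger,
tokenize, add and tag are hypothetical names, tokenize is only a crude
stand-in for an Analyzer, and the integer ids play the role of the docid
output.  Because the trie is keyed on whole tokens it needs no separator
character; an FST built over the characters of the joined tokens would.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a token trie used in place of an FST<docid>.
class PlaceTagger {

    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        int placeId = -1; // -1 means no place name ends at this node
    }

    private final Node root = new Node();

    // Stand-in for an Analyzer: lowercase, drop apostrophes, and split on
    // runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("['\u2019]", "");
        for (String token : cleaned.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 1: add one place name as a path of tokens ending in its id.
    void add(String placeName, int placeId) {
        Node node = root;
        for (String token : tokenize(placeName)) {
            node = node.next.computeIfAbsent(token, k -> new Node());
        }
        if (node != root) {
            node.placeId = placeId;
        }
    }

    // Step 2: from each token position, follow arcs as far as they go and
    // keep the longest place name found, then continue after that match.
    List<Integer> tag(String text) {
        List<String> tokens = tokenize(text);
        List<Integer> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int bestId = -1;
            int bestEnd = i;
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) {
                    break; // no arc for this token
                }
                if (node.placeId >= 0) {
                    bestId = node.placeId;
                    bestEnd = j + 1;
                }
            }
            if (bestId >= 0) {
                found.add(bestId);
                i = bestEnd;
            } else {
                i++;
            }
        }
        return found;
    }
}
```

Lookup cost here depends on the length of the text and of the longest
place name, not on how many place names are loaded.  Preferring the
longest match (so "New York City" wins over "New York") is a choice made
for this sketch, not something the plan above requires.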

Does this approach seem reasonable?
Is there some standard way to do this that I am missing?

thanks for any pointers!


The two approaches I am considering:

1. MemoryIndex -- build a MemoryIndex for each place name.  Check every index

2. FST -- Use an Analyzer to get a TokenStream for each input name and
build an FST<docid> based on the input.  Then analyze the text while
indexing and use the TokenStream to
