lucene-java-user mailing list archives

From Julien Nioche <>
Subject Re: Tagging documents as they are indexed -- Is FST a reasonable approach?
Date Wed, 04 Jan 2012 09:18:12 GMT
Hi Ryan,

Why not preprocess your documents with tools like Apache UIMA, GATE or
OpenNLP before indexing them in Lucene? GATE, for instance, has FST-based
gazetteers that would be a perfect fit for your place names; AFAIK there is
also a Dictionary component for UIMA that would be a good match.


On 3 January 2012 21:30, Ryan McKinley <> wrote:

> Happy new year!
> I'm working on a simple way to geocode documents as they are indexed.
> I'm hoping to use existing Lucene infrastructure to do this as much as
> possible.  My plan is to build an index of known place names then look
> for matches in incoming text.  When there is a match, some extra
> fields will get added to the index.
> The known place list will include things like:
>  * The People's Republic of China
>  * Rome
>  * New York
> I want to match documents where this phrase (normalized for
> capitalization/punctuation/etc) appears in the document.  It looks
> like MemoryIndex was made to do something like this: Create a
> MemoryIndex for each item you want to match, then run the document
> against each possible value and see if it matches.  Without testing
> this approach, it seems kinda crazy if we have ~100K+ placenames.  I
> am also concerned how this would work with long phrases and things
> that may match with "The Peoples Republic of *"
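The MemoryIndex idea above boils down to one phrase check per known place name, per document. A minimal plain-Java sketch of that cost model (the class and method names here are hypothetical illustrations, not Lucene's MemoryIndex API):

```java
import java.util.*;

// Sketch of the per-name approach: each incoming document is checked
// against every known place name. With ~100K names this is one
// normalized phrase search per name, per document.
class NaiveTagger {
    private final Map<String, Integer> placeIds = new LinkedHashMap<>();

    // Rough normalization standing in for an Analyzer: lowercase,
    // collapse punctuation/whitespace, pad with spaces for whole-phrase matching.
    static String normalize(String s) {
        return " " + s.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim() + " ";
    }

    void add(String placeName, int id) {
        placeIds.put(normalize(placeName), id);
    }

    List<Integer> tag(String docText) {
        String doc = normalize(docText);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : placeIds.entrySet()) {
            if (doc.contains(e.getKey())) hits.add(e.getValue()); // one scan per name
        }
        return hits;
    }
}
```

The loop makes the scaling concern concrete: the work per document grows linearly with the number of place names, which is what motivates a single-pass structure like an FST instead.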
> Just brainstorming, it seems like an FST could be a good/efficient way
> to match documents.  My plan would be to:
> 1. Use an Analyzer to create a TokenStream for each place name.  From
> the TokenStream create an FST<docid> -- this would have to pick some
> impossible character for the token separator.
> 2. While indexing, create a TokenStream from the input text.  For each
> token, try to follow the Arc to a match.  If there is a match, add it
> to the document.
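The two-step plan above can be sketched in plain Java, using a token trie as a simplified stand-in for the FST (all names here are hypothetical; a real implementation would use Lucene's analysis chain and FST classes):

```java
import java.util.*;

// Sketch only: a token trie standing in for the FST. Each node maps a
// token to a child node; a node carries a place id when a full place
// name ends there.
class PlaceMatcher {
    private final Map<String, PlaceMatcher> children = new HashMap<>();
    private Integer placeId; // non-null when a complete place name ends here

    // Crude normalization standing in for an Analyzer: lowercase and
    // strip punctuation before splitting into tokens.
    static String[] tokenize(String text) {
        return text.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+");
    }

    // Step 1: build the structure from each place name's token stream.
    void add(String placeName, int id) {
        PlaceMatcher node = this;
        for (String tok : tokenize(placeName)) {
            node = node.children.computeIfAbsent(tok, t -> new PlaceMatcher());
        }
        node.placeId = id;
    }

    // Step 2: single pass over the document tokens; at each position,
    // walk the trie as far as it matches (analogous to following FST arcs).
    Set<Integer> match(String text) {
        String[] toks = tokenize(text);
        Set<Integer> hits = new HashSet<>();
        for (int start = 0; start < toks.length; start++) {
            PlaceMatcher node = this;
            for (int i = start; i < toks.length; i++) {
                node = node.children.get(toks[i]);
                if (node == null) break;
                if (node.placeId != null) hits.add(node.placeId);
            }
        }
        return hits;
    }
}
```

Unlike the per-name scan, matching here is one pass over the document regardless of how many place names were loaded; the trie also naturally handles one name being a prefix of another (e.g. "New York" vs. "New York City").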
> Does this approach seem reasonable?
> Is there some standard way to do this that I am missing?
> thanks for any pointers!
> ryan
> The two approaches I am considering:
> 1. MemoryIndex -- build a MemoryIndex for each place name.  Check every
> index
> 2. FST -- Use an Analyzer to get a TokenStream for each input name and
> build an FST<docid> based on the input.  Then analyze the text while
> indexing and use the TokenStream to follow arcs in the FST, looking for
> matches.

Open Source Solutions for Text Engineering
