lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: How to index Named Entities
Date Tue, 03 Mar 2009 14:17:11 GMT
Have a look at the TeeTokenFilter and the SinkTokenizer.  You could
extend those to do a lookup in your list and, whenever you have a
match, add the token to the Sink, which then lets you index a
separate field containing your named entities.  The TeeTokenFilter and
SinkTokenizer are located in the contrib/analysis package of the latest
Lucene release.  Alternatively, you could implement a TokenFilter
that adds a payload to a term whenever it comes across a named entity.
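
To make the tee/sink idea concrete, here is a minimal sketch in plain
Java.  It deliberately uses no Lucene classes: EntityTee, the field
names, and the lists are all hypothetical, and a real implementation
would do the same lookup inside a TokenFilter.  Every token goes to a
"content" field, and any token found in a list is also copied to a
field named after its entity type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java sketch of the "tee and sink" idea; no Lucene API is used.
// All names here (EntityTee, field names, lists) are hypothetical.
final class EntityTee {
    // Lookup list: lower-cased surface form -> entity type.
    private final Map<String, String> gazetteer = new LinkedHashMap<>();

    // Registers one or more names under an entity type, e.g. "river".
    EntityTee add(String type, String... names) {
        for (String name : names) {
            gazetteer.put(name.toLowerCase(Locale.ROOT), type);
        }
        return this;
    }

    // Returns field name -> tokens. Every token goes to "content";
    // tokens found in the lookup list are also copied to a per-type field.
    Map<String, List<String>> split(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("content", new ArrayList<>());
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter produces an empty first piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            fields.get("content").add(token);
            String type = gazetteer.get(token);
            if (type != null) {
                fields.computeIfAbsent(type, k -> new ArrayList<>()).add(token);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        EntityTee tee = new EntityTee()
            .add("river", "Nile")
            .add("country", "Ethiopia");
        System.out.println(tee.split("the source of Nile is Ethiopia"));
    }
}
```

Each key of the returned map would then become one field of the indexed
document, so a query can be restricted to, say, the "country" field.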

You might also look into preprocessing with OpenNLP, LingPipe, or a
similar tool that can go beyond plain list lookup for named entities.
List-based approaches are useful, but they also tend to be brittle.
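
As a small illustration of that brittleness, the sketch below does
exact, token-by-token lookup against a made-up river list (the class
name and list contents are assumptions, not anything from Lucene).
Multi-word names never match a single token, inflected forms are
missed, and ambiguous names are tagged regardless of context.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical example of exact list lookup and where it falls short.
final class ListLookupLimits {
    // Made-up river list; entries are lower-cased surface forms.
    static final Set<String> RIVERS = Set.of("nile", "blue nile", "niger");

    // Exact single-token lookup, as a naive list-based tagger would do it.
    static boolean isRiver(String token) {
        return RIVERS.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isRiver("Nile"));   // listed name matches
        System.out.println(isRiver("Blue"));   // first word of "Blue Nile" alone does not
        System.out.println(isRiver("Niles"));  // variant spelling is missed
        System.out.println(isRiver("Niger"));  // matches even when the text means the country
    }
}
```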

Using OpenNLP is described in chapter 5 of my book, and I believe Tom
(my coauthor) even has code in there for plugging OpenNLP into the
Lucene analysis process.

On Mar 3, 2009, at 1:13 AM, Seid Mohammed wrote:

> I want to index document contents in two ways: once as simple
> content, and once as named entities.
> The scenario is like this.
> If I have this document, "the source of Nile is Ethiopia",
> then I want to index "source" as normal content, "Nile" as a river
> name, and "Ethiopia" as a country name, so that later, if I ask the
> question "where is the source of Nile", it should retrieve Ethiopia
> as the answer.
> Note: I will have lists of river names, country names, ... so that
> during indexing I can compare every word of a document with my lists.
> Thanks a lot,
> Seid M

Grant Ingersoll

