lucene-general mailing list archives

From Grant Ingersoll
Subject Re: advice on integrating NLP engine during indexing
Date Thu, 20 Dec 2007 14:55:33 GMT
FYI: you will get a broader audience on java-user. This list is mostly
for discussion of higher-level Lucene things that affect two or more
of the Lucene projects.

That being said, a custom analyzer is the way to go to redact the
appropriate information.  If you have your files in some sort of
markup, you can easily create fields to contain the various metadata
that you have generated (e.g. history of violence).  One new thing
that I have been intrigued with for use in NLP applications is the new
TeeTokenFilter and SinkTokenizer, which can be used to siphon off
interesting tokens for other fields based on the tokens of an existing
field.  This can save you from reanalyzing the same content over and
over for different analysis needs.  This is, however, advanced usage
for now (although I hope it will become more common).
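
To make the single-pass idea concrete, here is a rough sketch in plain
Java.  It does not use the Lucene API at all: TaggedToken, Result, and
analyze() are hypothetical names made up for illustration, and the tag
labels ("PERSON", "NEGATED") are only a guess at what an NLP engine
might emit.  It just shows the shape of the approach: walk the tagged
tokens once, drop the ones that should be redacted from the main field,
and siphon the interesting ones off for a second field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java stand-in for the tee/sink idea; none of these types are Lucene classes. */
public class TeeSinkSketch {

    /** Hypothetical NLP output: a term plus the label the engine gave it ("" if untagged). */
    public static final class TaggedToken {
        public final String term;
        public final String tag;

        public TaggedToken(String term, String tag) {
            this.term = term;
            this.tag = tag;
        }
    }

    /** What one pass produces: terms for the main field and entries for a metadata field. */
    public static final class Result {
        public final List<String> bodyTerms = new ArrayList<String>();
        public final List<String> tagTerms = new ArrayList<String>();
    }

    /**
     * Walks the tokens once. Tokens whose tag is in siphonTags are copied to the
     * metadata list as "TAG:term"; tokens whose tag is in redactTags are kept out
     * of the body list. A tag may appear in both sets.
     */
    public static Result analyze(List<TaggedToken> tokens,
                                 Set<String> redactTags,
                                 Set<String> siphonTags) {
        Result result = new Result();
        for (TaggedToken token : tokens) {
            if (siphonTags.contains(token.tag)) {
                result.tagTerms.add(token.tag + ":" + token.term);
            }
            if (!redactTags.contains(token.tag)) {
                result.bodyTerms.add(token.term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed tagging of "Smith has no history of violence".
        List<TaggedToken> tokens = Arrays.asList(
                new TaggedToken("Smith", "PERSON"),
                new TaggedToken("has", ""),
                new TaggedToken("no", ""),
                new TaggedToken("history", ""),
                new TaggedToken("of", ""),
                new TaggedToken("violence", "NEGATED"));

        Set<String> redact = new HashSet<String>(Arrays.asList("PERSON", "NEGATED"));
        Set<String> siphon = new HashSet<String>(Arrays.asList("NEGATED"));

        Result result = analyze(tokens, redact, siphon);
        System.out.println("body field: " + result.bodyTerms);
        System.out.println("tag field:  " + result.tagTerms);
    }
}
```

Note that "PERSON" is redacted but not siphoned, so the name never
reaches either field, while the negated term is kept out of the body
but recorded as metadata.  In a real analyzer the same logic would live
in a token filter rather than a loop over a list.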


On Dec 20, 2007, at 9:48 AM, 1world1love wrote:

> Greetings all. I am new to Lucene and am looking for a little
> advice/direction/feedback on what I am trying to do. I want to index and
> query millions of documents that are unstructured and resemble
> crime/police/psychiatric reports; no problem, Lucene is perfect for this.
> The trick is that I need to exclude certain terms from the index, such as
> terms that are negated or information that could potentially identify
> people. I have a collection of natural language processing tools that are
> able to tag or remove/replace such terms.
> I need to design the indexing such that I can feed each document through
> these tools and then incorporate the results into the indexing strategy.
> As an example, if I have a report that has the phrase: "Mr. Smith has no
> history of violence against women prior to this event"
> The NLP engine would recognize the name Smith and the negation of the term
> "violence" and would tag them as such. I would then like to exclude those
> terms from the indexing as seems prudent.
> Another strategy I would like to look at is to include the tags in the index
> to incorporate them into the search engine. That is to say, whether a subject
> "likely" has a history of violence, "may" have a history of violence, or
> "does not" have a history of violence.
> I assume that I will need to design a custom analyzer to do this, but I was
> hoping to solicit any comments, advice, or general suggestions before I get
> started.
> Thanks in advance,
> j

Grant Ingersoll
