lucene-java-user mailing list archives

From "Mike O'Leary" <tm-ole...@comcast.net>
Subject Indexing single words and marked phrases
Date Sat, 03 Mar 2007 00:32:07 GMT
I am working on a project with a team that is developing a named entity
recognizer. I need to configure Lucene indexing so that it indexes the
individual words in the text that the named entity recognizer outputs as
well as the phrases that it marks. For example, in a string like

<BODY>"The population of <CITY>New York City</CITY> is not as large as that
of <CITY>Mexico City</CITY>."</BODY>

I would want to index the words "New", "York", "City" and "Mexico" as well
as the phrases "New York City" and "Mexico City", along with the other
non-stopwords in the string, in a field labeled "BODY". Would it be better
to write an Analyzer that can do this, or to adjust the XML parser that I am
using so that it adds the text within tag pairs like <CITY> and </CITY> to
the field that corresponds to the tag that is one level up? If it is better
to write an Analyzer, could someone point me to information on how to do
this? Thanks.
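
To make the desired result concrete, here is a rough sketch in plain Java of
the list of terms I would like to end up with in the BODY field. It does not
use any Lucene classes. The class name MarkedPhraseTerms, the method terms()
and the short stopword list are all made up for illustration, and it assumes
the XML parser has already stripped the outer <BODY> tag and the quotes,
leaving only the inner entity tags such as <CITY> in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: lists the terms the BODY field should contain.
final class MarkedPhraseTerms {

    // Tiny illustrative stopword list (an assumption, not Lucene's default set).
    private static final Set<String> STOPWORDS = new HashSet<String>(
            Arrays.asList("the", "of", "is", "not", "as", "that"));

    // Either a tagged entity such as <CITY>New York City</CITY> (groups 1 and 2)
    // or a bare word outside any tag (group 3). Nested tags are not handled.
    private static final Pattern TOKEN =
            Pattern.compile("<(\\w+)>([^<]*)</\\1>|(\\w+)");

    static List<String> terms(String body) {
        List<String> out = new ArrayList<String>();
        Matcher m = TOKEN.matcher(body);
        while (m.find()) {
            if (m.group(3) != null) {
                // Plain word outside a tag pair.
                addWord(out, m.group(3));
                continue;
            }
            String phrase = m.group(2).trim();
            if (phrase.isEmpty()) {
                continue;
            }
            // Emit each word of the marked phrase, then the phrase itself.
            String[] words = phrase.split("\\s+");
            for (String word : words) {
                addWord(out, word);
            }
            if (words.length > 1) {
                out.add(phrase);
            }
        }
        return out;
    }

    // Adds a single word unless it is a stopword (compared case-insensitively).
    private static void addWord(List<String> out, String word) {
        if (!STOPWORDS.contains(word.toLowerCase())) {
            out.add(word);
        }
    }

    public static void main(String[] args) {
        System.out.println(terms("The population of <CITY>New York City</CITY> "
                + "is not as large as that of <CITY>Mexico City</CITY>."));
    }
}
```

For the example string this prints "population", "New", "York", "City",
"New York City", "large", "Mexico", "City" and "Mexico City", in that order.
Whether each phrase should be a single term in the same field, as sketched
here, or should go into a separate field is part of what I am asking.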

Mike O'Leary

