lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: storing position - keyword
Date Thu, 06 Mar 2008 00:52:59 GMT
1world1love wrote:
> Greetings all. I am indexing a set of documents where I am extracting terms
> and mapping them to a controlled vocabulary and then placing the matched
> vocabulary in a keyword field.

One could also say you are classifying your data based on keywords in
the text?

> What I want to know is if there is a way to store the original term location
> with the keyword field?

You can always store values in a field, but the term and the stored
value are not coupled. You would therefore need to store the positions
per document in each field, in a machine-readable format that you then
parse:

doc.add(new Field("f", "keyword:12,32;54,32", Field.Store.YES, ..

But that is a rather expensive solution.
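As a rough illustration of the parsing step, here is a minimal sketch in plain Java. It assumes the stored value always has the shape "keyword:a,b;c,d" shown above; the class name StoredPositions and the reading of each pair as two integers are assumptions made for the example, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses stored values shaped like "keyword:12,32;54,32". */
final class StoredPositions {
    final String keyword;
    final List<int[]> pairs = new ArrayList<>();

    private StoredPositions(String keyword) {
        this.keyword = keyword;
    }

    static StoredPositions parse(String stored) {
        // Split on the last ':' so the keyword itself may contain colons.
        int colon = stored.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in " + stored);
        }
        StoredPositions result = new StoredPositions(stored.substring(0, colon));
        String tail = stored.substring(colon + 1);
        if (tail.isEmpty()) {
            return result; // keyword without any positions
        }
        for (String pair : tail.split(";")) {
            String[] parts = pair.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("bad pair: " + pair);
            }
            result.pairs.add(new int[] {
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim())
            });
        }
        return result;
    }

    public static void main(String[] args) {
        StoredPositions p = parse("keyword:12,32;54,32");
        for (int[] pair : p.pairs) {
            System.out.println(p.keyword + " -> " + pair[0] + "," + pair[1]);
        }
    }
}
```

The cost comes from having to load and parse the stored value of every matching document at search time.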

> Example Text: "The quick brown fox jumped over the lazy dog" -->
> 
> Controlled Vocabulary Terms: "physical activity", "exercise", "sedentary
> lifestyle", "canine"
> 
> I am storing these controlled terms in a keyword field so they are stored
> and searchable exactly. 

This is known as faceted classification.

<http://en.wikipedia.org/wiki/Faceted_classification>
<http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44>

> What I would like to be able to do is to highlight the context of the
> original term or phrase that is associated with a mapped term. So in the
> example above, if the controlled term is "sedentary lifestyle", I would like
> to highlight "lazy".
>
> There can be multiple mapped terms for an original term or phrase.
>
> The algorithm that handles the mapping provides the start and end
> position of the original text


Are you aware of the highlighter contrib module?

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

The simplest solution is to add a new facet Term per classification in
the text, give it the start and end offsets of the original text in the
text field, and have the highlighter load the text and highlight that
text field.
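To make the offset idea concrete, here is a minimal sketch in plain Java of what highlighting by start and end offsets amounts to. It is not the contrib highlighter API; the class OffsetHighlighter, the <b> markup and the exclusive end offset are assumptions made for the example.

```java
/** Hypothetical helper: wraps a span of the original text in markup. */
final class OffsetHighlighter {

    /** Highlights text[start, end); the end offset is exclusive. */
    static String highlight(String text, int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException("bad offsets: " + start + "," + end);
        }
        return text.substring(0, start)
            + "<b>" + text.substring(start, end) + "</b>"
            + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        // In practice the mapping algorithm would supply these offsets.
        int start = text.indexOf("lazy");
        System.out.println(highlight(text, start, start + "lazy".length()));
    }
}
```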

A document in which the same term occurs multiple times will get a
greater score than one in which it occurs only once. This is probably
problematic for you.

Instead you could add a single Term per facet, ignore the built-in
positions, and store all of the original text positions in the payload
of that single Term.


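for (String facet : facets) {
   // One token per facet; the payload carries all offsets of the
   // original text that mapped to this facet.
   doc.add(new Field(
       "f", new SingleTokenTokenStream(
           facet, new Payload(offsets.toByteArray())
       )
   ));
}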

(This is dry-coded; you will need to implement some of these things
yourself.)

You also need to modify the highlighter so it can read this data.
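One of the pieces left to implement is turning the offsets into the byte array that goes into the payload, and reading them back on the highlighter side. A minimal sketch in plain Java, assuming each occurrence is a (start, end) pair written as two 32-bit integers; the class OffsetPayloadCodec and this byte layout are assumptions for the example, not something Lucene prescribes.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec for packing (start, end) offset pairs into payload bytes. */
final class OffsetPayloadCodec {

    /** Encodes each {start, end} pair as two big-endian 32-bit ints. */
    static byte[] encode(int[][] offsets) {
        ByteBuffer buf = ByteBuffer.allocate(offsets.length * 8);
        for (int[] pair : offsets) {
            buf.putInt(pair[0]);
            buf.putInt(pair[1]);
        }
        return buf.array();
    }

    /** Decodes bytes produced by encode back into {start, end} pairs. */
    static int[][] decode(byte[] data) {
        if (data.length % 8 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 8");
        }
        ByteBuffer buf = ByteBuffer.wrap(data);
        int[][] offsets = new int[data.length / 8][2];
        for (int[] pair : offsets) {
            pair[0] = buf.getInt();
            pair[1] = buf.getInt();
        }
        return offsets;
    }
}
```

The result of encode is what would be handed to the Payload constructor when indexing, and a modified highlighter would call decode on the payload bytes it reads for the matching term.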



    karl

