lucene-java-user mailing list archives

From 1world1love <jd_co...@yahoo.com>
Subject Re: storing position - keyword
Date Thu, 06 Mar 2008 02:29:20 GMT

First off Karl, thanks for your reply and your time.



karl wettin-3 wrote:
> 
> One could also say you are classifying your data based on keywords in
> the text?
> 

I probably didn't explain myself very well or, more specifically, provide a
good example. In my case, there really isn't any relationship between the
mapped terms per document. That is to say that an individual term or phrase
in the document is mapped to a concrete concept in a controlled vocabulary.
The concept doesn't represent a class of anything and no relationship exists
between the concepts. They would never be grouped by any means. It is more a
matter of replacing some arbitrary word or phrase with an adjudicated
version.

The example I gave did in fact use classifications for the terms, but that
is not exactly the point that I was trying to convey. I suppose a better
example would be where each term or phrase in the sentence mapped to an
equivalent in another language:

dog -> canis
dog -> cane
dog -> chien

So that if you searched for "canis", then any document with "dog" would be
returned (unless the context implied that dog meant something else). By the
same token, if the text was "here we go" or "let's go", then it may map to
"vamos" or "vamonos".

To confuse matters more, it is not really a matter of synonyms: the
original term is discarded from the index, there is only one mapped term
per original term or phrase, and the algorithm determines the controlled
meaning from the context.
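
As a rough illustration of the replacement described above, here is a minimal
sketch in plain Java (no Lucene). It is not the actual mapping algorithm, which
decides the controlled meaning from context; this version just does a
dictionary lookup, tries a two-word phrase before a single word, drops anything
unmapped, and keeps the character offsets of the original text. The class name
ControlledVocabularyMapper, the MappedTerm type and the vocabulary entries are
all made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: swaps source phrases for controlled terms and keeps the source offsets. */
class ControlledVocabularyMapper {

    /** A controlled term plus the character span of the original text it replaced. */
    static final class MappedTerm {
        final String concept;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        MappedTerm(String concept, int start, int end) {
            this.concept = concept;
            this.start = start;
            this.end = end;
        }
    }

    // Assumption: a "word" is a run of letters and apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\p{L}']+");

    static List<MappedTerm> map(String text, Map<String, String> vocabulary) {
        // Collect word spans first so two-word phrases can be tried before single words.
        List<int[]> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(new int[] {matcher.start(), matcher.end()});
        }

        List<MappedTerm> mapped = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            int start = words.get(i)[0];
            int consumed = 1;
            String concept = null;
            if (i + 1 < words.size()) {
                concept = vocabulary.get(normalize(text.substring(start, words.get(i + 1)[1])));
                if (concept != null) {
                    consumed = 2;
                }
            }
            if (concept == null) {
                concept = vocabulary.get(normalize(text.substring(start, words.get(i)[1])));
            }
            if (concept != null) {
                // Only the controlled term is kept; the original text survives as offsets.
                mapped.add(new MappedTerm(concept, start, words.get(i + consumed - 1)[1]));
            }
            i += consumed;
        }
        return mapped;
    }

    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        Map<String, String> vocabulary = new HashMap<>();
        vocabulary.put("let's go", "vamonos");
        vocabulary.put("dog", "canis");
        for (MappedTerm term : map("Let's go walk the dog", vocabulary)) {
            System.out.println(term.concept + " <- [" + term.start + "," + term.end + ")");
        }
    }
}
```

Each MappedTerm carries what the index would need: the controlled term to
search on, and where the original word or phrase sat in the document.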


karl wettin-3 wrote:
> 
> 
> You can always store values in a field, but the term and the stored
> value is not coupled. Thus you would need to store the positions per
> document in each field in machine readable format you then parse:
> 
> doc.addField("f", "keyword:12,32;54,32", Field.Store.YES, ..
> 
> But that is a way expensive solution.
> 

Indeed, though doesn't an analyzed field have some other information attached
to it?

Forgive me if this is a naive question. I am fairly new to Lucene.
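
For reference, reading back a stored value in the style Karl quotes above
("keyword:12,32;54,32") might look like the sketch below. This is plain Java
and only an illustration: the message does not say what the two numbers in each
pair mean, so treating them as start offset and length is an assumption, and
the StoredPositionParser and Span names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for a stored value such as "keyword:12,32;54,32". */
class StoredPositionParser {

    /** One "first,second" pair, read here as start offset and length (an assumption). */
    static final class Span {
        final int start;
        final int length;

        Span(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Returns everything before the last ':' in the stored value. */
    static String keyword(String stored) {
        return stored.substring(0, separatorIndex(stored));
    }

    /** Parses the ';'-separated pairs that follow the last ':'. */
    static List<Span> spans(String stored) {
        String body = stored.substring(separatorIndex(stored) + 1);
        List<Span> spans = new ArrayList<>();
        if (body.isEmpty()) {
            return spans;
        }
        for (String pair : body.split(";")) {
            String[] parts = pair.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed pair: " + pair);
            }
            spans.add(new Span(Integer.parseInt(parts[0].trim()),
                               Integer.parseInt(parts[1].trim())));
        }
        return spans;
    }

    // The last ':' is used so that a keyword may itself contain a colon.
    private static int separatorIndex(String stored) {
        int index = stored.lastIndexOf(':');
        if (index < 0) {
            throw new IllegalArgumentException("Missing ':' in " + stored);
        }
        return index;
    }

    public static void main(String[] args) {
        String stored = "keyword:12,32;54,32";
        for (Span span : spans(stored)) {
            System.out.println(keyword(stored) + " start=" + span.start + " length=" + span.length);
        }
    }
}
```

The cost Karl mentions is visible here: every hit means loading the stored
string for the document and parsing it again.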


karl wettin-3 wrote:
>  
> 
> This is known as faceted classification.
> 
> <http://en.wikipedia.org/wiki/Faceted_classification>
> <http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44>
> 

Again, I am not overly familiar with these disciplines, but I always thought
of facets as an organizational strategy. As I said, my example betrayed me a
bit, as I am not that interested in organizing these documents, but rather in
providing a controlled vocabulary from which to search, as opposed to any
random text.



karl wettin-3 wrote:
> 
> Are you aware of the highlighter contrib module?
> 
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/
> 
> The simplest solution is to add a new facet Term per classification in text
> and use the text start and end positions of the text field, and have the
> highlighter load the text and highlight this text field.
> 

This is actually not a web based application and the highlighting would
really only be used for analyzing performance of the mapping algorithms. The
main issue is that we do need to be able to provide the location of the
original term for each mapped keyword.



karl wettin-3 wrote:
> 
> Matching a document in which the same term occurs multiple times will
> cause a greater score than one in which it occurs only once. This is probably
> problematic for you.
> 

It may not be that big of an issue.


karl wettin-3 wrote:
> 
> Instead you could add a single Term, ignore the built in positions and
> store them for all positions in the payload of that single Term.
> 
> 
> for (String facet : facets) {
>    doc.addField(
>        "f", new SingleTokenTokenStream(
>            facet, new Payload(offsets.toByteArray())
>        )
>    );
> }
> 
> (This is dry coded, you will need to implement some of these things.)
> 
> You also need to modify the highlighter so it can read this data.
> 
> 

Something like this seems like it might work well for my purposes. I will
look at this further.
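
One piece of Karl's dry-coded snippet that can be sketched without Lucene is
the offsets.toByteArray() step: turning the original term locations into the
byte[] a payload would carry, and reading them back. The version below is an
assumption about one possible layout (start/end pairs, 4 bytes per int,
big-endian), not what any existing highlighter expects; OffsetPayloadCodec is a
made-up name.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical codec: packs (start, end) character offsets into a byte[] and back. */
class OffsetPayloadCodec {

    /** Encodes offsets laid out as [start0, end0, start1, end1, ...], 4 bytes per int. */
    static byte[] encode(int[] offsets) {
        if (offsets.length % 2 != 0) {
            throw new IllegalArgumentException("Offsets must come in start/end pairs");
        }
        // ByteBuffer is big-endian by default, so the layout is the same on every platform.
        ByteBuffer buffer = ByteBuffer.allocate(offsets.length * Integer.BYTES);
        for (int offset : offsets) {
            buffer.putInt(offset);
        }
        return buffer.array();
    }

    /** Decodes a byte[] produced by encode back into the flat start/end array. */
    static int[] decode(byte[] payload) {
        if (payload.length % (2 * Integer.BYTES) != 0) {
            throw new IllegalArgumentException("Payload length is not a whole number of pairs");
        }
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        int[] offsets = new int[payload.length / Integer.BYTES];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = buffer.getInt();
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two occurrences of the original term, at [18,21) and [40,43).
        byte[] payload = encode(new int[] {18, 21, 40, 43});
        System.out.println(payload.length + " bytes -> " + Arrays.toString(decode(payload)));
    }
}
```

With this layout a single Term per mapped keyword could carry all of its
original locations, at 8 bytes per occurrence.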

Thanks again,

J


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




-- 
View this message in context: http://www.nabble.com/storing-position---keyword-tp15855654p15865186.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

