lucene-java-user mailing list archives

From John Byrne <john.by...@propylon.com>
Subject Re: storing position - keyword
Date Thu, 06 Mar 2008 09:27:17 GMT
"To confuse matters more, it is not really a matter of synonyms, as the 
original term is discarded from the index and there is only one mapped term"

I'm not sure I fully understand this: am I right in thinking that you 
will be searching using these controlled vocabulary words, and that the 
search must then find any of the ordinary words which map to the 
controlled vocabulary words, and highlight them?

Because if that's the case, I think it's relatively simple: you create a 
separate index that only maps the controlled vocabulary to the 
ordinary words. That's your "synonyms" index. Then you index your 
target documents as normal. When you search, you first look up your 
search term in the synonyms index. So, following your example, if 
you looked up "dog" in the synonyms index, you'd get back "chien", "canis" 
and "cane". (Achieving this part is easy: you just keep adding 
"synonyms" to the field at the same position.) Whether or not the 
returned list also contains the original "dog" is up to you when you 
create your synonyms index. (In a typical synonym ring, the original 
word would have to be in there, because you don't know which word will 
be used to search.)
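
To make the "synonym ring" idea concrete, here is a rough sketch in plain 
Java. It uses an in-memory map rather than a real Lucene index, and the 
class and method names (SynonymRing, addGroup, lookup) are made up for 
illustration. Every member of a group maps to the whole group, so it 
doesn't matter which word is used to search.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the "synonyms" index.
class SynonymRing {
    private final Map<String, Set<String>> ring = new HashMap<>();

    // Register a group of equivalent words; each member maps to the full group.
    // Assumption: a word belongs to at most one group (a later group would
    // overwrite an earlier one for that word).
    void addGroup(String... words) {
        Set<String> group = new LinkedHashSet<>(Arrays.asList(words));
        for (String word : group) {
            ring.put(word, group);
        }
    }

    // Return the whole group for a word, or just the word itself if unmapped.
    Set<String> lookup(String word) {
        Set<String> group = ring.get(word);
        return group == null ? Collections.singleton(word) : group;
    }
}
```

With addGroup("dog", "chien", "canis", "cane"), a lookup on any of the 
four words returns all four, including the original "dog".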

Now all you have to do is combine those returned terms as Boolean OR 
clauses in a single BooleanQuery, and search on the main index. You'll 
find all documents containing any of those three words, and you can use the 
highlighting code from the Lucene contrib projects to highlight them.
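
The expansion step might look roughly like this. It is only a sketch: the 
map stands in for the synonyms index, the field name "body" is assumed, 
and QueryExpansion/expand are illustrative names. It builds a query 
string in Lucene's query syntax rather than real query objects; with 
Lucene itself you would add one TermQuery per term to a BooleanQuery 
using BooleanClause.Occur.SHOULD.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryExpansion {
    // Hypothetical stand-in for the synonyms index: search term -> mapped terms.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("dog", Arrays.asList("chien", "canis", "cane"));
    }

    // Expand a search term into an OR query over all of its mapped terms.
    // A term with no mapping is searched as-is.
    static String expand(String field, String term) {
        List<String> terms =
            SYNONYMS.getOrDefault(term, Collections.singletonList(term));
        StringBuilder query = new StringBuilder();
        for (String t : terms) {
            if (query.length() > 0) {
                query.append(" OR ");
            }
            query.append(field).append(':').append(t);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // prints: body:chien OR body:canis OR body:cane
        System.out.println(expand("body", "dog"));
    }
}
```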

Does this help? Forgive me if I've misunderstood or underestimated the 
problem!

Regards,
-John

1world1love wrote:
> First off Karl, thanks for your reply and your time.
>
>
>
> karl wettin-3 wrote:
>   
>> One could also say you are classifying your data based on keywords in
>> the text?
>>
>>     
>
> I probably didn't explain myself very well or more specifically provide a
> good example. In my case, there really isn't any relationship between the
> mapped terms per document. That is to say that an individual term or phrase
> in the document is mapped to a concrete concept in a controlled vocabulary.
> The concept doesn't represent a class of anything and no relationship exists
> between the concepts. They would never be grouped by any means. It is more a
> matter of replacing some arbitrary word or phrase with an adjudicated
> version.
>
> The example I gave did in fact use classifications for the terms, but that
> is not exactly the point that I was trying to convey. I suppose a better
> example would be where each term or phrase in the sentence mapped to an
> equivalent in another language:
>
> dog -> canis
> dog -> cane
> dog -> chien
>
> So that if you searched for "canis", then any document with "dog" would be
> returned (unless the context inferred that dog meant something else). By the
> same token, if the text was "here we go" or "let's go", then it may map to
> "vamos" or "vamonos".
>
> To confuse matters more, it is not really a matter of synonyms, as the
> original term is discarded from the index and there is only one mapped term
> per original term or phrase and the algorithm determines the controlled
> meaning from the context.
>
>
> karl wettin-3 wrote:
>   
>> You can always store values in a field, but the term and the stored
>> value is not coupled. Thus you would need to store the positions per
>> document in each field in machine readable format you then parse:
>>
>> doc.add(new Field("f", "keyword:12,32;54,32", Field.Store.YES, ..
>>
>> But that is a way expensive solution.
>>
>>     
>
> Indeed, though doesn't an analyzed field have some other information attached
> to it?
>
> Forgive me if this is a naive question. I am fairly new to Lucene.
>
>
> karl wettin-3 wrote:
>   
>>  
>>
>> This is known as faceted classification.
>>
>> <http://en.wikipedia.org/wiki/Faceted_classification>
>> <http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44>
>>
>>     
>
> Again, I am not overly familiar with these disciplines, but I always thought
> of facets as an organizational strategy. As I said, my example betrayed me a
> bit, as I am not that interested in organizing these documents, rather
> providing a controlled vocabulary from which to search as opposed to any
> random text.
>
>
>
> karl wettin-3 wrote:
>   
>> Are you aware of the highlighter contrib module?
>>
>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/
>>
>> The simplest solution is to add a new facet Term per classification in text
>> and use the text start and end positions of the text field, and have the
>> highlighter load the text and highlight this text field.
>>
>>     
>
> This is actually not a web based application and the highlighting would
> really only be used for analyzing performance of the mapping algorithms. The
> main issue is that we do need to be able to provide the location of the
> original term for each mapped keyword.
>
>
>
> karl wettin-3 wrote:
>   
>> Matching a document with the same terms occurring multiple times will
>> cause a greater score than it only occurring once. This is probably
>> problematic for you.
>>
>>     
>
> It may not be that big of an issue.
>
>
> karl wettin-3 wrote:
>   
>> Instead you could add a single Term, ignore the built in positions and
>> store them for all positions in the payload of that single Term.
>>
>>
>> for (String facet : facets) {
>>    doc.addField(
>>        "f", new SingleTokenTokenStream(
>>            facet, new Payload(offsets.toByteArray())
>>        )
>>    );
>> }
>>
>> (This is dry coded, you will need to implement some of those things.)
>>
>> You also need to modify the highlighter so it can read this data.
>>
>>
>>     
>
> Something like this seems like it might work well for my purposes. I will
> look at this further.
>
> Thanks again,
>
> J
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org