lucene-java-user mailing list archives

From Bernhard Haslhofer <bernhard.haslho...@univie.ac.at>
Subject multi-term synonym expansion
Date Tue, 06 Jul 2010 13:02:25 GMT
Hi,

I am currently working on a Lucene module that makes use of controlled SKOS vocabularies (http://www.w3.org/TR/skos-primer/)
at index and search time. It should work similarly to Lucene's WordNet contrib module, but
with some additional SKOS-specific functionality (e.g., support for broader and narrower relationships).
Work is still very much in progress; first results are available here: https://code.google.com/p/lucene-skos/

My custom SKOSAnalyzer already performs synonym expansion based on the labels defined in a
given SKOS model. But now I have the problem that real-world thesauri often define synonyms
for multi-term words, and the synonyms themselves may consist of several terms. Here is an
example that defines the abbreviation "UN" as a synonym for "United Nations":

<skos:Concept rdf:about="http://www.cs.univie.ac.at/thesaurus/concept/6">
      <skos:prefLabel>United Nations</skos:prefLabel>
      <skos:altLabel>UN</skos:altLabel>
 </skos:Concept>
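
For illustration, the labels in such a fragment can be collected into a simple synonym map.
The sketch below uses only the JDK's DOM parser; it is not a real RDF parser and not the
actual lucene-skos code. The names SkosLabelReader, readSynonyms and SAMPLE are hypothetical,
and the wrapping rdf:RDF element with its namespace declarations is an assumption added so
that the fragment parses on its own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Lucene or lucene-skos.
class SkosLabelReader {
    static final String SKOS_NS = "http://www.w3.org/2004/02/skos/core#";

    // The concept from the example, wrapped in an assumed rdf:RDF root element.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
        + " xmlns:skos=\"" + SKOS_NS + "\">"
        + "<skos:Concept rdf:about=\"http://www.cs.univie.ac.at/thesaurus/concept/6\">"
        + "<skos:prefLabel>United Nations</skos:prefLabel>"
        + "<skos:altLabel>UN</skos:altLabel>"
        + "</skos:Concept></rdf:RDF>";

    /** Maps each lower-cased prefLabel to the lower-cased altLabels of the same concept. */
    static Map<String, List<String>> readSynonyms(String rdfXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse SKOS input", e);
        }
        Map<String, List<String>> synonyms = new LinkedHashMap<>();
        NodeList concepts = doc.getElementsByTagNameNS(SKOS_NS, "Concept");
        for (int i = 0; i < concepts.getLength(); i++) {
            Element concept = (Element) concepts.item(i);
            NodeList pref = concept.getElementsByTagNameNS(SKOS_NS, "prefLabel");
            if (pref.getLength() == 0) {
                continue; // nothing to key the synonyms on
            }
            String key = pref.item(0).getTextContent().trim().toLowerCase(Locale.ROOT);
            List<String> alts = synonyms.computeIfAbsent(key, k -> new ArrayList<>());
            NodeList alt = concept.getElementsByTagNameNS(SKOS_NS, "altLabel");
            for (int j = 0; j < alt.getLength(); j++) {
                alts.add(alt.item(j).getTextContent().trim().toLowerCase(Locale.ROOT));
            }
        }
        return synonyms;
    }

    public static void main(String[] args) {
        System.out.println(readSynonyms(SAMPLE)); // prints {united nations=[un]}
    }
}
```

Keying the map on the lower-cased preferred label is only one possible choice; an
implementation that should also map "UN" back to "United Nations" would need the reverse
direction as well.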

In the end, the analyzer should add the term "UN" at the right position in the index. Taking
the example above, the sentence "I work for the United Nations" should appear in the index as:


2: [work: 2->6]
5: [united nations: 15->29] [un: 15->29]

...so that a query "I work for the UN" also matches the document.

What is the best way to implement this? With a TokenFilter I can work through the sentence
token by token (using incrementToken()) and check whether a synonym is available for each
single token. But how can I analyze sequences of tokens in a given text? Do I need to
implement a custom tokenizer that recognizes entities based on a given dictionary?
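
To make the sequence matching concrete, here is a minimal sketch in plain Java of a greedy
longest-match lookahead over a buffered token list. It deliberately uses no Lucene classes,
so it says nothing about how this maps onto TokenFilter and incrementToken(). The class
MultiTermSynonymSketch, its Token type, the stop-word set and the maxPhraseLength parameter
are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, independent of Lucene's analysis API.
class MultiTermSynonymSketch {

    /** A term with its 1-based word position and character offsets in the text. */
    static final class Token {
        final String term;
        final int position, start, end;

        Token(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return position + ": [" + term + ": " + start + "->" + end + "]";
        }
    }

    /** Splits the text into lower-cased word tokens, recording positions and offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int position = 1;
        while (m.find()) {
            tokens.add(new Token(m.group().toLowerCase(Locale.ROOT), position++, m.start(), m.end()));
        }
        return tokens;
    }

    /** Emits tokens, replacing known phrases by one phrase token plus its synonyms. */
    static List<Token> expand(String text, Map<String, List<String>> synonyms,
                              Set<String> stopWords, int maxPhraseLength) {
        List<Token> in = tokenize(text);
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            int matched = 0;
            // Try the longest candidate phrase starting at i first, then shrink it.
            for (int len = Math.min(maxPhraseLength, in.size() - i); len >= 1 && matched == 0; len--) {
                StringBuilder phrase = new StringBuilder(in.get(i).term);
                for (int k = 1; k < len; k++) {
                    phrase.append(' ').append(in.get(i + k).term);
                }
                List<String> alts = synonyms.get(phrase.toString());
                if (alts != null) {
                    Token first = in.get(i);
                    Token last = in.get(i + len - 1);
                    // Phrase and synonyms share the first word's position and the full offsets.
                    out.add(new Token(phrase.toString(), first.position, first.start, last.end));
                    for (String alt : alts) {
                        out.add(new Token(alt, first.position, first.start, last.end));
                    }
                    matched = len;
                }
            }
            if (matched == 0) {
                // No phrase starts here: emit the single token unless it is a stop word.
                if (!stopWords.contains(in.get(i).term)) {
                    out.add(in.get(i));
                }
                matched = 1;
            }
            i += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms =
            Collections.singletonMap("united nations", Arrays.asList("un"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "for", "the"));
        for (Token t : expand("I work for the United Nations", synonyms, stopWords, 3)) {
            System.out.println(t);
        }
    }
}
```

With the example sentence this yields "work" at position 2 and both "united nations" and "un"
at position 5 with offsets 15->29. The sketch buffers the whole token list up front, which is
the part I do not know how to do cleanly in a streaming filter.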

I am grateful for any suggestions or advice.

Thank you,

Bernhard




______________________________________________________
Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna

Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
E-Mail: bernhard.haslhofer@univie.ac.at
WWW: http://www.cs.univie.ac.at/bernhard.haslhofer



