Message-ID: <219512.55952.qm@web52904.mail.re2.yahoo.com>
Date: Tue, 6 Jul 2010 06:40:54 -0700 (PDT)
From: Ahmet Arslan <iorixxx@yahoo.com>
Subject: Re: multi-term synonym expansion
To: java-user@lucene.apache.org
In-Reply-To: <14B9ED6F-AEFC-400A-9774-33F6F4D1B0DF@univie.ac.at>

> My custom SKOSAnalyzer already performs synonym expansion
> based on the labels defined in a given SKOS model. But now I
> have the problem that real-world thesauri often define
> (multi-term) synonyms for multi-term words. Here is an
> example that defines the abbreviation "UN" as a synonym for
> "United Nations":
>
> <skos:Concept rdf:about="http://www.cs.univie.ac.at/thesaurus/concept/6">
>   <skos:prefLabel>United Nations</skos:prefLabel>
>   <skos:altLabel>UN</skos:altLabel>
> </skos:Concept>
>
> In the end, the analyzer should add the term UN at the right
> position in the index. Taking the example above, a sentence
> "I work for the United Nations" should appear in the index
> as
>
> 2: [work: 2->6]
> 5: [united nations: 15->29] [un: 15->29]
>
> ...so that a query "I work for the UN" also matches the
> document.
>
> What is the best way to implement that? With a
> TokenFilter I can work through the sentence token by token
> (using incrementToken()) and check if there is a synonym
> available. How can I analyze token sequences in a given
> text? Do I need to implement a custom tokenizer that
> recognizes entities based on a given dictionary?
>
> I am grateful for any suggestions or advice.

Solr's SynonymFilterFactory can handle multi-word synonyms. This may help:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
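
On the "how can I analyze token sequences" part: a filter that matches multi-word entries has to look ahead over several tokens before deciding what to emit. Below is a rough sketch of that lookahead in plain Java, with no Lucene dependency. It is only an illustration of the idea, not how SynonymFilterFactory is implemented. The names MultiTermSynonymSketch, Tok and expand are made up for this example, and the choice to stack the synonym on the first token of the matched phrase (position increment 0) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: greedy longest-match lookup of multi-word synonyms
// over an already tokenized, lowercased sentence.
final class MultiTermSynonymSketch {

    /** A term plus its position increment; 0 means "same position as the previous token". */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /**
     * Keeps every original token and, wherever a phrase of up to maxPhraseLen
     * tokens is a key in synonyms, adds the synonym at the position of the
     * phrase's first token.
     */
    static List<Tok> expand(List<String> tokens, Map<String, String> synonyms, int maxPhraseLen) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matchedLen = 0;
            String synonym = null;
            // Look ahead, trying the longest candidate phrase first.
            for (int len = Math.min(maxPhraseLen, tokens.size() - i); len >= 1; len--) {
                String phrase = String.join(" ", tokens.subList(i, i + len));
                String hit = synonyms.get(phrase);
                if (hit != null) {
                    matchedLen = len;
                    synonym = hit;
                    break;
                }
            }
            if (synonym == null) {
                // No phrase starts here: pass the token through unchanged.
                out.add(new Tok(tokens.get(i), 1));
                i++;
            } else {
                // First token of the phrase, then the synonym stacked on it.
                out.add(new Tok(tokens.get(i), 1));
                out.add(new Tok(synonym, 0));
                // Remaining tokens of the matched phrase keep their own positions.
                for (int k = 1; k < matchedLen; k++) {
                    out.add(new Tok(tokens.get(i + k), 1));
                }
                i += matchedLen;
            }
        }
        return out;
    }
}
```

Calling expand on the tokens of "i work for the united nations" with the single entry "united nations" -> "un" and maxPhraseLen = 2 keeps all six original tokens and adds "un" at the position of "united". In a real TokenFilter the lookahead would presumably be done by buffering tokens with captureState()/restoreState() and setting the increment through PositionIncrementAttribute, but I have not tried that here.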
