lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shashi Kant <>
Subject Re: Similarity
Date Tue, 23 Jun 2009 09:50:16 GMT
I suspect what you are looking for is "Latent Semantics" - it can
algorithmically infer that "iPod~iPhone" or "Apple~Steve Jobs". Google for
"Latent Semantic Indexing" or "Latent Semantic Analysis" - you can apply
some of those approaches using the TermVectors in Lucene index.
Ontologies such as WordNet are very generic, hence if you have a domain
specific corpus, you would need to generate some kind of Latent Semantic
Index to extract the relations therein.

On Tue, Jun 23, 2009 at 5:27 AM, Cool The Breezer

> Of the late I started using Lucene as main search library for all documents
> in our intranet. It works extremely well. I am trying to use similarity
> kinda functionality to find similarity between two sentences/documents and
> trying to use Wordnet in our searching solution. I have used wordnet contrib
> package and it really works well to expand queries with synonyms and get
> results. But I can get handicap when searching for documents with query like
> "Steve Jobs" and documents containing "apple" should be returned. In the
> same way "pirated" and "willfull downloading copyrighted material". This
> comes finding meaning of a word wrt its context. Has anybody done any kind
> of such context based indexing that means while tokenization based on
> context of each word/token and searching the same after expanding the query
> using synonyms. I have come across some sf projects like
>  to semantically relating words
> using wordnet but I am
>  still kinda confused on how to move ahead with such kind of context based
> search. Appreciate your help. I understand that this might not be directly
> related to Lucene but somehow this falls in the same domain search solution.
> - RB
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message