lucene-solr-user mailing list archives

From "Eswar K" <kja.es...@gmail.com>
Subject Re: LSA Implementation
Date Wed, 28 Nov 2007 09:17:59 GMT
Lance,

It does cover European languages, but pretty much nothing on Asian languages
(CJK).

- Eswar

On Nov 28, 2007 1:51 AM, Norskog, Lance <lance@divvio.com> wrote:

> WordNet itself is English-only. There are various ontology projects for
> it.
>
> http://www.globalwordnet.org/ is a separate world language database
> project. I found it at the bottom of the WordNet wikipedia page. Thanks
> for starting me on the search!
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:kja.eswar@gmail.com]
> Sent: Monday, November 26, 2007 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> The languages also include CJK :) among others.
>
> - Eswar
>
> On Nov 27, 2007 8:16 AM, Norskog, Lance <lance@divvio.com> wrote:
>
> > The WordNet project at Princeton (USA) is a large database of
> > synonyms. If you're only working in English this might be useful
> > instead of running your own analyses.
> >
> > http://en.wikipedia.org/wiki/WordNet
> > http://wordnet.princeton.edu/
> >
> > Lance
> >
> > -----Original Message-----
> > From: Eswar K [mailto:kja.eswar@gmail.com]
> > Sent: Monday, November 26, 2007 6:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: LSA Implementation
> >
> > In addition to recording which keywords a document contains, the
> > method examines the document collection as a whole, to see which
> > other documents contain some of those same words. This algo should
> > consider documents that have many words in common to be semantically
> > close, and ones with few words in common to be semantically distant.
> > This simple method correlates surprisingly well with how a human
> > being, looking at content, might classify a document collection.
> > Although the algorithm doesn't understand anything about what the
> > words *mean*, the patterns it notices can make it seem astonishingly
> > intelligent.
> >
> > When you search such an index, the search engine looks at the
> > similarity values it has calculated for every content word, and
> > returns the documents that it thinks best fit the query. Two
> > documents may be semantically very close even if they do not share a
> > particular keyword, so where a plain keyword search will fail if
> > there is no exact match, this algo will often return relevant
> > documents that don't contain the keyword at all.
> >
> > - Eswar
> >
> > On Nov 27, 2007 7:51 AM, Marvin Humphrey <marvin@rectangular.com> wrote:
> >
> > >
> > > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> > >
> > > > We essentially are looking at having an implementation for doing
> > > > search which can return documents having conceptually similar
> > > > words without necessarily having the original word searched for.
> > >
> > > Very challenging.  Say someone searches for "LSA" and hits an
> > > archived version of the mail you sent to this list.  "LSA" is a
> > > reasonably discriminating term.  But so is "Eswar".
> > >
> > > If you knew that the original term was "LSA", then you might look
> > > for documents near it in term vector space.  But if you don't know
> > > the original term, only the content of the document, how do you
> > > know whether you should look for docs near "lsa" or "eswar"?
> > >
> > > Marvin Humphrey
> > > Rectangular Research
> > > http://www.rectangular.com/
> > >
> > >
> > >
> >
>
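
The co-occurrence idea described in the thread above (documents that share many words are treated as close, and a search can surface documents that lack the keyword) can be sketched in a few lines of Python. This is only a rough illustration and not LSA proper: real LSA applies a singular value decomposition to the term-document matrix, whereas this sketch simply expands the direct keyword hits with documents whose word-count vectors have a high cosine similarity to one of them. The function names, the sample documents, and the 0.3 threshold are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer would do much more."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(keyword, docs, threshold=0.3):
    """Return (direct, related) lists of document ids.

    direct:  documents that contain the keyword itself.
    related: documents without the keyword that share enough vocabulary
             with at least one direct hit (hypothetical threshold).
    """
    vectors = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    direct = [d for d, v in vectors.items() if keyword.lower() in v]
    related = [
        d
        for d, v in vectors.items()
        if d not in direct
        and any(cosine(v, vectors[hit]) >= threshold for hit in direct)
    ]
    return direct, related


# Made-up sample collection, for illustration only.
docs = {
    "a": "car engine repair manual",
    "b": "automobile engine repair guide",
    "c": "chocolate cake recipe",
}

direct, related = search("car", docs)
print(direct, related)  # ['a'] ['b']
```

With these three sample documents, a search for "car" finds document "a" directly and also surfaces "b", which never mentions "car" but shares "engine" and "repair" with it. Document "c" shares no words with the direct hit and is left out.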
