lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject RE: Indexing synonyms
Date Mon, 11 Nov 2002 19:22:35 GMT
I always thought that WordNet was not accessible to the general public.
Wrong?

Also, I'm curious - what would you use for storing synonyms?
Are you considering a 'static', read-only Lucene index, maybe?
An index that uses setPositionIncrement(0) calls to store synonyms like
this, for instance:

Document1:
  word: word1
        word1synonym1
        word1synonym2
        word1synonym3

...

DocumentN:
  word: wordN
        wordNsynonym1
        wordNsynonym2
        wordNsynonym3


Unless I am missing something, and if a synonym database is available,
this would be pretty easy to implement, no?
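
To make the layout concrete, here is a rough sketch in plain Java. It does not use Lucene itself: PosToken and SynonymDocs are made-up names that only model a term plus its position increment, and it assumes positions are resolved by summing increments, with the first token landing on position 0. The head word gets an increment of one and every synonym gets an increment of zero, so they all end up at the same position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a token: a term plus its position increment. */
final class PosToken {
    final String term;
    final int positionIncrement;

    PosToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

/** Hypothetical helper (not a Lucene class) that lays out one synonym "document". */
final class SynonymDocs {
    /** The head word advances the position by one; each synonym stays on that position. */
    static List<PosToken> tokensFor(String word, List<String> synonyms) {
        List<PosToken> out = new ArrayList<PosToken>();
        out.add(new PosToken(word, 1));
        for (String synonym : synonyms) {
            out.add(new PosToken(synonym, 0));
        }
        return out;
    }

    /** Sums the increments to get absolute positions (assumed to start at 0). */
    static Map<String, Integer> positions(List<PosToken> tokens) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        int position = -1;
        for (PosToken token : tokens) {
            position += token.positionIncrement;
            result.put(token.term, position);
        }
        return result;
    }
}
```

With word1 and its three synonyms, all four terms resolve to position 0, which is what would let a lookup on any of them hit the same slot in the document.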

Otis





--- "Spencer, Dave" <dave@lumos.com> wrote:
> Re "reducing the set of question/answer pairs to consider" below: I
> would expect that using synonyms, either in the index or in the
> reformed query, would (annoyingly) increase the number of potential
> matches. Or is there something I'm missing?
> 
> Interesting that this topic just came up, as I wanted to experiment
> w/ the same thing. My first stab at a public domain synonym
> list, the "moby" list, didn't seem to have synonyms, however.
> I believe another poster mentioned WordNet, so I'll try that.
> 
> I'd really like it if it were possible to determine synonyms
> automatically - maybe with something similar to Latent Semantic
> Analysis - but such things seem kinda hard to code up...
> 
> 
> -----Original Message-----
> From: Aaron Galea [mailto:agale@nextgen.net.mt]
> Sent: Sunday, November 10, 2002 4:18 PM
> To: Lucene Users List; lists@lissus.com
> Subject: Re: Indexing synonyms
> 
> 
> Thanks for all your replies,
> 
> Well, I will start off with an idea of what I am trying to achieve. I am
> building a question-answering system, and one of its modules is an FAQ
> module. Since the QA system is concerned with education, users can
> concentrate their question on a particular subject, reducing the set of
> question/answer pairs to consider. Because of this hierarchical indexing
> the index files are not that big, so I can store synonyms for each word
> in a question in the index. Query expansion would solve the problem and
> eliminate the need to store synonyms in the index, but it would slow
> things down, as there is no depth limit to consider for term expansion.
> It is not my intention to build something similar to the FAQFinder
> system, but I want to further reduce the subset of questions to which a
> question reformulation algorithm would be applied. Therefore the idea is
> to take an FAQ file dealing with one education subject, index all of its
> questions and expand each term in the question. Using Lucene I will
> retrieve the questions that are likely to be similar to a user question,
> select, say, the top 5 and apply a query reformulation algorithm. If
> this succeeds, fine, and I return the answer to the user; otherwise I
> submit the question to an answer extraction module. The most important
> thing is speed, so putting term expansion in the index should hopefully
> improve things. Obviously problems arise with this method, as there is
> no word sense disambiguation, but the query reformulation algorithm will
> solve this. However, it is slow, so I must reduce the number of
> questions it is applied to. It is a tradeoff!!!
> 
> Well, I managed to solve this by overriding the next() method: when it
> gets to an EOS I start returning the new expanded terms that I
> accumulated in a list.
> 
> Thanks everyone for your reply!!!!
> 
> Aaron
> 
> NB : And yep I am a Malteser Otis ! :)
> 
> 
> ----- Original Message -----
> From: "Alex Murzaku" <lists@lissus.com>
> To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
> Sent: Monday, November 11, 2002 12:17 AM
> Subject: RE: Indexing synonyms
> 
> 
> > You could also do something with org.apache.lucene.analysis.Token,
> > which includes the following self-explanatory note:
> >
> >   /** Set the position increment.  This determines the position of this
> >    * token relative to the previous Token in a {@link TokenStream}, used
> >    * in phrase searching.
> >    *
> >    * <p>The default value is one.
> >    *
> >    * <p>Some common uses for this are:<ul>
> >    *
> >    * <li>Set it to zero to put multiple terms in the same position.  This
> >    * is useful if, e.g., a word has multiple stems.  Searches for phrases
> >    * including either stem will match.  In this case, all but the first
> >    * stem's increment should be set to zero: the increment of the first
> >    * instance should be one.  Repeating a token with an increment of zero
> >    * can also be used to boost the scores of matches on that token.
> >    *
> >    * <li>Set it to values greater than one to inhibit exact phrase
> >    * matches.  If, for example, one does not want phrases to match across
> >    * removed stop words, then one could build a stop word filter that
> >    * removes stop words and also sets the increment to the number of stop
> >    * words removed before each non-stop word.  Then exact phrase queries
> >    * will only match when the terms occur with no intervening stop words.
> >    *
> >    * </ul>
> >    * @see TermPositions
> >    */
> >   public void setPositionIncrement(int positionIncrement) {
> >     if (positionIncrement < 0)
> >       throw new IllegalArgumentException
> >         ("Increment must be positive: " + positionIncrement);
> >     this.positionIncrement = positionIncrement;
> >   }
> >
> >
> > --
> > Alex Murzaku
> > ___________________________________________
> >  alex(at)lissus.com  http://www.lissus.com
> >
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> > Sent: Sunday, November 10, 2002 1:30 PM
> > To: Lucene Users List
> > Subject: Re: Indexing synonyms
> >
> >
> > .mt?  Malta?  That's rare! :)
> >
> > A person called Clemens Marschner just submitted diffs for query
> > rewriting to the lucene-dev list 1-2 weeks ago.  The diffs are not in
> > CVS yet, and they are a bit old now because the code they were made
> > against has changed since they were made. You could either try applying
> > them yourself, or wait until they get applied and then get a nightly
> > build.
> >
> > Otis
> >
> > --- Aaron Galea <agale@nextgen.net.mt> wrote:
> > > Hi everyone,
> > >
> > > I need to create a filter that extends TokenFilter, whose purpose is
> > > to generate some synonyms for words in the document using WordNet.
> > > Searching for synonyms using WordNet is not that problematic, but
> > > I need to add the synonym words to the Lucene TokenStream before they
> > > are passed for indexing. However, the TokenStream class does not
> > > support any add method. Has anyone ever needed to do this? Can someone
> > > suggest an alternative way to add some synonym words to the index?
> > >
> > > Thanks
> > > Aaron
> > >
> >
> >
> > __________________________________________________
> > Do you Yahoo!?
> > U2 on LAUNCH - Exclusive greatest hits videos
> > http://launch.yahoo.com/u2
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
> > ---
> > [This E-mail was scanned for spam and viruses by NextGen.net.]
> >
> >
> >
> 
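
For what it's worth, the filter idea from the quoted thread (overriding next() and handing back synonyms collected in a list) could be sketched roughly like this. This is plain Java, not Lucene code: Tok, TokSource and SynonymInjector are hypothetical names, and next() returning null at end of stream is an assumption of the sketch. It is also a variation on what Aaron describes: instead of waiting for end of stream, the queued synonyms come out right after the word they belong to, with an increment of zero as in the note Alex quoted, so they share the word's position.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a token: term text plus position increment. */
final class Tok {
    final String term;
    final int increment;

    Tok(String term, int increment) {
        this.term = term;
        this.increment = increment;
    }
}

/** Hypothetical stand-in for a token stream; next() returns null at end of stream. */
interface TokSource {
    Tok next();
}

/** Wraps another source and injects synonyms at the same position as the original word. */
final class SynonymInjector implements TokSource {
    private final TokSource input;
    private final Map<String, List<String>> synonyms;
    private final Deque<Tok> pending = new ArrayDeque<Tok>();

    SynonymInjector(TokSource input, Map<String, List<String>> synonyms) {
        this.input = input;
        this.synonyms = synonyms;
    }

    @Override
    public Tok next() {
        // First hand out any synonyms queued for the previously returned word.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        Tok token = input.next();
        if (token == null) {
            return null; // end of the wrapped stream
        }
        List<String> found = synonyms.get(token.term);
        if (found != null) {
            for (String synonym : found) {
                pending.add(new Tok(synonym, 0)); // increment 0: same position as the word
            }
        }
        return token;
    }

    /** Example-only source over a fixed word list; every word gets an increment of one. */
    static TokSource fromWords(List<String> words) {
        final Iterator<String> it = words.iterator();
        return new TokSource() {
            @Override
            public Tok next() {
                return it.hasNext() ? new Tok(it.next(), 1) : null;
            }
        };
    }
}
```

Fed the words "a fast car" with "quick" and "rapid" registered as synonyms of "fast", the wrapper returns a, fast, quick, rapid, car, where only quick and rapid carry an increment of zero. The synonym map stands in for whatever WordNet lookup is actually used.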

