lucene-solr-user mailing list archives

From dan sutton <danbsut...@gmail.com>
Subject Re: Spellchecking and frequency
Date Wed, 28 Jul 2010 08:57:08 GMT
Hi Mark,

Thanks for that info, it looks very interesting, and it would be great to see
your code. Out of interest, did you use both the dictionary and the phonetic
file? Did you see better results with both?

Regarding the second part, checking the corpus for matching suggestions: would
another way to do this be to have an event listener listen for commits and
build the dictionary of matching corpus words at that point? That way you
avoid the performance hit at query time.
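
A minimal sketch of what I mean, in Python rather than Java just to keep it
short. The class and method names (CorpusDictionary, on_commit,
filter_suggestions) are made up for illustration and are not Solr API; the
assumption is that something calls on_commit with the indexed terms after each
commit:

```python
class CorpusDictionary:
    """Hypothetical helper holding the set of words present in the corpus."""

    def __init__(self):
        self.words = frozenset()

    def on_commit(self, indexed_terms):
        # Rebuild once per commit, so no index query is needed per request.
        self.words = frozenset(term.lower() for term in indexed_terms)

    def filter_suggestions(self, suggestions):
        # Keep only suggestions that occur in the corpus, preserving order.
        return [s for s in suggestions if s.lower() in self.words]


corpus = CorpusDictionary()
corpus.on_commit(["london", "flat", "house", "garden"])
kept = corpus.filter_suggestions(["flan", "flat", "float"])
```

This only checks single words against the corpus, so it would not replace a
check that a whole combination of suggestions matches a document.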

Cheers,
Dan

On Tue, Jul 27, 2010 at 7:04 PM, Mark Holland <mark.holland@zoopla.co.uk> wrote:

> Hi,
>
> I found the suggestions returned from the standard Solr spellcheck not to be
> that relevant. By contrast, aspell, given the same dictionary and misspelled
> words, gives much more accurate suggestions.
>
> I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
> the Java aspell library. I also extended the SpellCheckComponent to take the
> matrix of suggested words and query the corpus to find the first combination
> of suggestions which returned a match. This works well for my use case,
> where term frequency is irrelevant to spelling or scoring.
>
> I'd like to publish the code in case someone finds it useful (although it's
> a bit crude at the moment and will need a decent tidy up). Would it be
> appropriate to open up a Jira issue for this?
>
> Cheers,
> ~mark
>
> On 27 July 2010 09:33, dan sutton <danbsutton@gmail.com> wrote:
>
> > Hi,
> >
> > I've recently been looking into spellchecking in Solr, and was struck by
> > how limited the usefulness of the tool was.
> >
> > Like most corpora, ours contains lots of different spelling mistakes for
> > the same word, so 'spellcheck.onlyMorePopular' is not really that useful
> > unless you click on it numerous times.
> >
> > I was thinking that, since most of the time people spell words correctly,
> > why was there no other frequency parameter that could enter into the
> > score? i.e. something like:
> >
> > spell_score ~ edit_dist * freq
> >
> > I'm sure others have come across this issue, and I was wondering what
> > steps/algorithms they have used to overcome these limitations?
> >
> > Cheers,
> > Dan
> >
>
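
On the spell_score idea in my original mail above, here is a rough sketch of
one way it could be read. The exact form is an assumption on my part: closeness
is taken as 1 / (1 + edit distance) and frequency is damped with a log so that
very common terms do not swamp everything. The term counts are invented for
the example:

```python
import math


def edit_distance(a, b):
    # Classic Levenshtein distance, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def spell_score(query, candidate, freq):
    # Assumed scoring: log-damped frequency divided by (1 + distance).
    return math.log1p(freq) / (1 + edit_distance(query, candidate))


def rank(query, term_freqs):
    # Highest score first.
    return sorted(term_freqs,
                  key=lambda t: spell_score(query, t, term_freqs[t]),
                  reverse=True)


freqs = {"receive": 950, "receve": 3, "recipe": 300}  # invented counts
ranked = rank("recieve", freqs)
```

With a score like this, a rare misspelling that happens to be one edit away
can still rank below a common, correctly spelled word two edits away.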
