lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Fastest way to import a giant word list into Solr/Lucene?
Date Sat, 31 Oct 2015 03:13:59 GMT
Read the links I have sent.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 30, 2015, at 7:10 PM, Robert Oschler <robert.oschler@gmail.com> wrote:
> 
> Thanks Walter.  Are there any open source spell checkers that implement the
> Peter Norvig or Damerau-Levenshtein algorithms?  I'm short on time so I
> have to keep the custom coding down to a minimum.
> 
> 
> On Fri, Oct 30, 2015 at 8:02 PM, Walter Underwood <wunder@wunderwood.org>
> wrote:
> 
>> Dedicated spell-checkers have better algorithms than Solr. They usually
>> handle transposed characters as well as inserted, deleted, or substituted
>> characters. This is an enhanced version of Levenshtein distance. It is
>> called Damerau-Levenshtein and is too expensive to use in Solr search.
>> Spell correctors can also use a bigger distance than 2, unlike Solr.
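
To make the transposition point above concrete, here is a minimal sketch of the restricted form of Damerau-Levenshtein distance (often called optimal string alignment). It is not code from this thread or from Solr/Lucene; the function name osa_distance is made up for illustration. It counts an adjacent swap such as "teh" -> "the" as one edit, where plain Levenshtein distance would count two.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each as a single edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "eh" <-> "he"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

The table costs O(len(a) * len(b)) time and memory per pair of words, which is fine for checking one query word against a short candidate list but would be slow if run naively against every entry of a very large word list.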
>> 
>> The Peter Norvig corrector also handles words that have been run together.
>> The Norvig corrector has been translated to many different computer
>> languages.
>> 
>> The Norvig corrector is an interesting approach. It is well worth reading
>> this short article to learn more about spelling correction.
>> 
>> http://norvig.com/spell-correct.html
>> 
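
For readers who want the gist before reading the article, the following is a rough sketch in the spirit of the Norvig corrector: generate every string within one or two simple edits of the input, keep the ones found in the word list, and pick the most frequent. The COUNTS table is made-up sample data standing in for real word frequencies, and the function names are illustrative, so treat this as an approximation of the idea rather than the article's exact code.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical sample data; a real corrector would count words in a large corpus.
COUNTS = Counter({"spelling": 5, "spewing": 1, "turbine": 3, "turbines": 2})


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two simple edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words, counts):
    """The subset of words that appear in the frequency table."""
    return {w for w in words if w in counts}


def correct(word, counts):
    """Prefer the word itself, then known words 1 edit away, then 2 edits away."""
    candidates = (
        known([word], counts)
        or known(edits1(word), counts)
        or known(edits2(word), counts)
        or {word}
    )
    # sorted() makes tie-breaking deterministic
    return max(sorted(candidates), key=lambda w: counts[w])
```

With the sample table, correct("speling", COUNTS) picks "spelling" over "spewing" because it is more frequent, and correct("tubrine", COUNTS) recovers "turbine" through a single transposition.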
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Oct 30, 2015, at 4:45 PM, Robert Oschler <robert.oschler@gmail.com>
>>> wrote:
>>> 
>>> Hello Walter and Mikhail,
>>> 
>>> Thank you for your answers.  Do those spell checkers have the same or
>>> better fuzzy matching capability than Solr/Lucene has (Levenshtein, max
>>> distance 2)?  That's a critical requirement for my application.  I take it
>>> from your suggestion of these spell checker apps that they can easily be
>>> extended with a user-defined, supplementary dictionary, yes?
>>> 
>>> Thanks.
>>> 
>>> On Fri, Oct 30, 2015 at 3:07 PM, Mikhail Khludnev <
>>> mkhludnev@griddynamics.com> wrote:
>>> 
>>>> Perhaps
>>>> FileBasedSpellChecker
>>>> https://cwiki.apache.org/confluence/display/solr/Spell+Checking
>>>> 
>>>> On Fri, Oct 30, 2015 at 9:37 PM, Robert Oschler <robert.oschler@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hello everyone,
>>>>> 
>>>>> I have a gigantic list of industry terms that I want to import into a
>>>>> Solr/Lucene instance running on an AWS box.  What is the fastest way to
>>>>> import the list into my Solr/Lucene instance?  I have admin/sudo
>>>>> privileges on the box.
>>>>> 
>>>>> Also, is there a document that shows me how to set up my Solr/Lucene
>>>>> config file to be optimized for fast searches on single word entries
>>>>> using fuzzy search?  I intend to use this Solr/Lucene instance to do
>>>>> spell checking on the big industry word list I mentioned above.  Each
>>>>> data record will be a single word from the file.  I'll want to take a
>>>>> single word query and do a fuzzy search on the word against the index
>>>>> (Levenshtein, max distance 2 as per Solr/Lucene's fuzzy search feature).
>>>>> So what parameters will configure Solr/Lucene to be optimized for such a
>>>>> search?  Also, if a document shows the best index/read parameters to
>>>>> support single word fuzzy searching then that would be a big help too.
>>>>> Note, the contents of the index will change very infrequently if that
>>>>> affects the optimal parameter mix.
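
As a rough illustration of the setup described in the question, the sketch below prepares a word list as one-column CSV (a format Solr's update handler can accept) and builds a single-word fuzzy query using the Lucene query syntax term~N. The field name "word" and both function names are assumptions made for this example. Nothing here comes from the thread, and it does not address tuning.

```python
import csv
import io
import re

# Characters with special meaning in the Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def words_to_csv(words, field="word"):
    """Render words as one-column CSV with a header row naming the field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([field])  # header row: the (assumed) Solr field name
    for w in words:
        writer.writerow([w])
    return buf.getvalue()


def fuzzy_query(word, field="word", max_edits=2):
    """Build a fuzzy query string such as word:turbine~2."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy queries allow at most 2 edits")
    # Backslash-escape query-syntax characters inside the term.
    escaped = _LUCENE_SPECIAL.sub(r"\\\1", word)
    return f"{field}:{escaped}~{max_edits}"
```

For example, fuzzy_query("turbine") yields word:turbine~2, which would be passed as the q parameter of a select request. The CSV text would be posted to the core's update handler with a CSV content type. The exact URL and core name depend on the installation.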
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks,
>>>>> Robert Oschler
>>>>> Twitter -> http://twitter.com/roschler
>>>>> http://www.RobotsRule.com/
>>>>> http://www.Robodance.com/
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> Principal Engineer,
>>>> Grid Dynamics
>>>> 
>>>> <http://www.griddynamics.com>
>>>> <mkhludnev@griddynamics.com>
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Thanks,
>>> Robert Oschler
>>> Twitter -> http://twitter.com/roschler
>>> http://www.RobotsRule.com/
>>> http://www.Robodance.com/
>> 
>> 
> 
> 
> -- 
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/

