lucene-solr-user mailing list archives

From <scott.ta...@fuse.net>
Subject Re: Spell Check Handler
Date Sun, 14 Oct 2007 20:32:05 GMT
Matthew,

Thanks for the question.  The words come from your own index, so the dictionary is built
from the terms that are already stored in Solr.  This makes sense: if the spell checker
suggested a word that is not in the Solr index, the suggestion would not help the user find
what they are looking for.

You can control which fields in Solr feed the spell checker.  You can also have more than
one spell checker, each focused on a specific subject.

The following example of a SpellCheckerRequestHandler is based upon the one I created for
the test case.  You need to add this to your solrconfig.xml file.  You can view the whole thing
within the Solr source code once it is committed to the main code stream.  The files are
/src/test/test-files/solr/conf/solrconfig-spellchecker.xml and schema-spellchecker.xml in
the same directory.

  <!-- SpellCheckerRequestHandler takes in a word (or several words) as the
       value of the "q" parameter and returns a list of alternative spelling
       suggestions.  If invoked with a ...&cmd=rebuild, it will rebuild the
       spellchecker index.
  -->
  <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <int name="suggestionCount">20</int>
       <float name="accuracy">0.60</float>
     </lst>
     
     <!-- Main init params for handler -->
     
     <!-- The directory where your SpellChecker Index should live.   -->
     <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
     <!-- If this option is not specified, a RAM directory will be used -->
     <str name="spellcheckerIndexDir">spell</str>
     
     <!-- the field in your schema that you want to be able to build -->
     <!-- your spell index on. This should be a field that uses a very -->
     <!-- simple FieldType without a lot of Analysis (ie: string) -->
     <str name="termSourceField">spell</str>
     
   </requestHandler>

Some comments:
  - The termSourceField should be a field you have defined within your Solr schema file.
See the notes below about the use of this field.
  - The spellcheckerIndexDir is the name of the directory that contains the spellchecker index.
In my example, I used spell, and it will be at the same level as data and conf.  You can
name it whatever you like.
  - If you use the name "/spellchecker", the URL will be more RESTful.
  - If you need more than one spell checker in use at a time, then you will need to
change the name, spellcheckerIndexDir, and termSourceField for each one.
  - If you have more than one spell checker handler pointing at the same index directory, then
when you rebuild the index through one of the handlers, the other handlers will not know it
has been rebuilt.  To resolve this issue, you may have to restart Solr.
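
As a rough sketch of the multiple-spell-checker case, a second handler could sit next to the
first one in solrconfig.xml.  The handler name "spellchecker_titles", the directory
"spell_titles", and the field "spell_titles" are made-up names for illustration only; the
point is simply that all three settings differ from the first handler.

```xml
<!-- Hypothetical second spell checker; the three names below are illustrative -->
<requestHandler name="spellchecker_titles" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="suggestionCount">20</int>
    <float name="accuracy">0.60</float>
  </lst>

  <!-- Its own index directory, so the two handlers never share one -->
  <str name="spellcheckerIndexDir">spell_titles</str>

  <!-- Its own source field, which must also be defined in the schema -->
  <str name="termSourceField">spell_titles</str>
</requestHandler>
```

Giving each handler its own directory also avoids the stale-index situation described in the
last comment above.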


The following components are from the schema-spellchecker.xml file:

	<fieldType name="spellText" class="solr.TextField" positionIncrementGap="100">
	  <analyzer type="index">
	    <tokenizer class="solr.StandardTokenizerFactory"/>
	    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
	    <filter class="solr.StandardFilterFactory"/>
	    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	  </analyzer>
	  <analyzer type="query">
	    <tokenizer class="solr.StandardTokenizerFactory"/>
	    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
expand="true"/>
	    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
	    <filter class="solr.StandardFilterFactory"/>
	    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	  </analyzer>
	</fieldType>


   <field name="spell" type="spellText" indexed="true" stored="true" />



Some comments on the schema items above:
  - The fieldType must be contained within the "types" group
  - The spellText fieldType can be named whatever you want
  - The spellText fieldType should not be too aggressive about stemming or otherwise
modifying the contents of the field
  - You could use string instead of the spellText fieldType defined above, but it does not
have to be that restrictive

  - The field "spell" needs to be within the "fields" group with your other defined fields
  - You can always use copyField to copy another field's content into your "spell"
field:
      <copyField source="misc" dest="spell"/>
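
If several fields should feed the dictionary, the same idea can probably be repeated once per
source field.  This is only a sketch: "title" and "description" are hypothetical field names,
and adding multiValued="true" to the destination is an assumption about what the schema needs
once more than one source is copied into it.

```xml
<!-- Destination field from the example above; multiValued is an assumption
     for the case where more than one source field is copied in -->
<field name="spell" type="spellText" indexed="true" stored="true" multiValued="true" />

<!-- "title" and "description" are hypothetical source fields -->
<copyField source="title" dest="spell"/>
<copyField source="description" dest="spell"/>
```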


Some notes on the name of the handler:
  - If you precede the name with "/", you can use the first URL below instead of the second one:
  - using the name "/spellchecker"
     http://yourSolrSite/solr/spellchecker?q=sialophosphoprotein
  - using the name "spellchecker"
    http://yourSolrSite/solr/select?qt=spellchecker&q=sialophosphoprotein
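
For the "/spellchecker" form, the only change to the handler shown earlier should be the name
attribute.  A sketch that reuses the same settings as the solrconfig.xml example above:

```xml
<!-- Same handler as in the example above; only the name gains a leading "/" -->
<requestHandler name="/spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="suggestionCount">20</int>
    <float name="accuracy">0.60</float>
  </lst>
  <str name="spellcheckerIndexDir">spell</str>
  <str name="termSourceField">spell</str>
</requestHandler>
```

With this name the handler is addressed by its own path, so the qt parameter is not needed in
the URL.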


Matthew, I hope you find this somewhat helpful.

   Scott Tabar

---- Matthew Runo <mruno@zappos.com> wrote: 
Where does the index come from in the first place? Do we have to  
enter the words, or are they entered as documents enter the SOLR index?

I'd love to be able to use my own documents as the spell check index  
of "correctly spelled words".

+--------------------------------------------------------+
  | Matthew Runo
  | Zappos Development
  | mruno@zappos.com
  | 702-943-7833
+--------------------------------------------------------+


On Oct 11, 2007, at 7:08 AM, <scott.tabar@fuse.net>  
<scott.tabar@fuse.net> wrote:

> Climbingrose,
>
> I think you make a valid point.  Each person may have a different  
> concept of how something should work with their application.
>
> My thought on the subject of spell checking multiple words:
>   - the parameter "multiWords" enables spell checking on each word  
> in "q" parameter instead of on the whole field
>   - each word is then represented in its own entry in a list of all  
> words that are checked
>   - to identify each word that is being checked within that entry,  
> it is identified by the key "words"
>   - to identify if the word was found exactly as it is within the  
> spell checker's index, the "exist" key contains this information
>   - Since there can be suggestions for both misspelled words and  
> words that are spelled correctly, the list of suggestions is also  
> included for both correctly spelled and misspelled words, even if  
> the suggestion list is empty.
>
>   - My vision is that if a user has a search query of multiple
> words and wants to check them, "multiWords" will check all of the
> words at one time, independently of each other, and return the
> list.  The presenting web app can then show the user which words
> are misspelled and which ones have suggestions.  The user can then
> work with the various lists of suggestions without having to re-hit
> Solr.  Naturally, if the user manually changes a word, then Solr
> will have to be re-hit, but providing a single list of all words,
> including suggestions for correct words along with incorrect words,
> will help simplify applications (by reducing iteration over each
> word) and will help reduce the number of hits to the Solr server.
>
>
>> 1) I assume that when a user enters a misspelled multiword query, we
>> should only check the words that are actually misspelled. For example,
>> if the user enters "life expectancy calculatar", which has "calculator"
>> misspelled, we should only spellcheck "calculatar".
>
> I think I understand what you mean in the above statement, but you
> must admit, it does sound funny.  After all, how do you identify
> that a word is misspelled by NOT using the spelling checker?
> Correct me if I am wrong, but I think you intended to say that
> suggestions should only be included for words that are identified
> as misspelled.  If this is the case, then I would have to disagree
> with you.  The user may be interested in finding words that might
> mean the same, but are more popular (appear in more indexed
> documents within the Lucene index).  Hence the reason why I added
> the result field "exist" to identify that a word is spelled
> correctly even if there is a list of suggestions.  Please note, the
> situation can also exist where a word is misspelled and there are
> no suggestions, so one cannot use the suggestion list as an
> indicator of the correctness of the individual word(s).
>
>
>> 2) I only return the best string for a misspelled query.
>
> You can also use the parameter "suggestionCount=1" to control how  
> many words are returned.  In this case, it will do what your code  
> is doing, but still allow the client to dynamically change this  
> value without the need to hard code it within the main source code.
>
>
> As far as only including terms that are more popular than the word  
> that is being checked, there is already a parameter  
> "onlyMorePopular" that you can use to dynamically control this  
> feature from the client side so it does not have to be hard coded  
> within the spelling checker.
>
> Review these parameter options on the wiki, but keep in mind I have  
> not updated the wiki with my changes or the new parameter and  
> result fields:
> http://wiki.apache.org/solr/SpellCheckerRequestHandler
>
>    Thanks Climbingrose,
>
>      Scott Tabar
>
>
>
>
> ---- climbingrose <climbingrose@gmail.com> wrote:
> Just to clarify this line of code:
>
> String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
> req.getSearcher().getReader(), restrictToField, true);
>
> I only return suggestions if they are more popular than termText. You
> probably need to use code in Scott's patch to make this behaviour
> configurable.
>
> On 10/11/07, climbingrose <climbingrose@gmail.com> wrote:
>>
>> Hi all,
>>
>> I've been so busy the last few days that I haven't replied to this email. I
>> modified SpellCheckerHandler a while ago to include support for multiword
>> queries. To be honest, I didn't have time to write unit tests for the code.
>> However, I deployed it in a production environment and it has been working
>> for me so far. My version, however, makes two assumptions:
>>
>> 1) I assume that when a user enters a misspelled multiword query, we should
>> only check the words that are actually misspelled. For example, if the user
>> enters "life expectancy calculatar", which has "calculator" misspelled, we
>> should only spellcheck "calculatar".
>> 2) I only return the best string for a misspelled query.
>>
>> I guess I can just paste the code directly here so that others can adapt it
>> for their own purposes. If you have any questions, just send me an email.
>> I'll be happy to help you.
>>
>>         StringBuffer buf = null;
>>         if (null != words && !"".equals(words.trim())) {
>>             // Tokenize the query with the analyzer of the configured field
>>             Analyzer analyzer = req.getSchema().getField(field).getType().getAnalyzer();
>>             TokenStream source = analyzer.tokenStream(field, new StringReader(words));
>>             Token t;
>>             boolean hasSuggestion = false;
>>             boolean termExists = false;
>>             while (true) {
>>                 try {
>>                     t = source.next();
>>                 } catch (IOException e) {
>>                     t = null;
>>                 }
>>                 if (t == null)
>>                     break;
>>
>>                 String termText = t.termText();
>>                 // The final "true" keeps only suggestions more popular than termText
>>                 String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
>>                         req.getSearcher().getReader(), restrictToField, true);
>>                 if (suggestions != null && suggestions.length > 0) {
>>                     if (!suggestions[0].equals(termText)) {
>>                         hasSuggestion = true;
>>                     }
>>                     if (buf == null) {
>>                         buf = new StringBuffer(suggestions[0]);
>>                     } else
>>                         buf.append(" ").append(suggestions[0]);
>>                 } else if (spellChecker.exist(termText)) {
>>                     // No suggestion, but the term itself is in the spell index
>>                     termExists = true;
>>                     if (buf == null) {
>>                         buf = new StringBuffer(termText);
>>                     } else
>>                         buf.append(" ").append(termText);
>>                 } else {
>>                     // Unknown term with no suggestion: give up on the whole query
>>                     hasSuggestion = false;
>>                     termExists = false;
>>                     break;
>>                 }
>>             }
>>             try {
>>                 source.close();
>>             } catch (IOException e) {
>>                 // ignore
>>             }
>>             // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
>>             //         nullReader, restrictToField, onlyMorePopular);
>>             if (hasSuggestion || termExists)
>>                 rsp.add("suggestions", buf.toString());
>>             else
>>                 rsp.add("suggestions", null);
>>         }
>>
>>
>>
>> On 10/11/07, scott.tabar@fuse.net <scott.tabar@fuse.net> wrote:
>>>
>>> Hoss,
>>>
>>> I had a feeling someone would be quoting Yonik's Law of  
>>> Patches!  ;-)
>>>
>>> For now, this is done.
>>>
>>> I created the changes, created JavaDoc comments on the various settings
>>> and their expected output, created a JUnit test for the
>>> SpellCheckerRequestHandler which tests various components of the handler,
>>> and I also created the supporting configuration files for the JUnit tests
>>> (schema and solrconfig files).
>>>
>>> I attached the patch to the JIRA issue, so now we just have to wait until
>>> it gets added back in to the main code stream.
>>>
>>> For anyone who is interested, here is a link to the JIRA:
>>> https://issues.apache.org/jira/browse/SOLR-375
>>>
>>> Could someone please drop me a hint on how to update the wiki or any
>>> other documentation that could benefit from being updated?  I'd like to
>>> help out as much as possible, but first I need to know "how". ;-)
>>>
>>> When these changes do get committed back in to the daily build, please
>>> review the generated JavaDoc for information on how to utilize these new
>>> features.
>>> If anyone has any questions or comments, please do not hesitate to ask.
>>>
>>>
>>> As a general note of self-critique on these changes, I am not 100%
>>> sure of the way I implemented the "nested" structure when the
>>> "multiWords" parameter is used.  My interest is that it should work
>>> smoothly with some other technology such as Prototype using the JSON
>>> output type.  Unfortunately, I will not be getting a chance to start on
>>> that coding until next week, so it is up in the air as to whether this
>>> structure will be conducive or not.  I am planning on providing more
>>> details in the documentation on how to utilize these modifications
>>> in Prototype and AJAX when I get a chance (and even provide links to a
>>> production site so you can see it in action and view the source if
>>> interested).  So stay tuned...
>>>
>>>    Thanks for everyone's time,
>>>       Scott Tabar
>>>
>>> ---- Chris Hostetter <hossman_lucene@fucit.org> wrote:
>>>
>>> : If you like, I can post the source code changes that I made to the
>>> : SpellCheckerRequestHandler, but at this time I am not ready to open a
>>> : JIRA issue and submit the changes back through the subversion.  I will
>>> : need to do a little more testing, documentation, and create some unit
>>> : tests to cover all of these changes, but what I have been able to
>>> : perform, it is working very well.
>>>
>>> Keep in mind "Yonik's Law Of Patches" ...
>>>
>>>         "A half-baked patch in Jira, with no documentation, no tests
>>>         and no backwards compatibility is better than no patch at all."
>>>         http://wiki.apache.org/solr/HowToContribute
>>>
>>> ...even if you don't think the code is "solid" yet, if you want to
>>> eventually make it available to people, making a "rough" version
>>> available to people early gives other people the opportunity to help you
>>> make it solid (by writing unit tests, fixing bugs, and adding
>>> documentation).
>>>
>>>
>>> -Hoss
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>>
>> Cuong Hoang
>
>
>
>
> -- 
> Regards,
>
> Cuong Hoang
>


