lucene-dev mailing list archives

From Mathieu Lecarme <math...@garambrogne.net>
Subject Re: [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
Date Sun, 02 Mar 2008 13:16:22 GMT
Hmm, the quotes and questions disappeared.

On 2 March 2008, at 13:32, Mathieu Lecarme (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574214#action_12574214 ]
>
> Mathieu Lecarme commented on LUCENE-1190:
> -----------------------------------------
>
>
>> For example, I don't know what you mean by "Some Lucene features
>> need a list of referring word".  Do you mean "a list of associated
>> words"?

> With a FuzzyQuery, for example, you iterate over the Terms in the
> index, looking for the nearest ones. PrefixQuery and regular
> expression queries work in a similar way.
> If you assume that a fuzzy query will never return a word whose length
> differs by more than 1 (size+1 or size-1), you can restrict the list
> of candidates, and an ngram index can help you even more.
>
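The candidate restriction described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Lucene code and not the code from the attached patch: the function names (`bigrams`, `build_gram_index`, `fuzzy_candidates`, `fuzzy_search`) are hypothetical, and a gram size of 2 is an assumption. The sketch indexes words by bigram, keeps only candidates whose length differs from the query by at most 1, and runs the edit-distance check on that reduced list rather than on every term.

```python
from collections import defaultdict


def bigrams(word):
    """Return the set of 2-character grams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_gram_index(words):
    """Map each bigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in bigrams(word):
            index[gram].add(word)
    return index


def fuzzy_candidates(query, gram_index, max_len_diff=1):
    """Words sharing a bigram with the query, with a length within max_len_diff."""
    found = set()
    for gram in bigrams(query):
        for word in gram_index.get(gram, ()):
            if abs(len(word) - len(query)) <= max_len_diff:
                found.add(word)
    return found


def fuzzy_search(query, gram_index, max_distance=1):
    """Run the expensive distance check only on the restricted candidates."""
    candidates = fuzzy_candidates(query, gram_index)
    return sorted(w for w in candidates if edit_distance(query, w) <= max_distance)


index = build_gram_index(["lucene", "lucent", "lucenes", "luxury", "index", "lucid"])
print(fuzzy_search("lucene", index))  # ['lucene', 'lucenes', 'lucent']
```

The bigram filter is a heuristic: a very short word can be within one edit of the query and still share no bigram with it (for example "ab" and "xb"), so this sketch can miss such matches.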
> Some token filters destroy the word; a stemmer, for example. If you
> want to search widely, a stemmer can help you, but you can't use
> PrefixQuery with stemmed words. So you can stem the words in a lexicon
> and use it as a source of synonyms. You index "dog" and look for
> "doggy", "dogs" and "dog".
> A lexicon can use a static list of words, from a hunspell index or
> Wikipedia parsing, or words extracted from your index.
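The "stem as synonym key" idea can be illustrated as follows. This is only a sketch in plain Python under stated assumptions: `toy_stem` is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer, and `build_stem_lexicon` and `expand` are made-up names, not part of Lucene or of the patch. The main index keeps the unstemmed words; the lexicon groups them by stem so a query term can be expanded to its siblings.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    for suffix in ("gy", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def build_stem_lexicon(words, stem=toy_stem):
    """Group the original (unstemmed) words by their stem."""
    lexicon = defaultdict(set)
    for word in words:
        lexicon[stem(word)].add(word)
    return lexicon


def expand(term, lexicon, stem=toy_stem):
    """Return every known word sharing the term's stem, or the term itself."""
    return sorted(lexicon.get(stem(term), {term}))


lexicon = build_stem_lexicon(["dog", "dogs", "doggy", "cat", "cats"])
print(expand("dog", lexicon))  # ['dog', 'doggy', 'dogs']
```

Because the expanded terms are real words from the lexicon, they can still be used with prefix or fuzzy matching against the unstemmed index.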

>> Each meta is a Field.... what do you mean by that?  Could you
>> please give an example?

> For the word "Lucene":
>
> word:lucene
> pop:42
> anagram.anagram:celnu
> aphone.start:LS
> aphone.gram:LS
> aphone.gram:SN
> aphone.end:SN
> aphone.size:3
> aphone.phonem:LSN
> ngram.start:lu
> ngram.gram:lu
> ngram.gram:uc
> ngram.gram:ce
> ngram.gram:en
> ngram.gram:ne
> ngram.end:ne
> ngram.size:6
> stemmer.stem:lucen
>
>
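As a rough illustration of how such a field list could be derived, the sketch below reproduces only the `word` and `ngram.*` entries of the listing above. It is plain Python, not the patch's code: the function name `ngram_fields` is hypothetical, and the gram size of 2 is inferred from the listing. The `pop`, `anagram`, `aphone` and `stemmer` fields are left out because their exact rules are not given in this message.

```python
def ngram_fields(word, n=2):
    """Return (field, value) pairs for the word and its n-grams."""
    word = word.lower()
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    fields = [("word", word)]
    if grams:
        fields.append(("ngram.start", grams[0]))             # first gram
        fields.extend(("ngram.gram", g) for g in grams)      # every gram
        fields.append(("ngram.end", grams[-1]))              # last gram
    fields.append(("ngram.size", str(len(word))))            # word length
    return fields


for name, value in ngram_fields("Lucene"):
    print(f"{name}:{value}")
```

Each pair would correspond to one Field of the Document that represents the word.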

>> Hm, not sure I know what you mean.  Are you saying that once you
>> create a sufficiently large lexicon/dictionary/index, the number of
>> new terms starts decreasing? (Heaps' Law? http://en.wikipedia.org/wiki/Heaps'_law)
> Yes.
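For reference, Heaps' law is usually written V(n) = K * n^beta with 0 < beta < 1, where n is the number of tokens read and V(n) the number of distinct terms; K and beta depend on the corpus. The sketch below is plain Python with hypothetical function names. It shows the formula and a simple way to measure the growth curve on your own token stream, to check whether the vocabulary is flattening.

```python
def heaps_estimate(n_tokens, k, beta):
    """Heaps' law estimate of the vocabulary size: V(n) = K * n**beta."""
    return k * n_tokens ** beta


def vocabulary_growth(tokens, step):
    """Record (tokens seen, distinct terms seen) every `step` tokens."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % step == 0:
            curve.append((count, len(seen)))
    return curve


tokens = "a b a c a b d a b c a b".split()
print(vocabulary_growth(tokens, step=4))  # [(4, 3), (8, 4), (12, 4)]
```

On a real corpus the second number of each pair keeps growing, but more and more slowly, which is the behaviour the answer above relies on.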
>
>> a lexicon object for merging spellchecker and synonyms from stemming
>> --------------------------------------------------------------------
>>
>>                Key: LUCENE-1190
>>                URL: https://issues.apache.org/jira/browse/LUCENE-1190
>>            Project: Lucene - Java
>>         Issue Type: New Feature
>>         Components: contrib/*, Search
>>   Affects Versions: 2.3
>>           Reporter: Mathieu Lecarme
>>        Attachments: aphone+lexicon.patch, aphone+lexicon.patch
>>
>>
>> Some Lucene features need a list of referring word. Spellchecking
>> is the basic example, but synonyms are another use. Other tools can
>> be used more smoothly with a list of words, without disturbing the
>> main index: stemming and other simplifications of words (anagram,
>> phonetic ...).
>> For that, I suggest a Lexicon object, which contains words (Term +
>> frequency) and which can be built from a Lucene Directory or from
>> plain text files.
>> Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
>> ISOLatin1AccentFilter should be the most useful).
>> Lexicon uses a Lucene Directory; each Word is a Document, and each
>> meta is a Field (word, ngram, phonetic, fields, anagram, size ...).
>> Above a minimum size, the number of different words used in an index
>> can be considered stable. So a standard Lexicon (built from
>> Wikipedia, for example) can be used.
>> A similarTokenFilter is provided.
>> A spellchecker will come soon.
>> A fuzzySearch implementation and a neutral synonym TokenFilter can
>> be done.
>> Unused words can be removed on demand (lazy delete?).
>> Any criticism or suggestions?
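To make the "words (Term + frequency) built from plain text" idea concrete, here is a minimal sketch in plain Python. It is not the proposed Lexicon class: `fold_accents` and `build_lexicon` are hypothetical names, lowercasing and accent folding only approximate what LowerCaseFilter and ISOLatin1AccentFilter do, and a `Counter` stands in for the Lucene Directory.

```python
import re
import unicodedata
from collections import Counter


def fold_accents(text):
    """Strip combining marks after canonical decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_lexicon(lines):
    """Count normalized words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for token in re.findall(r"\w+", line):
            counts[fold_accents(token.lower())] += 1
    return counts


lexicon = build_lexicon(["Caf\u00e9 au lait", "cafe CAF\u00c9", "lait"])
print(lexicon.most_common(2))  # [('cafe', 3), ('lait', 2)]
```

The same function accepts an open text file, since iterating over a file yields its lines.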
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



