lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Using org.apache.lucene.analysis.compound
Date Wed, 21 Oct 2009 20:32:34 GMT
There is some information on this topic in the package summary:

http://lucene.apache.org/java/2_9_0/api/contrib-analyzers/org/apache/lucene/analysis/compound/package-summary.html

In short, for a large list (there is no limit in the code), you will want to
use a hyphenation grammar as well: HyphenationCompoundWordTokenFilter instead
of the brute-force dictionary approach, for better speed.

There is also a pointer to some dictionaries at OpenOffice; I'd also look
around at spell checkers and similar resources elsewhere if you can't find
one that fits your needs.

On Wed, Oct 21, 2009 at 4:19 PM, Paul Libbrecht <paul@activemath.org> wrote:

> Great,
>
> Now the next question: which dictionary do you guys use? How big can it
> be? Is 50,000 words acceptable?
>
> paul
>
>
> On 21 Oct 2009, at 21:23, Robert Muir wrote:
>
>
>> Paul, I think in general scoring should take care of this too; it's all
>> about your dictionary, same as the previous example.
>> This is because überwachungsgesetz matches 3 tokens (überwachungsgesetz,
>> überwachung, gesetz) but überwachung gesetz only matches 2.
>>
>> überwachungsgesetz
>> 0.37040412 = (MATCH) sum of:
>>  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>   0.5 = queryWeight(field:überwachungsgesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>     1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>  0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>>   0.5 = queryWeight(field:überwachung), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>     1.0 = tf(termFreq(field:überwachung)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>   0.5 = queryWeight(field:überwachungsgesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>     1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>  0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>>   0.5 = queryWeight(field:gesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>     1.0 = tf(termFreq(field:gesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.5 = fieldNorm(field=field, doc=0)
>>
>> überwachung gesetz
>> 0.30685282 = (MATCH) sum of:
>>  0.15342641 = (MATCH) sum of:
>>   0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>>     0.5 = queryWeight(field:überwachung), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>       1.0 = tf(termFreq(field:überwachung)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>   0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>>     0.5 = queryWeight(field:überwachung), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>       1.0 = tf(termFreq(field:überwachung)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>  0.15342641 = (MATCH) sum of:
>>   0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>>     0.5 = queryWeight(field:gesetz), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>       1.0 = tf(termFreq(field:gesetz)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>   0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>>     0.5 = queryWeight(field:gesetz), product of:
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       1.6294457 = queryNorm
>>     0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>       1.0 = tf(termFreq(field:gesetz)=1)
>>       0.30685282 = idf(docFreq=1, maxDocs=1)
>>       0.5 = fieldNorm(field=field, doc=0)
>>
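The ranking effect in the two explain outputs above reduces to a toy model (an illustrative Python sketch; the flat per-term weight is a stand-in for Lucene's tf-idf and norm factors, not the real Similarity formula):

```python
# Toy model of the scoring difference, not Lucene's actual math: the indexed
# compound decompounds to three terms, so the compound query matches one
# more index term than the two-word query, and the summed score is higher.
doc_terms = {"überwachungsgesetz", "überwachung", "gesetz"}  # analyzed field

def sum_score(query_terms, doc_terms, per_term_weight=0.0767):
    # Flat per-term weight, standing in for tf * idf * norms per matched term.
    return sum(per_term_weight for t in query_terms if t in doc_terms)

compound_q = ["überwachungsgesetz", "überwachung", "gesetz"]  # decompounded
split_q = ["überwachung", "gesetz"]
print(sum_score(compound_q, doc_terms) > sum_score(split_q, doc_terms))  # True
```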
>> On Wed, Oct 21, 2009 at 3:16 PM, Paul Libbrecht <paul@activemath.org>
>> wrote:
>>
>>> Can the dictionary have weights?
>>>
>>> überwachungsgesetz alone probably needs a higher rank than überwachung
>>> and gesetz, no?
>>>
>>> paul
>>>
>>>
>>> On 21 Oct 2009, at 21:09, Benjamin Douglas wrote:
>>>
>>>
>>>> OK, that makes sense. So I just need to add all of the sub-compounds that
>>>> are real words at posIncr=0, even if they are combinations of other
>>>> sub-compounds.
>>>>
>>>> Thanks!
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>>> Sent: Wednesday, October 21, 2009 11:49 AM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Using org.apache.lucene.analysis.compound
>>>>
>>>> Yes, your dictionary :)
>>>>
>>>> If überwachungsgesetz is a real word, add it to your dictionary.
>>>>
>>>> For example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
>>>> "Gesetz", "Aufgabe", "Überwachung" } and you index
>>>> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
>>>> But if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
>>>> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
>>>> a big difference.
>>>>
>>>> All 3 queries will still match, but überwachungsgesetz will have a higher
>>>> score. This is because things are now analyzed differently:
>>>> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
>>>> additional token: Überwachungsgesetz.
>>>> So, back to your original question about these 'concatenations' of multiple
>>>> components: yes, compounds will do that, if they are real words. But it
>>>> won't just make them up.
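The token-level effect described here can be sketched in a few lines (illustrative Python, not the filter's real code; token positions and offsets are ignored):

```python
# Illustrative sketch, not DictionaryCompoundWordTokenFilter itself: the
# decompounder only emits subwords found in the dictionary, so expanding
# the dictionary adds the compound token; it never invents one.
def subword_tokens(word, dictionary, min_sub=3):
    word = word.lower()
    return [word[i:j]
            for i in range(len(word))
            for j in range(i + min_sub, len(word) + 1)
            if word[i:j] in dictionary]

base = {"rind", "fleisch", "draht", "schere", "gesetz", "aufgabe",
        "überwachung"}
expanded = base | {"überwachungsgesetz"}

print(subword_tokens("Rindfleischüberwachungsgesetz", base))
# ['rind', 'fleisch', 'überwachung', 'gesetz']
print(subword_tokens("Rindfleischüberwachungsgesetz", expanded))
# ['rind', 'fleisch', 'überwachung', 'überwachungsgesetz', 'gesetz']
```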
>>>>
>>>> "überwachungsgesetz"
>>>> 0.23013961 = (MATCH) sum of:
>>>> 0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>>>  0.5 = queryWeight(field:überwachungsgesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>>>>  0.5 = queryWeight(field:überwachung), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachung)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>>>  0.5 = queryWeight(field:überwachungsgesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>>>>  0.5 = queryWeight(field:gesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   1.6294457 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:gesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>>
>>>> "gesetzüberwachung"
>>>> 0.064782135 = (MATCH) sum of:
>>>> 0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>>>  0.2814906 = queryWeight(field:gesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:gesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>>>>  0.2814906 = queryWeight(field:überwachung), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>>>   1.0 = tf(termFreq(field:überwachung)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>>
>>>> "fleischgesetz"
>>>> 0.064782135 = (MATCH) sum of:
>>>> 0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>>>>  0.2814906 = queryWeight(field:fleisch), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>>>>   1.0 = tf(termFreq(field:fleisch)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>> 0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>>>  0.2814906 = queryWeight(field:gesetz), product of:
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.9173473 = queryNorm
>>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>>   1.0 = tf(termFreq(field:gesetz)=1)
>>>>   0.30685282 = idf(docFreq=1, maxDocs=1)
>>>>   0.375 = fieldNorm(field=field, doc=0)
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
>>>> <bbdouglas@basistech.com> wrote:
>>>>
>>>>> Thanks for all of the answers so far!
>>>>>
>>>>> Paul's question is similar to another aspect I am curious about:
>>>>>
>>>>> Given the way the sample word is analyzed, is there anything in the
>>>>> scoring
>>>>> mechanism that would rank "überwachungsgesetz" higher than
>>>>> "gesetzüberwachung" or "fleischgesetz"?
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>> Robert Muir
>>>> rcmuir@gmail.com
>>>>
>>>>
>>>
>>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com
