lucene-dev mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"
Date Mon, 22 Apr 2013 15:08:43 GMT
I would put this in 4.3. This is the first release with the position fix for edge ngrams, so
it would make sense to fix it all the way, rather than have two different levels of fix in
two different releases.

wunder

On Apr 22, 2013, at 6:17 AM, Simon Willnauer wrote:

> I think we can add this to 4.3. I can roll another RC for that.
> 
> simon
> 
> On Mon, Apr 22, 2013 at 3:11 PM, Jack Krupansky <jack@basetechnology.com> wrote:
>> Is this a fix to 4.3 (RC3?) or for a 4.3.1?
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Steve Rowe
>> Sent: Monday, April 22, 2013 2:07 AM
>> 
>> To: dev@lucene.apache.org
>> Subject: Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"
>> 
>> I've reopened LUCENE-4810 and attached a patch with a test and fix for this
>> problem. - Steve
>> 
>> On Apr 22, 2013, at 1:09 AM, Steve Rowe <sarowe@gmail.com> wrote:
>> 
>>> Actually, Walter, I misspoke: Morfologik is a lemmatizer: it produces
>>> surface forms.  Not really so incompatible, I think.
>>> 
>>> Regardless of the choice to use this particular sequence of filters,
>>> EdgeNGramTokenFilter shouldn't produce a bad stream.
>>> 
>>> Steve
>>> 
>>> On Apr 21, 2013, at 8:34 PM, Walter Underwood <wunder@wunderwood.org>
>>> wrote:
>>> 
>>>> Don't use a stemmer with edge ngrams.
>>>> 
>>>> Edge ngrams are a tool for matching the surface word. Stemmers are a tool
>>>> for matching the root. Those are logically incompatible transforms.
>>>> 
>>>> wunder
>>>> 
>>>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
>>>> 
>>>>> Karol has uncovered a bug introduced by LUCENE-4810
>>>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in
>>>>> Lucene/Solr 4.3.0.
>>>>> 
>>>>> The problem is an interaction between the Morfologik stemmer, which can
>>>>> produce multiple stems per input term, all but the first having a position
>>>>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for
>>>>> input terms that are at least as long as the minimum configured length, and
>>>>> passes through unchanged the position increment for the first ngram output
>>>>> for any given input term.
>>>>> 
>>>>> So what happens in Karol's case is that "T." has the period stripped by
>>>>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to",
>>>>> "tom" and "tona".  The first term "to" has a position increment of 1, but is
>>>>> not output by EdgeNGramTokenFilter, because its length is below the
>>>>> configured minimum of 3.  The second term "tom" is given a position
>>>>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum
>>>>> length, so gets output, and since it's the first output term for the input
>>>>> term "tom", the input position increment is left as-is in the output term:
>>>>> 0.  That's how the first output term gets a position increment of 0.
>>>>> 
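The sequence described above can be reproduced with a small toy model. This is only a sketch in Python, not Lucene code: tokens are plain (term, position increment) pairs, `edge_ngrams_rc2` and `check_first_increment` are hypothetical names, the maximum gram size of 10 is an assumption (only the minimum of 3 comes from the thread), and the check models just the "first position increment must be > 0" condition from the reported exception.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def edge_ngrams_rc2(tokens: List[Token], min_gram: int = 3,
                    max_gram: int = 10) -> List[Token]:
    """Toy model of the behaviour described in the thread (not the Lucene API).

    Terms shorter than min_gram produce no output, and the first ngram of
    each remaining term inherits that term's position increment unchanged.
    """
    out: List[Token] = []
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            continue  # term dropped; its position increment is lost
        sizes = range(min_gram, min(max_gram, len(term)) + 1)
        for i, size in enumerate(sizes):
            # first ngram keeps the input increment, later ngrams are stacked
            out.append((term[:size], pos_inc if i == 0 else 0))
    return out


def check_first_increment(tokens: List[Token]) -> None:
    """Model of the indexing-time check quoted in the reported exception."""
    if tokens and tokens[0][1] <= 0:
        raise ValueError(
            f"first position increment must be > 0 (got {tokens[0][1]})")


# Stems reported for "T" in the thread: "to" (+1), "tom" (+0), "tona" (+0)
stems = [("to", 1), ("tom", 0), ("tona", 0)]
ngrams = edge_ngrams_rc2(stems)
print(ngrams)

try:
    check_first_increment(ngrams)
except ValueError as exc:
    print("rejected:", exc)
```

In this model "to" is dropped together with its increment of 1, so the first surviving token, "tom", arrives with an increment of 0 and the check rejects the stream.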
>>>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
>>>>> EdgeNGramTokenFilter indiscriminately set all output terms' position
>>>>> increments to 1, so that explains why this behavior didn't occur with
>>>>> previously released versions.
>>>>> 
>>>>> I think the fix is a check in EdgeNGramTokenFilter when outputting the
>>>>> first term, that the position increment is greater than 0, and if it's not,
>>>>> then it should be set to 1.
>>>>> 
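The proposed check can be sketched in the same toy style. This is a Python model under stated assumptions, not the patch attached to LUCENE-4810: `edge_ngrams_guarded` is a hypothetical name, the maximum gram size of 10 is assumed, and "the first term" is read here as the first token the filter emits for the whole stream.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def edge_ngrams_guarded(tokens: List[Token], min_gram: int = 3,
                        max_gram: int = 10) -> List[Token]:
    """Toy edge-ngram model with the guard proposed in the thread."""
    out: List[Token] = []
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            continue  # too short: no ngrams are produced for this term
        sizes = range(min_gram, min(max_gram, len(term)) + 1)
        for i, size in enumerate(sizes):
            inc = pos_inc if i == 0 else 0
            if not out and inc <= 0:
                # Guard: the very first emitted token must advance the position
                inc = 1
            out.append((term[:size], inc))
    return out


# Same stems as in Karol's case: "to" (+1), "tom" (+0), "tona" (+0)
fixed = edge_ngrams_guarded([("to", 1), ("tom", 0), ("tona", 0)])
print(fixed)

# A stream whose first term is long enough passes through unchanged
normal = edge_ngrams_guarded([("hello", 1)])
print(normal)
```

With the guard, "tom" is emitted with an increment of 1 while the ngrams of "tona" stay stacked at the same position; streams whose first term already has a positive increment are unaffected.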
>>>>> Does anybody know if this could also be an issue for other filters?
>>>>> 
>>>>> I'll work on a patch for EdgeNGramTokenFilter.
>>>>> 
>>>>> Steve
>>>>> 
>>>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <karol.sikora@laboratorium.ee>
>>>>> wrote:
>>>>> 
>>>>>> hi,
>>>>>> 
>>>>>> I extracted a minimal failing example; the Solr configs (schema,
>>>>>> solrconfig.xml) and data are in the attached archive.
>>>>>> I try to import these simple documents:
>>>>>> [
>>>>>>  {
>>>>>>      "publisher": [
>>>>>>          "T. Gl\u00fccksberg"
>>>>>>      ],
>>>>>>      "uid": "1000881"
>>>>>>  },
>>>>>>  {
>>>>>>      "publisher": [
>>>>>>          "Ala a kota"
>>>>>>      ],
>>>>>>      "uid": "1000894"
>>>>>>  }
>>>>>> ]
>>>>>> The first fails on the copyField destination publisher_hl with an exception
>>>>>> (trace: https://gist.github.com/anonymous/5429558); the second is added
>>>>>> without any problems.
>>>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>>>>> 
>>>>>> Anyone trying to reproduce this behaviour should remember to copy
>>>>>> the libs for the Morfologik and ICU filters.
>>>>>> 
>>>>>> This extracted example works fine with solr 4.0 - 4.2.1.
>>>>>> 
>>>>>> Regards,
>>>>>> Karol
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>>>>>> 
>>>>>>> hey karol,
>>>>>>> 
>>>>>>> can you reproduce this behaviour in a small test-case (curl command or
>>>>>>> something like this) that we can reproduce?
>>>>>>> 
>>>>>>> @solr guys any idea what this could be?
>>>>>>> 
>>>>>>> simon
>>>>>>> 
>>>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora
>>>>>>> <karol.sikora@laboratorium.ee> wrote:
>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> I have a problem with Solr 4.3 RC2 on the test data for a search
>>>>>>>> application which I'm developing.
>>>>>>>> A lot of imported records fail with the exception
>>>>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>>>>>> (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
>>>>>>>> successfully, so I think that something is broken in the new release.
>>>>>>>> I'll try to examine tomorrow what is broken.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Karol
>>>>>>>> 
>>>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Here is the RC:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> happy voting...
>>>>>>>>>> 
>>>>>>>>>> here is my +1
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>>>>> 
>>>>>>>>> +1 !
>>>>>>>>> 
>>>>>>>>> Andi..
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail:
>>>>>>>>> dev-unsubscribe@lucene.apache.org
>>>>>>>>> 
>>>>>>>>> For additional commands, e-mail:
>>>>>>>>> dev-help@lucene.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> --
>>>>>>>> Karol Sikora
>>>>>>>> +48 781 493 788
>>>>>>>> 
>>>>>>>> Laboratorium EE
>>>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>>>> 
>>>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Karol Sikora
>>>>>> IT Project Manager, CBN - Interfejs 2.0
>>>>>> +48 781 493 788
>>>>>> 
>>>>>> Laboratorium EE
>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>> 
>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 
> 

--
Walter Underwood
wunder@wunderwood.org



