lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: content disappears in the index
Date Tue, 13 Nov 2012 06:47:59 GMT
Hi Erik,

I like the fortune cookie :-)

I arrived at the same solution as you did, but with a short Java program that
tries different patterns, so trial and error ;-)
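
A trial-and-error program of that kind might look roughly like the sketch
below. It is not the program used in this thread: the class name PatternTrial
and the helper apply are made up for illustration, and String.replaceAll
(which uses java.util.regex) is assumed to stand in for the filter's
replace="all" behaviour. The input is the 32-character key that
"Arslanagic, Aida ; Siqveland, Elisabeth" becomes after lowercasing and
stripping punctuation and whitespace.

```java
// Hypothetical helper class for trying patterns; not part of Solr or Lucene.
class PatternTrial {

    // Replace every match of the pattern, as replace="all" is assumed to do.
    static String apply(String pattern, String replacement, String input) {
        return input.replaceAll(pattern, replacement);
    }

    public static void main(String[] args) {
        // 32 characters: arslanagic(10) + aida(4) + siqveland(9) + elisabeth(9)
        String key = "arslanagicaidasiqvelandelisabeth";

        String[] patterns = {
            "(.{1,30})(.{31,})", // original: group 2 demands at least 31 chars
            "(.{1,30}).*",       // keep up to 30 chars, drop the rest
            "^(.{1,30}).*"       // same, anchored at the start of the string
        };
        for (String p : patterns) {
            System.out.println(p + " -> " + apply(p, "$1", key));
        }
    }
}
```

With the original pattern the second group has to consume at least 31
characters, so for a 32-character key only one character is left for the
first group and the whole match is replaced by that single character. Keys
shorter than 32 characters do not match at all and pass through unchanged.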

This brings me to the question: is there now (with 4.0) any filter that does
this job for me? I took a look at LengthFilter, but it has a different purpose,
and TrimFilter is also meant for something else.
By the way, why does the TrimFilter option updateOffsets default to false?
Just to keep it backwards compatible?

Thanks for your help,
Bernd


Am 13.11.2012 02:16, schrieb Erick Erickson:
> Because your regex is wrong? (sorry, couldn't resist).
> 
> Regexes always give me indigestion. But if you look at your results, your
> regex isn't working in any case at all. The second group is being removed
> from the end of the string. I _think_ what's happening is that the longest
> possible string is being matched (which will usually be your second group).
> Then from what's left, your first group is being captured. If you look at
> what you have above, none of the matches is 31 characters long. I don't
> think you need the second group at all.
> 
> This works for me:
> <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30}).*"
>         replacement="$1" replace="all"/>
> 
> This pattern works too: pattern="^(.{1,30}).*"
> 
> But like I said, I'm no expert with regexes; I usually have to fumble
> around quite a bit to get what I want.
> 
> Found in a fortune cookie according to legend:
> "A programmer had a problem. He solved it with regular expressions. Now he
> has two problems".
> 
> 
> 
> 
> On Mon, Nov 12, 2012 at 9:04 AM, Bernd Fehling <
> bernd.fehling@uni-bielefeld.de> wrote:
> 
>> Yes, it is the second PatternReplaceFilterFactory.
>>
>> the String "Arslanagic, Aida ; Siqveland, Elisabeth" is reduced to "a",
>> whereas the other strings are:
>> "Alexander, Kvam ; Bjørn, Nyland ; Bjørn, Reiten ; Øystein, Huse" -->
>> "alexanderkvambj"
>> "Brennmoen, Ingar ; Hauklien, Øystein ; Hedalen, Trond ; Kvam, Erik" -->
>> "brennmoeningarhauk"
>>
>> Now this explains the sorting (shit in --> shit out).
>>
>> But why is the first string reduced to "a", wrong regular expression?
>>
>> Bernd
>>
>>
>>
>> Am 12.11.2012 14:51, schrieb Bernd Fehling:
>>> The field type is derived from the distributed alphaOnlySort as follows:
>>>
>>> <fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>>>   <analyzer>
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>     <filter class="solr.TrimFilterFactory" />
>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>             pattern="([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x9F\u2000-\u206F\uFEFF\uFFF9-\uFFFD])"
>>>             replacement="" replace="all"/>
>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>             pattern="(.{1,30})(.{31,})"
>>>             replacement="$1" replace="all"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> It reduces long lists of author names (100 and more authors) to the first 30 chars
>>> for sorting and removes some illegal chars to keep sorting with utf8 solid.
>>>
>>> Don't see any problems there.
>>>
>>> Will check with admin/analysis page.
>>>
>>> Bernd
>>>
>>>
>>> Am 12.11.2012 14:28, schrieb Erick Erickson:
>>>> First, sorting on tokenized fields is undefined/unsupported. You _might_
>>>> get away with it if the author field always reduces to one token, i.e. if
>>>> you're always indexing only the last name.
>>>>
>>>> I should say unsupported/undefined when more than one token is the result
>>>> of analysis. You can do things like use the KeywordTokenizer followed by
>>>> transformations on the _entire_ input field (lowercasing is popular for
>>>> instance).
>>>>
>>>> So somehow the analysis chain you have defined for this field grabs
>>>> "Arslanagic" and translates it into "a". Synonyms? Stemming? Some
>>>> "interesting" sequence?
>>>>
>>>> The fastest way to look at that would be in Solr's admin/analysis page.
>>>> Just put Arslanagic into the index box and you should see which of the
>>>> steps does the translation. Although changing it to "a" is really weird,
>>>> it's almost certainly something you've defined in the indexing analysis
>>>> chain.
>>>>
>>>> FWIW,
>>>> Erick
>>>>
>>>>
>>>> On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling <
>>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>>
>>>>> Hi list,
>>>>> a user reported wrong sorting in our search service running on Solr.
>>>>> While chasing this issue I traced it back through Lucene into the index.
>>>>> I have a text field for sorting
>>>>> (stored,indexed,tokenized,omitNorms,sortMissingLast)
>>>>> and three docs with author names.
>>>>>
>>>>> If I trace at org.apache.lucene.document.Document.add(IndexableField) while
>>>>> indexing, I can see all three author names added as a field to each document.
>>>>>
>>>>> After searching with *:* for the three docs and sorting, the order is wrong
>>>>> because one of the author names is reduced to its first char; all other
>>>>> chars are lost.
>>>>>
>>>>> So with the author names (Alexander, Arslanagic, Brennmoen) indexed,
>>>>> the result of sorting ascending is (Arslanagic, Alexander, Brennmoen), which
>>>>> is wrong. But this happens because the author "Arslanagic" is reduced to "a"
>>>>> during indexing (???), and when sorted "a" comes before "alexander".
>>>>>
>>>>> Currently I use 4.0 but have the same issue with 3.6.1.
>>>>>
>>>>> Without tracing through tons of code:
>>>>> - which is the last breakpoint for debugging to see the docs right before
>>>>>   they go into the index
>>>>> - which is the first breakpoint for debugging to see the docs coming right
>>>>>   out of the index
>>>>>
>>>>> Regards
>>>>> Bernd
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
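
The effect of the alphaOnlySortLim chain quoted above on the three author
lists from this thread can be approximated outside Solr. The sketch below is
only an assumption-laden stand-in: SortKeyChain, sortKey and sortByKey are
hypothetical names, String.toLowerCase, String.trim and String.replaceAll
are used in place of the real Lucene filters, and the whole field value is
treated as one token, as KeywordTokenizer would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the analyzer chain: plain JDK string operations,
// no Solr or Lucene classes.
class SortKeyChain {

    // Character class copied from the first PatternReplaceFilterFactory.
    static final String STRIP =
        "([\\x00-\\x2F\\x3A-\\x40\\x5B-\\x60\\x7B-\\x9F\\u2000-\\u206F\\uFEFF\\uFFF9-\\uFFFD])";

    // Truncation patterns: the one from the field type and the anchored fix.
    static final String ORIGINAL = "(.{1,30})(.{31,})";
    static final String FIXED = "^(.{1,30}).*";

    // The three author lists from the thread; o-slash is written as a Unicode
    // escape so the file compiles regardless of source encoding.
    static final List<String> AUTHORS = Arrays.asList(
        "Alexander, Kvam ; Bj\u00F8rn, Nyland ; Bj\u00F8rn, Reiten ; \u00D8ystein, Huse",
        "Arslanagic, Aida ; Siqveland, Elisabeth",
        "Brennmoen, Ingar ; Hauklien, \u00D8ystein ; Hedalen, Trond ; Kvam, Erik");

    // Apply the four steps in the same order as the field type.
    static String sortKey(String authors, String truncatePattern) {
        String s = authors.toLowerCase(Locale.ROOT); // approximates LowerCaseFilter
        s = s.trim();                                // approximates TrimFilter
        s = s.replaceAll(STRIP, "");                 // first pattern filter
        return s.replaceAll(truncatePattern, "$1");  // second pattern filter
    }

    // Return a copy of the list ordered ascending by the computed sort key.
    static List<String> sortByKey(List<String> authors, String truncatePattern) {
        List<String> sorted = new ArrayList<>(authors);
        sorted.sort(Comparator.comparing((String a) -> sortKey(a, truncatePattern)));
        return sorted;
    }

    public static void main(String[] args) {
        for (String pattern : new String[] {ORIGINAL, FIXED}) {
            System.out.println("pattern " + pattern);
            for (String a : sortByKey(AUTHORS, pattern)) {
                System.out.println("  " + sortKey(a, pattern) + "  <-  " + a);
            }
        }
    }
}
```

Under these assumptions the original pattern yields the keys "a",
"alexanderkvambj" and "brennmoeningarhauk", which matches what was reported
above and puts Arslanagic first. The anchored pattern keeps the first 30
characters of each key, so Alexander sorts before Arslanagic again.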


