lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: content disappears in the index
Date Mon, 12 Nov 2012 14:04:41 GMT
Yes, it is the second PatternReplaceFilterFactory.

The string "Arslanagic, Aida ; Siqveland, Elisabeth" is reduced to "a",
whereas the other strings become:
"Alexander, Kvam ; Bjørn, Nyland ; Bjørn, Reiten ; Øystein, Huse" --> "alexanderkvambj"
"Brennmoen, Ingar ; Hauklien, Øystein ; Hedalen, Trond ; Kvam, Erik" --> "brennmoeningarhauk"

Now this explains the sorting (shit in --> shit out).

But why is the first string reduced to "a"? Is the regular expression wrong?
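
A minimal sketch that reproduces the effect outside Solr, assuming the filter
applies the pattern with plain java.util.regex replaceAll semantics. After
lowercasing and stripping punctuation and spaces, the first string is 32 chars
long. The second group (.{31,}) needs at least 31 chars, so the greedy first
group has to backtrack from 30 chars down to 1, and "$1" keeps only "a". The
class name TruncateSortKey and the ALTERNATIVE pattern are hypothetical and not
from the thread; the alternative is only one possible way to cut at 30 chars.

```java
import java.util.regex.Pattern;

public class TruncateSortKey {
    // Pattern copied from the second PatternReplaceFilterFactory in the field type.
    static final Pattern ORIGINAL = Pattern.compile("(.{1,30})(.{31,})");

    // Hypothetical alternative (an assumption, not from the thread):
    // keep exactly the first 30 chars when the input is longer than 30.
    static final Pattern ALTERNATIVE = Pattern.compile("^(.{30}).+$");

    // Replace every match with the content of the first capture group.
    static String apply(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // First author string after lowercasing and removing ", ; " (32 chars).
        String key = "arslanagicaidasiqvelandelisabeth";

        System.out.println(key.length());            // 32
        System.out.println(apply(ORIGINAL, key));    // a
        System.out.println(apply(ALTERNATIVE, key)); // arslanagicaidasiqvelandelisabe
    }
}
```

With the original pattern the kept prefix is min(30, length - 31) chars, which
also matches the 15 and 18 chars seen for the other two strings (46 and 49
chars after cleanup). Inputs of 31 chars or fewer do not match and pass through
unchanged.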

Bernd



On 12.11.2012 14:51, Bernd Fehling wrote:
> The field type is derived from the distributed alphaOnlySort as follows:
> 
> <fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x9F\u2000-\u206F\uFEFF\uFFF9-\uFFFD])"
>                                                      replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})"
>                                                      replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
> 
> It reduces long lists of author names (100 and more authors) to the first 30 chars
> for sorting and removes some illegal chars to keep sorting with utf8 solid.
> 
> Don't see any problems there.
> 
> Will check with admin/analysis page.
> 
> Bernd
> 
> 
> On 12.11.2012 14:28, Erick Erickson wrote:
>> First, sorting on tokenized fields is undefined/unsupported. You _might_
>> get away with it if the author field always reduces to one token, i.e. if
>> you're always indexing only the last name.
>>
>> I should say unsupported/undefined when more than one token is the result
>> of analysis. You can do things like use the KeywordTokenizer followed by
>> transformations on the _entire_ input field (lowercasing is popular for
>> instance).
>>
>> So somehow the analysis chain you have defined for this field grabs
>> "Arslanagic"
>> and translates it into "a". Synonyms? Stemming? Some "interesting" sequence?
>>
>> The fastest way to look at that would be in Solr's admin/analysis page.
>> Just put Arslanagic into the index box and you should see which of the
>> steps does the translation. Although changing it to "a" is really weird,
>> it's almost certainly something you've defined in the indexing analysis
>> chain.
>>
>> FWIW,
>> Erick
>>
>>
>> On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling <
>> bernd.fehling@uni-bielefeld.de> wrote:
>>
>>> Hi list,
>>> a user reported wrong sorting of our search service running on solr.
>>> While chasing this issue I traced it back through lucene into the index.
>>> I have a text field for sorting
>>> (stored,indexed,tokenized,omitNorms,sortMissingLast)
>>> and three docs with author names.
>>>
>>> If I trace at org.apache.lucene.document.Document.add(IndexableField) while
>>> indexing I can see all three author names added as a field to each document.
>>>
>>> After searching with *:* for the three docs and doing a sort the sorting
>>> is wrong
>>> because one of the author names is reduced to the first char, all other
>>> chars are lost.
>>>
>>> So having the author names (Alexander, Arslanagic, Brennmoen) indexed,
>>> the result
>>> of sorting ascending is (Arslanagic, Alexander, Brennmoen) which is wrong.
>>> But this happens because the author "Arslanagic" is reduced to "a" during
>>> indexing (???)
>>> and if sorted "a" is before "alexander".
>>>
>>> Currently I use 4.0 but have the same issue with 3.6.1.
>>>
>>> Without tracing through tons of code:
>>> - which is the last breakpoint for debugging to see the docs right before
>>> they go into the index
>>> - which is the first breakpoint for debugging to see the docs coming right
>>> out of the index
>>>
>>> Regards
>>> Bernd
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
