lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: content disappears in the index
Date Tue, 13 Nov 2012 01:16:17 GMT
Because your regex is wrong? (sorry, couldn't resist).

Regexes always give me indigestion. But if you look at your results, your
regex isn't working in any of the three cases. The second group is being
removed from the end of the string. I _think_ what's happening is that the
second group, which needs at least 31 characters, claims the tail of the
string, and the first group only captures whatever is left in front of it.
If you look at what you have above, none of the results is 30 characters
long. I don't think you need the second group at all.
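
To make the arithmetic concrete, here is a minimal sketch that imitates the
analyzer chain with plain java.util.regex. It is not Solr code: the class
SortKeyDemo and the method sortKey are made up for illustration, and the
sketch assumes each filter behaves like replaceAll on the single token that
KeywordTokenizer emits.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-in for the analyzer chain, built on plain java.util.regex.
// This is not Solr code; it assumes each filter behaves like replaceAll on the
// single token that KeywordTokenizer emits.
final class SortKeyDemo {
    // Character class copied from the first PatternReplaceFilterFactory.
    private static final Pattern STRIP = Pattern.compile(
        "[\\x00-\\x2F\\x3A-\\x40\\x5B-\\x60\\x7B-\\x9F\\u2000-\\u206F\\uFEFF\\uFFF9-\\uFFFD]");

    // The second pattern as posted: group 2 demands at least 31 characters.
    private static final Pattern TRUNCATE = Pattern.compile("(.{1,30})(.{31,})");

    static String sortKey(String authors) {
        // Lowercase, then trim, in the same order as the filters.
        String s = authors.toLowerCase(Locale.ROOT).trim();
        // Strip the listed punctuation, spaces, and control characters.
        s = STRIP.matcher(s).replaceAll("");
        // Replace each match with group 1 only.
        return TRUNCATE.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        // 32 chars after cleanup: group 1 can keep only 32 - 31 = 1 char.
        System.out.println(sortKey("Arslanagic, Aida ; Siqveland, Elisabeth"));
        // 46 chars after cleanup: group 1 keeps 46 - 31 = 15 chars.
        // Non-ASCII letters are written as Unicode escapes.
        System.out.println(sortKey(
            "Alexander, Kvam ; Bj\u00f8rn, Nyland ; Bj\u00f8rn, Reiten ; \u00d8ystein, Huse"));
        // 31 chars or fewer: the pattern cannot match, so nothing is cut.
        System.out.println(sortKey("Fehling, Bernd"));
    }
}
```

Under those assumptions this prints "a", "alexanderkvambj" and
"fehlingbernd". For a cleaned string of 32 to 60 characters the first group
ends up with length minus 31 characters, and the full 30 only survive once
the string has at least 61 characters.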

This works for me:
<filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30}).*"
        replacement="$1" replace="all"/>

This pattern works too: pattern="^(.{1,30}).*"
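
A quick way to check both patterns outside Solr is a sketch like the one
below. The class TruncateDemo and its methods are hypothetical names, and it
assumes the filter acts like Matcher.replaceAll on an already cleaned,
single-line token.

```java
import java.util.regex.Pattern;

// Hypothetical helper; assumes the filter acts like Matcher.replaceAll on the token.
final class TruncateDemo {
    // Keep up to 30 leading characters and let ".*" swallow the rest.
    private static final Pattern FIRST_30 = Pattern.compile("(.{1,30}).*");
    // Same idea, anchored at the start of the input.
    private static final Pattern FIRST_30_ANCHORED = Pattern.compile("^(.{1,30}).*");

    static String truncate(String s) {
        return FIRST_30.matcher(s).replaceAll("$1");
    }

    static String truncateAnchored(String s) {
        return FIRST_30_ANCHORED.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String cleaned = "arslanagicaidasiqvelandelisabeth"; // 32 chars
        System.out.println(truncate(cleaned));          // arslanagicaidasiqvelandelisabe
        System.out.println(truncateAnchored(cleaned));  // same 30 chars
        System.out.println(truncate("fehlingbernd"));   // shorter than 30, unchanged
    }
}
```

Note that "." does not match line terminators by default, so input
containing line breaks would need separate handling.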

But like I said, I'm no expert with regexes. I usually have to fumble
around quite a bit to get what I want.

Found in a fortune cookie according to legend:
"A programmer had a problem. He solved it with regular expressions. Now he
has two problems".




On Mon, Nov 12, 2012 at 9:04 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Yes, it is the second PatternReplaceFilterFactory.
>
> The string "Arslanagic, Aida ; Siqveland, Elisabeth" is reduced to "a",
> whereas the other strings become:
> "Alexander, Kvam ; Bjørn, Nyland ; Bjørn, Reiten ; Øystein, Huse" -->
> "alexanderkvambj"
> "Brennmoen, Ingar ; Hauklien, Øystein ; Hedalen, Trond ; Kvam, Erik" -->
> "brennmoeningarhauk"
>
> Now this explains the sorting (shit in --> shit out).
>
> But why is the first string reduced to "a"? Wrong regular expression?
>
> Bernd
>
>
>
> On 12.11.2012 14:51, Bernd Fehling wrote:
> > The field type is derived from the distributed alphaOnlySort as follows:
> >
> > <fieldType name="alphaOnlySortLim" class="solr.TextField"
> >            sortMissingLast="true" omitNorms="true">
> >   <analyzer>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.TrimFilterFactory"/>
> >     <filter class="solr.PatternReplaceFilterFactory"
> >             pattern="([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x9F\u2000-\u206F\uFEFF\uFFF9-\uFFFD])"
> >             replacement="" replace="all"/>
> >     <filter class="solr.PatternReplaceFilterFactory"
> >             pattern="(.{1,30})(.{31,})"
> >             replacement="$1" replace="all"/>
> >   </analyzer>
> > </fieldType>
> >
> > It reduces long lists of author names (100 and more authors) to the first
> > 30 chars for sorting and removes some illegal chars to keep sorting with
> > utf8 solid.
> >
> > Don't see any problems there.
> >
> > Will check with admin/analysis page.
> >
> > Bernd
> >
> >
> > On 12.11.2012 14:28, Erick Erickson wrote:
> >> First, sorting on tokenized fields is undefined/unsupported. You _might_
> >> get away with it if the author field always reduces to one token, i.e. if
> >> you're always indexing only the last name.
> >>
> >> I should say unsupported/undefined when more than one token is the result
> >> of analysis. You can do things like use the KeywordTokenizer followed by
> >> transformations on the _entire_ input field (lowercasing is popular, for
> >> instance).
> >>
> >> So somehow the analysis chain you have defined for this field grabs
> >> "Arslanagic" and translates it into "a". Synonyms? Stemming? Some
> >> "interesting" sequence?
> >>
> >> The fastest way to look at that would be in Solr's admin/analysis page.
> >> Just put Arslanagic into the index box and you should see which of the
> >> steps does the translation. Although changing it to "a" is really weird,
> >> it's almost certainly something you've defined in the indexing analysis
> >> chain.
> >>
> >> FWIW,
> >> Erick
> >>
> >>
> >> On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling <
> >> bernd.fehling@uni-bielefeld.de> wrote:
> >>
> >>> Hi list,
> >>> a user reported wrong sorting in our search service running on Solr.
> >>> While chasing this issue I traced it back through Lucene into the index.
> >>> I have a text field for sorting
> >>> (stored,indexed,tokenized,omitNorms,sortMissingLast)
> >>> and three docs with author names.
> >>>
> >>> If I trace at org.apache.lucene.document.Document.add(IndexableField)
> >>> while indexing, I can see all three author names added as a field to
> >>> each document.
> >>>
> >>> After searching with *:* for the three docs and doing a sort, the sorting
> >>> is wrong because one of the author names is reduced to its first char;
> >>> all other chars are lost.
> >>>
> >>> So with the author names (Alexander, Arslanagic, Brennmoen) indexed,
> >>> the result of sorting ascending is (Arslanagic, Alexander, Brennmoen),
> >>> which is wrong. But this happens because the author "Arslanagic" is
> >>> reduced to "a" during indexing (???), and when sorted, "a" comes before
> >>> "alexander".
> >>>
> >>> Currently I use 4.0 but have the same issue with 3.6.1.
> >>>
> >>> Without tracing through tons of code:
> >>> - which is the last breakpoint for debugging to see the docs right before
> >>>   they go into the index
> >>> - which is the first breakpoint for debugging to see the docs coming right
> >>>   out of the index
> >>>
> >>> Regards
> >>> Bernd
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>
> >>>
> >>
> >
>
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
>
>
