lucene-solr-user mailing list archives

From "Bruno Mannina" <bmann...@free.fr>
Subject RE: Schema.xml, copyField, Slash, ignoreCase ?
Date Mon, 14 Jan 2019 08:33:38 GMT
Hi Steve,

Many thanks for this field type; I will test it this afternoon on my dev server.

Thanks also for your explanation!

Have a nice day!

Bruno

-----Original Message-----
From: Steve Rowe [mailto:sarowe@gmail.com]
Sent: Friday, 11 January 2019 17:43
To: solr-user@lucene.apache.org
Subject: Re: Schema.xml, copyField, Slash, ignoreCase ?

Hi Bruno,

ignoreCase: It looks like you have already achieved this?  All three of your field types
include LowerCaseFilterFactory.

auto truncation: This is caused by the inclusion of PorterStemFilterFactory in your "text_en"
field type, which is why title:airbag also matches "airbags".  If you don't want its effects
(i.e. treating different forms of the same word interchangeably), remove the filter from both
the index and query analyzers and reindex.

process slash char: I think you want the slash to be included in symbol terms rather than
interpreted as a term separator.  One way to achieve this is to convert the slash, before
tokenization, to a string that does not include a term separator, and then, after
tokenization, to convert the substituted string back to a slash.

Here's a version of your text_en that does this.  PatternReplaceCharFilterFactory[1] converts
slashes inside symbol-ish terms to "_", a string that is unlikely to occur otherwise and that
StandardTokenizer will not interpret as a term separator; PatternReplaceFilterFactory[1] then
converts "_" back to a slash.  The pattern is a guess based on the symbol text you've provided,
so you'll likely need to adjust it.  Note that the two patterns are slightly different: the
*char filter* is given the entire field text as input, so its pattern uses \b word boundaries,
while the *filter* is given the text of a single term, so its pattern is anchored with ^ and $.

-----
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" 
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
-----
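
The two patterns can be sanity-checked outside Solr before reindexing.  Below is a minimal
Python sketch of the round trip, not what Solr actually runs: crude_tokenize() is a
hypothetical stand-in for StandardTokenizer that simply splits on anything other than
letters, digits and underscores, and the stop word, possessive and stemming filters are
left out.  The regular expressions are copied from the field type above, with Solr's
"$1_$2" replacement written as \g<1>_\g<2> for Python's re module.

```python
import re

# Patterns copied from the fieldType above.
CHAR_FILTER = re.compile(r"\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b")
TOKEN_FILTER = re.compile(r"^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$")


def crude_tokenize(text):
    # Assumption: rough approximation of StandardTokenizer. It keeps runs of
    # letters, digits and underscores together and splits on everything else.
    return re.findall(r"\w+", text)


def analyze(text, protect_slash=True):
    if protect_slash:
        # Char filter stage: runs on the whole field text, before tokenization.
        text = CHAR_FILTER.sub(r"\g<1>_\g<2>", text)
    tokens = crude_tokenize(text)
    if protect_slash:
        # Token filter stage: runs on each term separately, after tokenization.
        tokens = [TOKEN_FILTER.sub(r"\g<1>/\g<2>", t) for t in tokens]
    # LowerCaseFilterFactory stage.
    return [t.lower() for t in tokens]


print(analyze("b65D81/28", protect_slash=False))  # ['b65d81', '28']
print(analyze("b65D81/28"))                       # ['b65d81/28']
print(analyze("and/or"))                          # ['and', 'or']
```

Without the substitution the symbol splits into two terms, which matches the
"ti:b65d81 OR ti:28" seen in the debug output; with it, the symbol survives as one term,
while an ordinary slash such as "and/or" still separates terms because it does not match
the symbol pattern.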

[1] http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.4.pdf

--
Steve


> On Jan 11, 2019, at 4:18 AM, Bruno Mannina <bmannina@matheo-software.com> wrote:
> 
> I need to have a default “text” field with:
> 
> - ignoreCase,
> 
> - no auto truncation,
> 
> - process slash char
> 
> 
> 
> I would like to query only the field “text”.
> 
> Queries can contain:  code or keywords or both.
> 
> 
> 
> I have 2 fields named symbol and title, and 1 alias ti (old field that 
> I can’t delete or modify)
> 
> 
> 
> * Symbol contains a code with a slash (e.g. A62C21/02)
> 
> <field name="symbol" type="string_ci" multiValued="false" indexed="true"
> required="true" stored="true"/>
> 
> 
> 
> * Title contains English text and also symbol
> 
>    <field name="title" type="text_en" multiValued="true" indexed="true"
> stored="true" termVectors="true" termPositions="true" 
> termOffsets="true"/>
> 
> 
> 
> { "symbol": "B65D81/20",
> 
> "title": [
> 
> "under vacuum or superatmospheric pressure, or in a special 
> atmosphere, e.g. of inert gas  {(B65D81/28  takes precedence; 
> containers with pressurising means for maintaining ball pressure A63B39/025)} "
> 
> ]}
> 
> 
> 
> * Ti is an alias of title
> 
>    <field name="ti" type="text_general" multiValued="true" indexed="true"
> stored="true" termVectors="true" termPositions="true" 
> termOffsets="true"/>
> 
> 
> 
> * Text is
> 
> <field name="text" type="text_general" indexed="true" stored="false"
> multiValued="true"/>
> 
> 
> 
> - Aliases are:
> 
> 
> 
>    <copyField source="title"  dest="ti"/>
> 
>    <!-- ALIAS TEXT -->
> 
>    <copyField source="title"  dest="text"/>
> 
>    <copyField source="symbol" dest="text"/>
> 
> 
> 
> 
> 
> If I do these queries :
> 
> 
> 
> * ti:airbag               -> it’s ok
> 
> * title:airbag            -> not good for me because it found airbags
> 
> * ti:b65D81/28            -> not good, debug shows ti:b65d81 OR ti:28
> 
> * ti:“b65D81/28”          -> it’s ok
> 
> * symbol:b65D81/28        -> it’s ok (even without “ ”)
> 
> 
> 
> NOW with “text” field
> 
> * b65D81/28               -> not good, debug shows text:b65d81 OR text:28
> 
> * airbag                  -> it’s ok
> 
> * “b65D81/28”             -> it’s ok
> 
> 
> 
> It would be great if I could enter a symbol without “ ”
> 
> 
> 
> Could you help me to define a text field which solves this problem?
> (please find below all the definitions of my fields)
> 
> 
> 
> Many thanks for your help.
> 
> 
> 
> string_ci is my own definition:
> 
> 
> 
>    <fieldType name="string_ci" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
> 
>    <analyzer>
> 
>      <tokenizer class="solr.KeywordTokenizerFactory"/>
> 
>      <filter class="solr.LowerCaseFilterFactory"/>
> 
>    </analyzer>
> 
>    </fieldType>
> 
> 
> 
>    <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100" multiValued="true">
> 
>      <analyzer type="index">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>      </analyzer>
> 
>      <analyzer type="query">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> 
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>      </analyzer>
> 
>    </fieldType>
> 
> 
> 
>    <fieldType name="text_en" class="solr.TextField"
> positionIncrementGap="100">
> 
>      <analyzer type="index">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_en.txt"/>
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>        <filter class="solr.EnglishPossessiveFilterFactory"/>
> 
>        <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
> 
>        <filter class="solr.PorterStemFilterFactory"/>
> 
>      </analyzer>
> 
>      <analyzer type="query">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_en.txt"/>
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>        <filter class="solr.EnglishPossessiveFilterFactory"/>
> 
>       <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
> 
>        <filter class="solr.PorterStemFilterFactory"/>
> 
>      </analyzer>
> 
>    </fieldType>
> 
> 
> 
> 
> 
> Best Regards
> 
> Bruno
> 
> 
> 
> 
> 

