lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Applying Tokenizers and Filters to CopyFields
Date Wed, 25 Mar 2015 20:27:17 GMT
Hi Martin,

fq means filter query. Maybe you want to use the qf (query fields) parameter of edismax?
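
As a rough illustration of the difference: a filter query needs a full query of its own, for example fq=original:Sprache, whereas qf only lists the fields that edismax should search. Below is a minimal sketch of setting this up as handler defaults in solrconfig.xml. The handler name /select and the choice of fields (taken from the schema quoted below) are assumptions, not something checked against your configuration:

```xml
<!-- Sketch only: handler name and field list are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended dismax query parser -->
    <str name="defType">edismax</str>
    <!-- Fields searched for the terms given in q -->
    <str name="qf">original stopwords_removed expanded</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request, e.g. defType=edismax&qf=expanded, to choose between the standard and the expanded search at query time.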



On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich <martin_wu@gmx.net> wrote:
Hi all, 

I am wondering what the process is for applying Tokenizers and Filters (as defined in the FieldType
definition) to field contents that result from CopyFields. To be more specific, in my Solr
instance, I would like to support query expansion by two means: removing stop words and adding
inflected word forms as synonyms.

To use a specific example, let’s say I have the following sentence to be indexed (from a
Wittgenstein manuscript): 

"Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."


This sentence will be indexed in a field called „original“ that is defined as follows:


<field name="original" type="text_original" indexed="true" stored="true" required="true"/>

    <fieldType name="text_windex_original" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>


Then, in order to create fields for the two types of query expansion, I have set up specific
fields for this: 

- one field where stopwords are removed both on the indexed content and the query. So, if
the user is searching for a phrase like „der Sprache“, Solr should still find the segment
above, because the determiners („der“ and „die“) are removed prior to indexing and
prior to querying, respectively. This field is defined as follows: 

<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="true" required="true"/>

    <fieldType name="text_stopwords_removed" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


- a second field where synonyms are added to the query so that more segments will be found.
For instance, if the user is searching for the plural form „Sprachen“, Solr should return
the segment above, due to this entry in the synonyms file: "Sprache,Sprach,Sprachen". This
field is defined as follows: 

<field name="expanded" type="text_multiplied" indexed="true" stored="true" required="true"/>

    <fieldType name="text_expanded" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Finally, to avoid having to specify three fields with identical content in the import documents,
I am defining the two fields for query expansion as copyFields: 

  <copyField source="original" dest="stopwords_removed"/>
  <copyField source="original" dest="expanded"/>
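
For reference, here is a minimal sketch of how copyField destinations are commonly declared, using the field names from this post. copyField hands the raw source value to the destination field, whose own analyzer chain then runs at index time, so a stored destination would only duplicate the original text. Declaring the destinations as stored="false" and not required is an assumption for illustration, not part of the setup described here. Note also that each type attribute has to name a fieldType that is actually declared, which is why the sketch uses text_expanded:

```xml
<!-- Sketch only: destinations are indexed with their own analyzers;
     stored="false" is an assumption, since the stored text would duplicate "original" -->
<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="false"/>
<field name="expanded" type="text_expanded" indexed="true" stored="false"/>

<!-- The raw value of "original" is copied, then analyzed per destination type -->
<copyField source="original" dest="stopwords_removed"/>
<copyField source="original" dest="expanded"/>
```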

Now, my expectation would be as follows: 
- during import, two temporary fields are created by copying content from the original field
- these two temporary fields are then pre-processed as per the definitions above
- the pre-processed version of the text is added to the index
- then, the user can search for „Sprache“, „sprache“, „Sprachen“ or „der Sprache“
and will always get the segment above as a matching result. 

However, what happens actually is that I get matches only for „Sprache“ and „sprache“.


The other thing that strikes me as odd is that when I restrict the search to one of the fields
only using the „fq“ parameter, I get no results. For instance:
http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

will return no matches. I would have expected that, using the fq parameter, the user can specify
what type of search (s)he would like to carry out: A standard search (field original) or an
expanded search (one of the other two fields). 

For debugging, I have checked the analysis and results seem ok (posted below). 
Apologies for the long post, but I am really a bit stuck here (even after doing a lot of reading
and googling). It is probably something simple that I am missing.
Thanks a lot in advance for any help. 

Cheers, 

Martin


ST   Was  zum  Wesen  der  Welt  gehört  kann  die  Sprache  nicht  ausdrücken
SF   Was  zum  Wesen       Welt  gehört  kann  die  Sprache  nicht  ausdrücken
LCF  was  zum  wesen       welt  gehört  kann  die  sprache  nicht  ausdrücken
