lucene-dev mailing list archives

From "Ajay Kanduru (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2845) Adding extra highlighting term to a synonym
Date Wed, 19 Oct 2011 15:19:10 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ajay Kanduru updated SOLR-2845:
-------------------------------

    Description: 
I am seeing strange highlighting behaviour when highlighting a synonym term in the 3.4.0
release. The same setup works fine in 1.4.1. Using the Solr example core, here are the steps
to reproduce the problem.

1) In *schema.xml*, change the text_general fieldType definition to use the synonym filter at index
time and remove the filter from query analysis.
{code:xml}
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

   
2) Define a new field 'test_field1'.
{code:xml}
  <field name="test_field1" type="text_general" indexed="true" stored="true" multiValued="true"/>
{code}

3) Copy this to 'text' field.
{code:xml}
  <copyField source="test_field1" dest="text"/>
{code}

4) In *exampledocs/ipod_video.xml*, add a new field to the doc.
{code:xml}
  <field name="test_field1">Heart Failure</field>
{code}

5) In *solr/conf/index_synonyms.txt*, add the following line (all on one line).
{noformat}
heart failure, failure\, heart, cardiac failure, cardiac insufficiency, failure heart, failure\, cardiac, heart failure (nos), insufficiency cardiac, insufficiency\, cardiac, hf - heart failure
{noformat}



6) Reindex the exampledocs/*.xml files and request the following URL.

  http://localhost:8983/solr/select?q=heart&indent=on&hl=on&hl.fl=*
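
For anyone scripting the reproduction, the same request can be built programmatically. This is only a minimal sketch in Python: the helper name `build_highlight_url` is hypothetical and not part of Solr, and the host, port and parameters are simply the ones from the URL above.

```python
from urllib.parse import urlencode


def build_highlight_url(base="http://localhost:8983/solr/select", query="heart"):
    """Build the select URL from step 6, with highlighting enabled on all fields.

    build_highlight_url is a hypothetical helper for this reproduction only.
    """
    params = [("q", query), ("indent", "on"), ("hl", "on"), ("hl.fl", "*")]
    # safe="*" keeps the hl.fl wildcard literal instead of encoding it as %2A
    return base + "?" + urlencode(params, safe="*")


if __name__ == "__main__":
    print(build_highlight_url())
```

The response can then be fetched with any HTTP client against a running example core.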

This is what I get in the highlighting section of the response.
{code:xml}
  <lst name="highlighting">
    <lst name="MA147LL/A">
      <arr name="test_field1">
        <str>&lt;em&gt;Heart&lt;/em&gt;&lt;em&gt;Heart Failure&lt;/em&gt;</str>
      </arr>
    </lst>
  </lst>
{code}

The actual value of the field is *Heart Failure*. In the highlighted output it is changed to *Heart**Heart Failure*.
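
One way to detect this kind of corruption automatically is to strip the highlight tags from each snippet and check that what remains still occurs in the stored value. The following is a rough sketch in Python, not anything Solr provides: the function names are hypothetical, the sample response is the one shown above, and it assumes the default `<em>` highlight tags.

```python
import re
import xml.etree.ElementTree as ET

# Highlighting section copied from the response above.
RESPONSE = """<lst name="highlighting">
  <lst name="MA147LL/A">
    <arr name="test_field1">
      <str>&lt;em&gt;Heart&lt;/em&gt;&lt;em&gt;Heart Failure&lt;/em&gt;</str>
    </arr>
  </lst>
</lst>"""


def highlighted_snippets(xml_text):
    """Return the text of every <str> snippet in the highlighting section."""
    root = ET.fromstring(xml_text)
    # ElementTree unescapes &lt; and &gt;, so the text holds literal <em> tags.
    return [s.text for s in root.iter("str")]


def strip_em(snippet):
    """Remove the <em> and </em> highlight tags (assumed default tags)."""
    return re.sub(r"</?em>", "", snippet)


def is_consistent(snippet, stored_value):
    """A snippet is consistent if, without its tags, it occurs in the stored value."""
    return strip_em(snippet) in stored_value


if __name__ == "__main__":
    for snippet in highlighted_snippets(RESPONSE):
        print(repr(strip_em(snippet)), is_consistent(snippet, "Heart Failure"))
```

The substring check is used rather than equality because a highlighter may return only a fragment of a longer field.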

Apparently the synonym entries have something to do with the problem. The synonym terms above
are the minimal subset of a larger line that still reproduces the problem. Notice that there
is a hyphen in the last term. If I remove the hyphen, it works, even with a larger line of entries.
Keeping the hyphen and removing *insufficiency\, cardiac* also works. So the length of the
line and the hyphen both seem to be at play here.
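
To make the entries easier to inspect, the synonym line can be split into its individual terms. The sketch below is a Python approximation of the comma and backslash-escape rules of the synonyms file, not Solr's actual parser; `split_synonyms` is a hypothetical helper written for this report.

```python
def split_synonyms(line):
    """Split a synonyms line on commas, treating backslash-comma as a literal comma.

    This only approximates the escaping rules; it is not Solr's parser.
    """
    entries, current = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])  # keep the escaped character itself
            i += 2
            continue
        if ch == ",":
            entries.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    entries.append("".join(current).strip())
    return entries


# The synonym line from step 5, as one logical line.
LINE = (
    r"heart failure, failure\, heart, cardiac failure, cardiac insufficiency, "
    r"failure heart, failure\, cardiac, heart failure (nos), insufficiency cardiac, "
    r"insufficiency\, cardiac, hf - heart failure"
)

entries = split_synonyms(LINE)
with_hyphen = [e for e in entries if "-" in e]

if __name__ == "__main__":
    for entry in entries:
        print(repr(entry))
    print("entries containing a hyphen:", with_hyphen)
```

Listing the entries this way makes it simple to drop one term at a time and re-run the reproduction to narrow down which combination triggers the problem.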

Using large and complicated synonyms is very important to our application. The 3.4 release
announced some major improvements to the memory footprint and performance of the synonym filter.
For this reason we are eager to move to 3.4.0, but this problem is a showstopper for us.
I would appreciate any suggestions for a workaround or a quick fix.

Regards,
-Ajay

  was:
I notice a strange highlighting behaviour while highlighting a synonym term. It is in 3.4.0
release. This is working fine in 1.4.1. Using solr example core, here are the steps to reproduce
the problem. 

1) In *schema.xml*, change text_general fieldtype definition to use synonym filter at index
time and remove the filter from query analysis.
{code:xml}
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

   
2) Define a new field 'test_field1'.
{code:xml}
  <field name="test_field1" type="text_general" indexed="true" stored="true" multiValued="true"/>
{code}

3) Copy this to 'text' field.
{code:xml}
  <copyField source="test_field1" dest="text"/>
{code}

4) In *exampledocs/ipod_video.xml*, add a new field to the doc.
{code:xml}
  <field name="test_field1">Heart Failure</field>
{code}

5) In *solr/conf/index_synonyms.txt:*, add the following line (all in one line).
{noformat}
heart failure, failure\, heart, cardiac failure, cardiac insufficiency, failure heart, failure\, cardiac, heart failure (nos), insufficiency cardiac, insufficiency\, cardiac, hf - heart failure
{noformat}



6) Reindex exampledocs/*xml files and run the following URL.

  http://localhost:8983/solr/select?q=heart&indent=on&hl=on&hl.fl=*

This is what I get from highlighting tag.
{code:xml}
  <lst name="highlighting">
    <lst name="MA147LL/A">
      <arr name="test_field1">
        <str>&lt;em&gt;Heart&lt;/em&gt;&lt;em&gt;Heart Failure&lt;/em&gt;</str>
      </arr>
    </lst>
  </lst>
{code}

The actual value of the field is *Heart Failure*. It is changed to *Heart**Heart Failure*.

Apparently the synonym entries has something to do with the problem. The above synonym terms
are the minimum extraction from a larger line to reproduce the problem. Notice that there
is a hyphen in the last term. If I remove the hyphen, it works, even with larger line of entries.
Keeping the hyphen, and removing *insufficiency\, cardiac*, also works. So the length of the
line and hyphen both seem at play here.

Using large and complicated synonyms is very important to our application. 3.4 has some mojor
improvements to memory foot print and performance for synonym filter. For this reason we are
eager to move to 3.4.0, but this problem is a show stopper for us. I will appreciate any suggestions
for a work around or a quick fix to the problem.

Regards,
-Ajay

    
> Adding extra highlighting term to a synonym
> -------------------------------------------
>
>                 Key: SOLR-2845
>                 URL: https://issues.apache.org/jira/browse/SOLR-2845
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 3.4
>         Environment: Solr release: 3.4.0
> JVM:
> java version "1.6.0_16"
> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
> OS: 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ajay Kanduru
>             Fix For: 3.4
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

