lucene-solr-dev mailing list archives

From "Andrew May (JIRA)" <j...@apache.org>
Subject [jira] Created: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory
Date Fri, 28 Jul 2006 20:44:13 GMT
Highlighting problems with HTMLStripWhitespaceTokenizerFactory
--------------------------------------------------------------

                 Key: SOLR-42
                 URL: http://issues.apache.org/jira/browse/SOLR-42
             Project: Solr
          Issue Type: Bug
          Components: update
            Reporter: Andrew May


Indexing content that contains HTML markup causes problems with highlighting when the
HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).

Example title field:

<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics
in a polyorogenic terrane of NW Iberia

When searching for title:fabrics with highlighting on, the <em> tags in the highlighted
version are in the wrong place: 22 characters to the left of where they should be (i.e. the
combined length of the four SUP tags that were stripped).
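
The 22-character shift can be reproduced without Solr. The sketch below is only an illustration: the class OffsetShiftSketch and its methods are hypothetical names, and the regex-based stripTags is a naive stand-in for the real HTML stripping, not Solr's code. It finds the offset of "fabrics" in the stripped text and then applies that offset to the original title, which is what a highlighter working on the stored value would do.

```java
class OffsetShiftSketch {
    /** Naive stand-in for HTML stripping: drops everything between '<' and '>'. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** Start offset of the first whitespace-delimited token equal to term, or -1. */
    static int tokenStart(String text, String term) {
        int pos = 0;
        while (pos < text.length()) {
            // Skip whitespace between tokens.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            int start = pos;
            // Consume one token.
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos > start && text.substring(start, pos).equals(term)) return start;
        }
        return -1;
    }

    /** Wraps text[start, start + len) in <em> tags, as a highlighter would. */
    static String highlight(String text, int start, int len) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, start + len) + "</em>"
                + text.substring(start + len);
    }
}
```

With the example title, tokenStart on the stripped text yields an offset 22 smaller than the position of "fabrics" in the original, so calling highlight on the original with that offset marks the wrong span.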

Response from Yonik on the solr-user mailing-list:

HTMLStripWhitespaceTokenizerFactory works in two phases...
HTMLStripReader removes the HTML and passes the result to
WhitespaceTokenizer... at that point, Tokens are generated, but the
offsets will correspond to the text after HTML removal, not before.

I did it this way so that HTMLStripReader could go before any
tokenizer (like StandardTokenizer).

Can you open a JIRA bug for this?  The fix would be a special version
of HTMLStripReader integrated with a WhitespaceTokenizer to keep
offsets correct. 
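
One way to picture the proposed fix is a single pass that skips tags while still counting their characters, so each token carries offsets into the original text. The following is a minimal sketch under that assumption, not the actual Solr or Lucene implementation: OffsetPreservingStripSketch and Tok are hypothetical names, and the tag handling is deliberately naive (no entities, comments, or attributes containing '>').

```java
import java.util.ArrayList;
import java.util.List;

class OffsetPreservingStripSketch {
    /** A token plus its [start, end) offsets in the ORIGINAL, unstripped text. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Strips tags and splits on whitespace in one pass, keeping original offsets. */
    static List<Tok> tokenize(String html) {
        List<Tok> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int start = -1;
        int end = -1;
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {                      // inside <...>: skip, but i still advances
                if (c == '>') inTag = false;
                continue;
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {  // whitespace closes the current token
                if (cur.length() > 0) {
                    out.add(new Tok(cur.toString(), start, end));
                    cur.setLength(0);
                }
                continue;
            }
            if (cur.length() == 0) start = i; // first visible character of the token
            cur.append(c);
            end = i + 1;                      // one past the last visible character
        }
        if (cur.length() > 0) out.add(new Tok(cur.toString(), start, end));
        return out;
    }
}
```

For the example title, the token "fabrics" produced this way carries offsets that point at "fabrics" in the original string, so a highlighter using them would place the <em> tags correctly.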

