lucene-solr-dev mailing list archives

From "Mike Klaas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory
Date Mon, 07 Jan 2008 20:14:34 GMT

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556676#action_12556676 ]

Mike Klaas commented on SOLR-42:
--------------------------------

> Of course, the real answer may be, as suggested earlier, to apply the HTMLStripReader before
> sending to Solr.

HTMLStripTokenizer currently breaks the tokenizer contract, so it seems like the real answer
is to fix the offsets.  I've glanced at the code, and it would be a significant amount of
work to make the current implementation adhere to this contract.  The main problem is that
no-one is really interested in doing this work.

> > What do you imagine the highlighter being able to do with that knowledge?

> My understanding from looking at the code is that the mismatch comes from line 298. In the call
> to Lucene's highlighter, we pass in the TokenStream, which has been stripped (or will
> be stripped if the HTMLStripReader is employed), and the value from the stored field (docTexts[0]).
> If docTexts[0] was stripped first, then I think the offsets would be the same, no? Of
> course, it would be really easy to test.

You know, this is incredibly hacky, but I think that it is a great idea.

-Mike
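
The strip-first idea quoted above can be tried outside Solr. The following is a minimal sketch in plain Java, not Solr or Lucene code: the class `OffsetShiftDemo`, its `strip` and `emphasize` helpers, and the regex standing in for HTMLStripReader are all assumptions made for illustration. It uses the example title from the issue report and compares highlighting against the raw stored value with highlighting against a stripped copy.

```java
import java.util.regex.Pattern;

public class OffsetShiftDemo {
    // Crude stand-in for HTMLStripReader (an assumption): drops anything that looks like a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Wraps text[start, end) in <em> tags, roughly what a highlighter does with token offsets.
    public static String emphasize(String text, int start, int end) {
        return text.substring(0, start) + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String stored = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
                + "fabrics in a polyorogenic terrane of NW Iberia";
        String stripped = strip(stored);
        String term = "fabrics";

        // Offsets as the tokenizer sees them: relative to the stripped text.
        int start = stripped.indexOf(term);
        int end = start + term.length();

        // Distance between where the term really is and where the offsets point.
        System.out.println("shift = " + (stored.indexOf(term) - start));

        // Stripped offsets applied to the raw stored value: tags land too far left.
        System.out.println(emphasize(stored, start, end));

        // Stripped offsets applied to a stripped copy: tags land on the term.
        System.out.println(emphasize(stripped, start, end));
    }
}
```

For this title the reported shift is 22, the combined length of the four tags, which matches the issue description. The trade-off of stripping the stored value first is that the highlighted fragment no longer contains the original markup.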



> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch,
>                      SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup causes problems with highlighting if the
> HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic
> fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em>
> tags in the wrong place: 22 characters to the left of where they should be (i.e. the sum
> of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
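
One way to picture "keeping offsets correct" is a stripper that remembers, for every character it emits, where that character sat in the original input. The sketch below is plain Java under stated assumptions: `OffsetTrackingStripper` and its methods are hypothetical names, it does not implement Lucene's Tokenizer API, and its tag detection is deliberately naive (no entities, comments, or script handling). It is not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetTrackingStripper {
    private final StringBuilder text = new StringBuilder();
    // originalOffsets.get(i) is the position in the input of stripped character i.
    private final List<Integer> originalOffsets = new ArrayList<>();

    public OffsetTrackingStripper(String html) {
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;               // naive: every '<' opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;              // tag closed, resume copying
            } else if (!inTag) {
                text.append(c);
                originalOffsets.add(i);
            }
        }
    }

    public String strippedText() {
        return text.toString();
    }

    // Maps an inclusive start offset in the stripped text back to the original.
    public int originalStart(int strippedStart) {
        return originalOffsets.get(strippedStart);
    }

    // Maps an exclusive end offset (must be > 0) in the stripped text back to the original.
    public int originalEnd(int strippedEnd) {
        return originalOffsets.get(strippedEnd - 1) + 1;
    }

    public static void main(String[] args) {
        String stored = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
                + "fabrics in a polyorogenic terrane of NW Iberia";
        OffsetTrackingStripper stripper = new OffsetTrackingStripper(stored);

        // A tokenizer working on the stripped text would report these offsets.
        int start = stripper.strippedText().indexOf("fabrics");
        int end = start + "fabrics".length();

        // Corrected offsets select the term in the original, unstripped value.
        System.out.println(stored.substring(stripper.originalStart(start),
                stripper.originalEnd(end)));
    }
}
```

With corrected offsets the stored value can stay unstripped, so the highlighted fragment keeps its markup. A per-character list is simple but memory-hungry; a real implementation would more likely record only the points where the cumulative shift changes.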

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

