lucene-solr-dev mailing list archives

From "Anders Melchiorsen (JIRA)"
Subject [jira] Created: (SOLR-1394) HTML stripper is splitting tokens
Date Sat, 29 Aug 2009 14:25:32 GMT
HTML stripper is splitting tokens

                 Key: SOLR-1394
             Project: Solr
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 1.4
            Reporter: Anders Melchiorsen

I am having problems with the Solr HTML stripper.

After some investigation, I have found the cause to be that the
stripper replaces the removed HTML with spaces. This breaks
tokenization when the markup falls in the middle of a word, as in
"G&uuml;nther": the padding spaces split the word into two tokens.
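
To illustrate the problem, here is a minimal sketch, not the actual
Solr code. It assumes a stripper that decodes an entity and then pads
with spaces so the output keeps the input's length. The class name,
the tiny entity table, and the whitespace tokenizer are all
hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpacePaddingDemo {
    // Hypothetical, deliberately tiny entity table ("\u00fc" is u-umlaut).
    private static final Map<String, String> ENTITIES =
            Map.of("uuml", "\u00fc", "amp", "&");
    // Matches either a tag or a named entity (group 1 = entity name).
    private static final Pattern MARKUP = Pattern.compile("<[^>]*>|&(\\w+);");

    // Strips markup but pads with spaces so the length stays the same.
    public static String stripWithPadding(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = MARKUP.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            String name = m.group(1);
            String decoded;
            if (name == null) {
                decoded = "";                      // a tag: nothing to emit
            } else if (ENTITIES.containsKey(name)) {
                decoded = ENTITIES.get(name);      // a known entity
            } else {
                decoded = m.group();               // unknown entity: keep as-is
            }
            out.append(decoded);
            // Pad so that later characters keep their original offsets.
            for (int i = decoded.length(); i < m.end() - m.start(); i++) {
                out.append(' ');
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    // Stand-in for a whitespace tokenizer.
    public static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String padded = stripWithPadding("G&uuml;nther");
        System.out.println("[" + padded + "]");
        System.out.println(whitespaceTokens(padded));
    }
}
```

With this sketch, the padding that follows the decoded character ends
up inside the word, so the tokenizer sees two tokens instead of one.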

So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.
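
The idea of offset correction can be sketched as follows. This is an
illustration under my own assumptions, not the patch itself: the
stripped text is emitted without padding, and a small table records how
many characters have been removed up to each output position, so that
token offsets can be mapped back to the original input. The class and
its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetCorrectionDemo {
    // Hypothetical, deliberately tiny entity table ("\u00fc" is u-umlaut).
    private static final Map<String, String> ENTITIES =
            Map.of("uuml", "\u00fc", "amp", "&");
    private static final Pattern MARKUP = Pattern.compile("<[^>]*>|&(\\w+);");

    public final String text;  // stripped output, no padding
    // Each entry is {outputOffset, cumulativeCharsRemoved}.
    private final List<int[]> corrections = new ArrayList<>();

    public OffsetCorrectionDemo(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = MARKUP.matcher(html);
        int last = 0;
        int diff = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            String name = m.group(1);
            String decoded;
            if (name == null) {
                decoded = "";                      // a tag
            } else if (ENTITIES.containsKey(name)) {
                decoded = ENTITIES.get(name);      // a known entity
            } else {
                decoded = m.group();               // unknown entity: keep as-is
            }
            out.append(decoded);
            int removed = (m.end() - m.start()) - decoded.length();
            if (removed > 0) {
                diff += removed;
                // From this output position on, add 'diff' to get input offsets.
                corrections.add(new int[] {out.length(), diff});
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        this.text = out.toString();
    }

    // Maps an offset in the stripped text back to the original input.
    public int correctOffset(int off) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] <= off) {
                diff = c[1];
            } else {
                break;
            }
        }
        return off + diff;
    }

    public static void main(String[] args) {
        OffsetCorrectionDemo d = new OffsetCorrectionDemo("G&uuml;nther");
        System.out.println(d.text);
        System.out.println(d.correctOffset(0) + ".." + d.correctOffset(d.text.length()));
    }
}
```

Here the word stays in one piece, and its start and end offsets still
map back to the full span of the original input. In this simplified
version, an end offset that sits directly before removed markup is
mapped to the position after that markup.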

That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.
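
The rollback fix can be pictured with a small sketch. It assumes a
reader that counts consumed characters in numRead and uses mark/reset
to back out of markup that turns out to be invalid. Everything except
the name numRead is hypothetical and not taken from the Solr source.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CountingRollbackReader {
    private final Reader in;
    private int numRead;       // characters consumed so far
    private int savedNumRead;  // value of numRead at the last save()

    public CountingRollbackReader(Reader in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("reader must support mark/reset");
        }
        this.in = in;
    }

    // Reads one character and counts it; returns -1 at end of input.
    public int next() {
        try {
            int ch = in.read();
            if (ch != -1) {
                numRead++;
            }
            return ch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Remembers both the stream position and the counter.
    public void save(int readAheadLimit) {
        try {
            in.mark(readAheadLimit);
            savedNumRead = numRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rolls back the stream and the counter together, so that offsets
    // computed from numRead stay consistent with the stream position.
    public void restore() {
        try {
            in.reset();
            numRead = savedNumRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int numRead() {
        return numRead;
    }

    public static void main(String[] args) {
        CountingRollbackReader r = new CountingRollbackReader(new StringReader("<b oops"));
        r.save(16);
        r.next();
        r.next();
        r.next();
        r.restore();  // not a valid tag after all: back out
        System.out.println(r.numRead());
    }
}
```

If restore() reset only the stream and left the counter alone, every
character read again after the rollback would be counted twice, which
is the kind of drift that produces wrong offsets.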

At this point I stopped trying to break it, so there may still be more
problems. Or I might have introduced some problem on my own. Anyway, I
have put the three patches at the bottom of this mail, in case
somebody wants to move along with this issue.
