lucene-solr-dev mailing list archives

From "Anders Melchiorsen (JIRA)" <j...@apache.org>
Subject [jira] Created: (SOLR-1394) HTML stripper is splitting tokens
Date Sat, 29 Aug 2009 14:25:32 GMT
HTML stripper is splitting tokens
---------------------------------

                 Key: SOLR-1394
                 URL: https://issues.apache.org/jira/browse/SOLR-1394
             Project: Solr
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 1.4
            Reporter: Anders Melchiorsen


I am having problems with the Solr HTML stripper.

After some investigation, I have found the cause to be that the
stripper is replacing the removed HTML with spaces. This obviously
breaks when the HTML is in the middle of a word, like "G&uuml;nther".

So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.
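
To illustrate the difference between the two approaches, here is a minimal
sketch. It is not the Solr stripper and not the attached patch; the class
StripSketch, its methods, and the tiny entity table are hypothetical names
made up for this example. With pad=true it roughly models the
space-replacing behaviour described above; with pad=false it emits nothing
for removed markup and instead records, for every output character, the
offset it came from in the original input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Solr code: strips simple tags and a few entities.
final class StripSketch {

    /** Stripped text plus, per output char, its offset in the original input. */
    static final class Result {
        final String text;
        final int[] originalOffsets;

        Result(String text, int[] originalOffsets) {
            this.text = text;
            this.originalOffsets = originalOffsets;
        }
    }

    /** Tiny illustrative entity table (assumption: only these four names). */
    static String decodeEntity(String name) {
        switch (name) {
            case "uuml": return "\u00fc";
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            default:     return null;
        }
    }

    /**
     * pad == true : fill removed markup with spaces so lengths stay equal.
     * pad == false: emit nothing for removed markup and rely on the
     *               recorded offsets to map back to the input.
     */
    static Result strip(String in, boolean pad) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int end = -1;        // index just past the markup, -1 if none
            String decoded = ""; // replacement text for an entity
            if (c == '<') {
                int close = in.indexOf('>', i);
                if (close >= 0) end = close + 1;
            } else if (c == '&') {
                int semi = in.indexOf(';', i);
                if (semi >= 0) {
                    String d = decodeEntity(in.substring(i + 1, semi));
                    if (d != null) { decoded = d; end = semi + 1; }
                }
            }
            if (end < 0) {       // ordinary character: copy it through
                out.append(c);
                offsets.add(i);
                i++;
                continue;
            }
            // Markup spans [i, end): emit the decoded text, if any.
            for (int k = 0; k < decoded.length(); k++) {
                out.append(decoded.charAt(k));
                offsets.add(i);
            }
            if (pad) {           // space-filling variant
                for (int k = decoded.length(); k < end - i; k++) {
                    out.append(' ');
                    offsets.add(i + k);
                }
            }
            i = end;
        }
        int[] map = offsets.stream().mapToInt(Integer::intValue).toArray();
        return new Result(out.toString(), map);
    }

    public static void main(String[] args) {
        String input = "G&uuml;nther";
        System.out.println("[" + strip(input, true).text + "]");
        System.out.println("[" + strip(input, false).text + "]");
    }
}
```

A whitespace tokenizer would see two tokens in the padded output and one
token in the unpadded output, while originalOffsets still lets a
highlighter point back into the raw HTML.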

That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.
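
The idea behind that last fix can be sketched as follows. This is a
hypothetical wrapper, not the actual stripper code: only the name numRead
comes from the description above, while CountingReader, savedNumRead, and
the method names are assumptions for illustration. The point is that the
consumed-character counter has to be saved at mark() and restored at
reset(), otherwise characters that are re-read after a rollback get
counted twice and every later offset drifts.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: a reader that keeps its character count in sync
// with the underlying stream across mark()/reset().
final class CountingReader {
    private final Reader in;
    private int numRead;       // characters consumed so far
    private int savedNumRead;  // value of numRead at the last mark()

    CountingReader(Reader in) {
        this.in = in;
    }

    /** Reads one char and counts it; returns -1 at end of input. */
    int read() {
        try {
            int ch = in.read();
            if (ch != -1) numRead++;
            return ch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Remembers both the stream position and the counter. */
    void mark(int readAheadLimit) {
        try {
            in.mark(readAheadLimit);
            savedNumRead = numRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rolls back the stream and restores the counter to match. */
    void reset() {
        try {
            in.reset();
            numRead = savedNumRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int numRead() {
        return numRead;
    }

    public static void main(String[] args) {
        CountingReader r = new CountingReader(new StringReader("a<b"));
        r.read();          // consume 'a'
        r.mark(16);
        r.read();          // look ahead at '<'
        r.read();          // look ahead at 'b', decide it is not a tag
        r.reset();         // roll back to just after 'a'
        System.out.println(r.numRead());
    }
}
```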

At this point I stopped trying to break it, so there may still be more
problems, or I might have introduced some problem of my own. Anyway, I
have put the three patches at the bottom of this mail, in case
somebody wants to take this issue further.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

