jackrabbit-dev mailing list archives

From "Danilo Barboza (JIRA)" <j...@apache.org>
Subject [jira] Created: (JCR-1727) HTMLTextExtractor modifying UTF-8 encoded String
Date Thu, 28 Aug 2008 14:10:45 GMT
HTMLTextExtractor modifying UTF-8 encoded String

                 Key: JCR-1727
                 URL: https://issues.apache.org/jira/browse/JCR-1727
             Project: Jackrabbit
          Issue Type: Bug
          Components: jackrabbit-text-extractors
    Affects Versions: 1.4
         Environment: JDK 1.5 passing -Dfile.encoding=UTF-8 to JVM
            Reporter: Danilo Barboza

Extracting text from a UTF-8 encoded HTML document corrupts non-ASCII characters
(such as á, é, ó, ã).

This causes incorrect search results, because Lucene uses this extractor to index content.

See attachments for an example of the problem.
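
For illustration only (not taken from the attachments or from the Jackrabbit source): one common way this kind of corruption arises is that UTF-8 bytes get decoded with a different charset. The sketch below assumes such a mismatch, which may or may not be what happens inside HTMLTextExtractor. The class and method names are hypothetical, and it uses java.nio.charset.StandardCharsets, which requires Java 7 or later rather than the JDK 1.5 listed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jackrabbit.
public class EncodingMismatchDemo {

    // Decode raw bytes using an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The characters from the report: á é ó ã
        String original = "\u00e1\u00e9\u00f3\u00e3";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips cleanly.
        String correct = decode(utf8Bytes, StandardCharsets.UTF_8);

        // Decoding the same bytes as ISO-8859-1 turns each two-byte
        // sequence into two unrelated characters.
        String garbled = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("round trip intact: " + correct.equals(original));
        System.out.println("mismatch intact:   " + garbled.equals(original));
        System.out.println("garbled length:    " + garbled.length());
    }
}
```

If a mismatch like this is the cause, text indexed from the garbled string would not match a query typed with the real accented characters, which is consistent with the search problem described above.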

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
