tika-dev mailing list archives

From "Arjohn Kampman (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-777) RTF parser incorrectly applies fonts to complete group
Date Tue, 08 Nov 2011 16:47:51 GMT

     [ https://issues.apache.org/jira/browse/TIKA-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arjohn Kampman updated TIKA-777:
--------------------------------

    Description: 
Tika's RTF parser processes the following RTF document incorrectly, applying the wrong character
encoding to the parsed characters:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fswiss\fcharset204 Arial;}
}
{\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
}

This document contains Russian characters (\f1), but Tika decodes them as Latin because of the
\f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes
buffer before processing directives such as these.

  was:
Tika's RTF parser processes the following rtf fragment incorrectly, applying the wrong character
encoding to the parsed characters:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fswiss\fcharset204 Arial;}
}
{\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
}

This document contains russian characters (\f1), but tika decodes these as latin due to the
\f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes
buffer before processing directives such as these.

    
> RTF parser incorrectly applies fonts to complete group
> ------------------------------------------------------
>
>                 Key: TIKA-777
>                 URL: https://issues.apache.org/jira/browse/TIKA-777
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Arjohn Kampman
>
> Tika's RTF parser processes the following RTF document incorrectly, applying the wrong
> character encoding to the parsed characters:
> {\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
> {\fonttbl
> {\f0\fswiss\fcharset0 Arial;}
> {\f1\fswiss\fcharset204 Arial;}
> }
> {\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
> }
> This document contains Russian characters (\f1), but Tika decodes them as Latin because
> of the \f0 directive at the end of the group. The RTF parser should probably flush its
> pendingBytes buffer before processing directives such as these.
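
The effect described above can be reproduced with a small stand-alone sketch. This is not Tika's code: the tokenizer, the `decode_group` function, and the `flush_on_font_change` switch are hypothetical names made up for illustration, and the mapping of \fcharset0 to cp1252 and \fcharset204 to cp1251 is an assumption about how the two font entries would be interpreted. The `pending` buffer plays the role that the report attributes to pendingBytes.

```python
import re

# Assumed mapping for the two fonts in the report's font table:
# \f0 has \fcharset0 (treated here as cp1252), \f1 has \fcharset204 (cp1251).
FONT_CODECS = {0: "cp1252", 1: "cp1251"}

# Body of the text group from the report, without the enclosing braces.
BODY = (r"\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 "
        r"\'ea\'eb\'e8\'e5\'ed\'f2!\f0")

# Toy tokenizer: a \'hh hex escape, a control word with an optional numeric
# parameter and optional delimiting space, or one plain character.
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\([a-z]+)(-?\d+)? ?|([^\\{}])")


def decode_group(body, font_codecs, default_font=0, flush_on_font_change=True):
    """Decode one RTF group body, buffering raw bytes until they are flushed."""
    codec = font_codecs[default_font]
    pending = bytearray()  # stand-in for the parser's pending byte buffer
    out = []
    for match in TOKEN.finditer(body):
        hex_byte, word, num, char = match.groups()
        if hex_byte is not None:
            pending.append(int(hex_byte, 16))
        elif word == "f" and num is not None:
            if flush_on_font_change:
                # Decode the buffered bytes with the font that was active
                # when they were read, before switching to the new font.
                out.append(pending.decode(codec))
                pending.clear()
            codec = font_codecs[int(num)]
        elif char is not None:
            pending.append(ord(char))  # ASCII text only in this sketch
        # Any other control word (for example \fs20) is ignored here.
    out.append(pending.decode(codec))
    return "".join(out)


fixed = decode_group(BODY, FONT_CODECS, flush_on_font_change=True)
buggy = decode_group(BODY, FONT_CODECS, flush_on_font_change=False)
print(fixed)
print(buggy)
```

With the flush enabled, the buffered bytes are decoded as cp1251 before \f0 takes effect and the Russian text survives. With it disabled, the trailing \f0 changes the codec first and the same bytes come out as Latin characters, which matches the symptom reported here.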

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
