[ https://issues.apache.org/jira/browse/JCR-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587611#action_12587611 ]
Alexander Klimetschek commented on JCR-1530:
--------------------------------------------
Hmm, IMHO it shouldn't be Jackrabbit's concern to handle such "details", especially as text
extraction from binary files is not a mandatory aspect of the JCR API.
What about using Apache Tika? It aims to collect the various extraction libraries and
self-built classes from across the Apache projects and to build a proper reusable framework
around them. It recently pushed out its first release. Jukka, you probably know more about
it - is it already useful for Jackrabbit? You mentioned in JCR-1290 that this could be a
task for Jackrabbit 1.5.
http://incubator.apache.org/tika/
> MsPowerPointTextExtractor does not extract from PPTs with € sign
> ----------------------------------------------------------------
>
> Key: JCR-1530
> URL: https://issues.apache.org/jira/browse/JCR-1530
> Project: Jackrabbit
> Issue Type: Bug
> Components: jackrabbit-text-extractors
> Affects Versions: 1.4
> Reporter: Dirk Feufel
>
> The MsPowerPointTextExtractor class has a problem reading PPTs that contain a € sign:
> all text following that sign is ignored. Perhaps the POI PowerPointExtractor should be
> used instead of parsing the data by hand. As a side effect, this would simplify the
> code. Extraction could be done as follows:
> public Reader extractText(InputStream stream, String type, String encoding) throws IOException {
>     try {
>         PowerPointExtractor extractor = new PowerPointExtractor(stream);
>         return new StringReader(extractor.getText(true, true));
>     } catch (RuntimeException e) {
>         logger.warn("Failed to extract PowerPoint text content", e);
>         return new StringReader("");
>     } finally {
>         try { stream.close(); } catch (IOException ignored) {}
>     }
> }
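
The error-handling shape of the snippet quoted above (delegate to a library extractor, fall back to an empty reader on a RuntimeException, always close the stream) can be tried without POI on the classpath. The following is only a minimal sketch under stated assumptions: the SafeExtraction class, the TextSource interface, and the readAll helper are hypothetical names introduced for illustration and are not part of Jackrabbit or POI. TextSource merely stands in for whatever extractor is actually used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration class; not part of Jackrabbit or POI.
final class SafeExtraction {

    /** Hypothetical stand-in for a library extractor such as POI's PowerPointExtractor. */
    interface TextSource {
        String getText(InputStream stream) throws IOException;
    }

    /**
     * Delegates extraction to the given source. A RuntimeException from the
     * source is reported and turned into an empty reader, and the stream is
     * closed in every case, mirroring the snippet quoted in the issue.
     */
    static Reader extractText(InputStream stream, TextSource source) throws IOException {
        try {
            return new StringReader(source.getText(stream));
        } catch (RuntimeException e) {
            // The quoted snippet uses logger.warn here; stderr keeps this sketch dependency-free.
            System.err.println("Failed to extract text content: " + e);
            return new StringReader("");
        } finally {
            try { stream.close(); } catch (IOException ignored) { }
        }
    }

    /** Small helper that drains a reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

With a source that returns text containing a € sign, the reader yields that text unchanged; with a source that throws a RuntimeException, the caller gets an empty reader instead of a failure. Note that an IOException from the source is still propagated, as in the quoted snippet.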