jackrabbit-dev mailing list archives

From "Marcel Reutegger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (JCR-1530) MsPowerPointTextExtractor does not extract from PPTs with € sign
Date Fri, 11 Apr 2008 09:20:07 GMT

    [ https://issues.apache.org/jira/browse/JCR-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587895#action_12587895 ]

Marcel Reutegger commented on JCR-1530:
---------------------------------------

There are at least two minor issues with using Tika in Jackrabbit.

- Tika is still in incubation. I'd prefer to introduce a dependency on it only once it
has left incubation.
- Tika requires Java 1.5, whereas Jackrabbit currently is fine with 1.4.

We might want to provide an adapter that implements the Jackrabbit TextExtractor interface
and uses Tika to extract the text. Users can then decide whether they want to use it and
accept the Java 1.5 requirement that comes with it.
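
A minimal sketch of what such an adapter could look like. It is self-contained, so
the TextExtractor interface below is only a stand-in that mirrors the extractText
signature quoted in the issue (getContentTypes is assumed), and ExtractionBackend is
a hypothetical seam behind which Tika would sit. No actual Tika call is shown, and
the names DelegatingTextExtractor and ExtractionBackend are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

// Stand-in for the Jackrabbit TextExtractor interface (shape assumed).
interface TextExtractor {
    String[] getContentTypes();
    Reader extractText(InputStream stream, String type, String encoding) throws IOException;
}

// Hypothetical seam: the Java 1.5-only library (e.g. Tika) would live behind this.
interface ExtractionBackend {
    String parseToString(InputStream stream, String type) throws IOException;
}

// Adapter: exposes the backend through the TextExtractor contract.
class DelegatingTextExtractor implements TextExtractor {
    private final String[] contentTypes;
    private final ExtractionBackend backend;

    DelegatingTextExtractor(String[] contentTypes, ExtractionBackend backend) {
        this.contentTypes = contentTypes.clone();
        this.backend = backend;
    }

    public String[] getContentTypes() {
        return contentTypes.clone();
    }

    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        try {
            return new StringReader(backend.parseToString(stream, type));
        } catch (RuntimeException e) {
            // Same policy as the snippet in the issue: fall back to empty text.
            return new StringReader("");
        } finally {
            // The adapter takes ownership of the stream and always closes it.
            try { stream.close(); } catch (IOException ignored) {}
        }
    }
}
```

Because the core would only ever see the TextExtractor type, a class like this could
ship in a separate optional module, which keeps the Java 1.4 baseline intact for
everyone who does not configure it.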

> MsPowerPointTextExtractor does not extract from PPTs with € sign
> ----------------------------------------------------------------
>
>                 Key: JCR-1530
>                 URL: https://issues.apache.org/jira/browse/JCR-1530
>             Project: Jackrabbit
>          Issue Type: Bug
>          Components: jackrabbit-text-extractors
>    Affects Versions: 1.4
>            Reporter: Dirk Feufel
>
> The MsPowerPointTextExtractor class fails when reading PPTs that contain a € sign:
> all text following that sign is ignored. Perhaps the POI PowerPointExtractor
> should be used instead of parsing the data by hand. As a side effect, this would simplify
> the code. Extraction could be done as follows:
> 	public Reader extractText(InputStream stream, String type, String encoding) throws IOException {
> 		try {
> 			PowerPointExtractor extractor = new PowerPointExtractor(stream);
> 			// Extract both the slide text and the notes text.
> 			return new StringReader(extractor.getText(true, true));
> 		} catch (RuntimeException e) {
> 			// Fall back to empty content if the document cannot be parsed.
> 			logger.warn("Failed to extract PowerPoint text content", e);
> 			return new StringReader("");
> 		} finally {
> 			try { stream.close(); } catch (IOException ignored) {}
> 		}
> 	}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

