tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-153) Allow passing of files or memory buffers to parsers
Date Tue, 13 Apr 2010 22:15:51 GMT

    [ https://issues.apache.org/jira/browse/TIKA-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856653#action_12856653 ]

Jukka Zitting commented on TIKA-153:

I have an idea on how to implement this...

The current Tika APIs are already pretty good, and I'd hate to complicate the clean Parser
interface with extra methods for different kinds of inputs. Instead I'm thinking of adding
a TikaInputStream utility class that extends InputStream with methods that allow accessing
the input document as a File.

The TikaInputStream class would have at least the following constructors:

    public TikaInputStream(InputStream stream) { ... }
    public TikaInputStream(File file) { ... }

And would in addition to the standard InputStream methods provide at least the following:

    public File getFile() { ... }

If the TikaInputStream instance was created from a normal InputStream, then the getFile()
method would automatically copy the stream into a temporary file that'll get removed when
the stream is closed.
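
A minimal sketch of the class described above might look like the following. The class
name TikaInputStreamSketch, the temp-file prefix, and the choice of FilterInputStream as
the base class are assumptions made for illustration; they are not part of the proposal.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only; names and details are assumptions.
class TikaInputStreamSketch extends FilterInputStream {

    private File file;          // backing file, once known
    private boolean temporary;  // true if this class created the file

    TikaInputStreamSketch(InputStream stream) {
        super(stream);
    }

    TikaInputStreamSketch(File file) throws IOException {
        super(new FileInputStream(file));
        this.file = file;
    }

    // Returns the backing file, spooling the stream to a temporary file
    // the first time it is called on a stream-backed instance.
    File getFile() throws IOException {
        if (file == null) {
            File tmp = File.createTempFile("tika-sketch-", ".tmp");
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            in.close();
            in = new FileInputStream(tmp);  // later reads come from the file
            file = tmp;
            temporary = true;
        }
        return file;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            // Only delete files this class created, never a caller's file.
            if (temporary && file != null) {
                file.delete();
            }
        }
    }
}
```

In this sketch, if some bytes were already read before getFile() is called, the temporary
file holds only the remaining bytes. A real implementation would need to decide how to
handle that case.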

The Tika facade would always pass TikaInputStreams to the underlying parsers, and we'd
recommend that downstream projects also use this class when accessing the Parser API
directly, but doing so would not be required. Instead, the TikaInputStream class would have
a static method like the following that our parsers could use to access the extra functionality:

    public static TikaInputStream getTikaInputStream(InputStream stream) {
        if (stream instanceof TikaInputStream) {
            return (TikaInputStream) stream;
        } else {
            return new TikaInputStream(stream);
        }
    }
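
To illustrate how a parser might use such a static helper while keeping a plain InputStream
signature, here is a hedged sketch. Everything in it is hypothetical: FileBackedStream is a
minimal stand-in for the proposed class, wrapIfNeeded mirrors the static method above, and
parse is an invented method that reads the last byte before the first to show random access.

```java
import java.io.File;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class ParserUsageSketch {

    // Minimal stand-in for the proposed class; only what this example needs.
    static class FileBackedStream extends FilterInputStream {
        private File file;

        FileBackedStream(InputStream stream) {
            super(stream);
        }

        // Reuse an existing wrapper, or wrap a plain stream.
        static FileBackedStream wrapIfNeeded(InputStream stream) {
            if (stream instanceof FileBackedStream) {
                return (FileBackedStream) stream;
            } else {
                return new FileBackedStream(stream);
            }
        }

        // Spools the stream to a temporary file on first use.
        File getFile() throws IOException {
            if (file == null) {
                file = File.createTempFile("parser-sketch-", ".tmp");
                Files.copy(in, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
            }
            return file;
        }

        @Override
        public void close() throws IOException {
            try {
                super.close();
            } finally {
                if (file != null) {
                    file.delete();
                }
            }
        }
    }

    // Hypothetical parser method; assumes a non-empty document.
    static String parse(InputStream stream) throws IOException {
        FileBackedStream tis = FileBackedStream.wrapIfNeeded(stream);
        try (RandomAccessFile raf = new RandomAccessFile(tis.getFile(), "r")) {
            raf.seek(raf.length() - 1);   // jump to the end first
            int last = raf.read();
            raf.seek(0);                  // then go back to the start
            int first = raf.read();
            return "" + (char) first + (char) last;
        } finally {
            // Simplification: if we created the wrapper, close it so the
            // temporary file is removed. This also closes the caller's stream.
            if (tis != stream) {
                tis.close();
            }
        }
    }
}
```

When the caller already passes a FileBackedStream, the parser leaves cleanup to the caller,
which matches the idea that the temporary file is removed when the stream is closed. Who
should clean up a wrapper that the parser created itself is left open here.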

> Allow passing of files or memory buffers to parsers
> ---------------------------------------------------
>                 Key: TIKA-153
>                 URL: https://issues.apache.org/jira/browse/TIKA-153
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
> Some of our parsers need to be able to go back and forth within a source document, so
> they need either a file or (for smaller documents) an in-memory buffer that contains the
> full document. Currently we use temporary files for such cases, which in some cases means
> doing an extra copy of a file before it gets parsed. We should come up with some way for
> clients to pass in a file or a memory buffer if one is available.
