beam-commits mailing list archives

From "Sergey Beryozkin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (BEAM-2994) Refactor TikaIO
Date Wed, 27 Sep 2017 13:17:00 GMT
Sergey Beryozkin created BEAM-2994:
--------------------------------------

             Summary: Refactor TikaIO
                 Key: BEAM-2994
                 URL: https://issues.apache.org/jira/browse/BEAM-2994
             Project: Beam
          Issue Type: Task
          Components: sdk-java-extensions
    Affects Versions: 2.2.0
            Reporter: Sergey Beryozkin
            Assignee: Reuven Lax
             Fix For: 2.2.0


TikaIO is currently implemented as a BoundedSource with an asynchronous BoundedReader that
returns each document's text chunks as Strings. These chunks are eventually passed to the
pipeline functions unordered and not linked to their original documents.

It was decided in the recent beam-dev thread that TikaIO should initially support the cases
where only a single composite bean per file, capturing the file content, location (or name)
and metadata, flows to the pipeline. This avoids the need to implement TikaIO as a
BoundedSource/Reader.
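
As a rough illustration of the "single composite bean per file" idea, a minimal sketch
follows. The class name ParsedFile and its accessors are hypothetical, chosen only for this
example; they are not the TikaIO API, and the sketch deliberately avoids any Beam or Tika
dependency. Metadata is assumed here to be a simple String-to-String map.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/**
 * Hypothetical composite bean: one instance per parsed file.
 * Illustrative only; not the actual TikaIO type.
 */
final class ParsedFile {
  private final String location;               // file location (or name)
  private final String content;                // full extracted text of the file
  private final Map<String, String> metadata;  // assumed shape: simple key/value pairs

  ParsedFile(String location, String content, Map<String, String> metadata) {
    this.location = Objects.requireNonNull(location, "location");
    this.content = Objects.requireNonNull(content, "content");
    // Defensive copy so later changes to the caller's map do not leak into the bean.
    this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
  }

  String getLocation() {
    return location;
  }

  String getContent() {
    return content;
  }

  Map<String, String> getMetadata() {
    return metadata;
  }
}
```

Because each element carries its own location, content and metadata, downstream pipeline
functions always know which file a piece of text came from, which is what the chunk-based
approach could not guarantee.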

Enhancing TikaIO to support streaming of the content into pipelines may be considered
in the next phase, based on specific use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
