beam-commits mailing list archives

From "Sergey Beryozkin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component
Date Wed, 24 May 2017 10:37:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022676#comment-16022676 ]

Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:36 AM:
------------------------------------------------------------------

Apache Tika parsers report content via SAX events; see https://tika.apache.org/1.14/.

I'm implementing a TikaReader that adapts the sequence of SAX events to the streaming
BoundedReader API by using an internal ExecutorService and a ConcurrentLinkedQueue. Thus,
when the Beam thread comes in and calls start() and then advance(), it won't have to parse
the given file content immediately. A good number of Tika parsers can report the data in
chunks, so the proposed TikaReader implementation should be reasonably efficient.
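
To illustrate the adaptation idea only (this is not the TikaReader code), here is a minimal
sketch under stated assumptions. It uses the JDK's own SAX parser in place of a Tika parser,
a hypothetical class name SaxQueueReader, and a LinkedBlockingQueue with an end-of-parse
sentinel instead of the ConcurrentLinkedQueue mentioned above, so that the reader can block
rather than poll. It does not implement the actual Beam BoundedReader interface; it only
mirrors its start()/advance()/getCurrent() shape.

```java
import java.io.StringReader;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: turns push-style SAX callbacks into a pull-style reader. */
final class SaxQueueReader implements AutoCloseable {
  // Sentinel marking the end of the parse; compared by identity, never by value.
  private static final String END = new String("END");

  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final String xml;
  private volatile Exception failure;
  private String current;
  private boolean done;

  SaxQueueReader(String xml) {
    this.xml = xml;
  }

  /** Starts parsing on the background thread, then moves to the first chunk. */
  boolean start() {
    executor.execute(() -> {
      try {
        DefaultHandler handler = new DefaultHandler() {
          @Override
          public void characters(char[] ch, int offset, int length) {
            // Copy the chunk, since the parser may reuse its char[] buffer.
            queue.add(new String(ch, offset, length));
          }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
      } catch (Exception e) {
        failure = e;
      } finally {
        queue.add(END); // always signal completion, even after a failure
      }
    });
    return advance();
  }

  /** Blocks until the next chunk is available; returns false once the parse has ended. */
  boolean advance() {
    if (done) {
      return false;
    }
    try {
      String next = queue.take();
      if (next == END) {
        done = true;
        current = null;
        if (failure != null) {
          throw new IllegalStateException("parse failed", failure);
        }
        return false;
      }
      current = next;
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the parser", e);
    }
  }

  /** Returns the chunk that the last successful start() or advance() moved to. */
  String getCurrent() {
    if (current == null) {
      throw new NoSuchElementException("no current chunk");
    }
    return current;
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

A caller would loop with `for (boolean ok = r.start(); ok; ok = r.advance())` and read
`r.getCurrent()` on each pass. The parser may split text across several characters()
callbacks, so chunk boundaries should not be relied upon.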

Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that Tika parsers
need full control of the InputStream. However, should the PR be accepted, I would
definitely see some scope for reusing some of the currently private FileBasedSource/Reader
helpers, such as the composite reader that is used when multiple files are picked
up.

Right now I have reasonably good starting code, IMHO, with TikaInputTest covering reading
PDF, zipped PDF, ODT and two ODT files, with the content and, optionally, the parsed-out
metadata also being streamed.

Some of the code I copied from FileBasedSource might be suboptimal when applied to the Tika
case. I hope that, if the PR is eventually accepted, more improvements will follow with the
help of Tika experts.

Planning to work on creating a branch and PR soon, cheers




> Introduce Apache Tika Input component
> -------------------------------------
>
>                 Key: BEAM-2328
>                 URL: https://issues.apache.org/jira/browse/BEAM-2328
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-ideas
>            Reporter: Sergey Beryozkin
>            Assignee: Davor Bonaci
>             Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing a variety
of file formats. It is used in many projects, including Lucene and Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest to many users.
> PR is to follow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
