beam-commits mailing list archives

From "Sergey Beryozkin (JIRA)" <>
Subject [jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
Date Thu, 01 Jun 2017 11:41:04 GMT


Sergey Beryozkin commented on BEAM-2328:

I've added some TikaReader and TikaSource tests. The Tika version was updated to 1.15 (released
by []) and commons-compress to 1.14 (see TIKA-2099 for example).

In general, I'd like to keep the initial contribution well isolated and follow up later with
optimizations that would affect other Beam modules. Specifically, the two most immediate
follow-up PRs would be: (1) updating the Beam-managed commons-compress dependency to 1.14 and
removing the explicit version from tika/pom.xml, and (2) attempting to refactor the
FileBasedSource composite reader a bit so that its code can be reused by TikaSource.

The last thing I'd like to investigate before starting is what may need to be done around
non-UTF-8 charsets. I don't expect TikaReader to produce anything other than Strings, though.
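
For illustration only, a minimal Java sketch of why the charset question sits at the
byte-to-String boundary: once bytes have been decoded with the right charset, the resulting
String no longer depends on the source encoding. This is not code from the Tika component;
the class name CharsetSketch and the helper decode are hypothetical, and it uses only the
JDK, not Tika or Beam.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Beam or Tika.
public class CharsetSketch {

    // Decode raw bytes using the charset they were originally written in.
    static String decode(byte[] raw, Charset source) {
        return new String(raw, source);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";

        // Simulate file content stored in a non-UTF-8 charset.
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the correct charset recovers the text.
        String right = decode(latin1, StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset silently corrupts it.
        String wrong = decode(latin1, StandardCharsets.UTF_8);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false

        // A correctly decoded String can be re-encoded as UTF-8 downstream.
        byte[] utf8 = right.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 5
    }
}
```

Under this assumption, a reader that emits Strings only needs the charset to be handled
correctly while the input bytes are being decoded.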

I'm away next week and will start preparing the initial PR shortly afterwards.


> Introduce Apache Tika Input component
> -------------------------------------
>                 Key: BEAM-2328
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-ideas, sdk-java-extensions
>            Reporter: Sergey Beryozkin
>            Assignee: Sergey Beryozkin
>             Fix For: 2.1.0
> Apache Tika is a popular project that offers extensive support for parsing a variety
> of file formats. It is used in many projects, including Lucene and Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest to many users.
> A PR is to follow.

This message was sent by Atlassian JIRA
