manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1433) Add CLI options to pipeline modules, e.g. allow Tika to export TEXT, not BASE64
Date Wed, 21 Jun 2017 17:29:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057881#comment-16057881 ]

Karl Wright commented on CONNECTORS-1433:
-----------------------------------------

I've never been clear on whether the ES connector is using the mapper attachment correctly
or not.  The content is binary (not text) and ES doesn't do its own Tika extraction of the
binary, so I can see why this might be difficult.  But an assumed ability to convert directly
to text isn't going to work either because we do primarily output binary content.

The big question is: what is a better way to view this problem?

(1) If ES can only accept *text* output, then we should reject all content that isn't text,
and we should *not* convert to base64.  That would force people generally to use the Tika
transformer with the ES output connector.
(2) If the mapper attachment can do some kinds of conversions, and it can convert base64 back
to characters, then we can leave things as they are.
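
To make the two options concrete, here is a minimal sketch of what each would mean for the payload handed to the output side. It is illustrative only: it is not ManifoldCF or Elasticsearch code, the function names are hypothetical, and treating only text/* MIME types as "text" is an assumption.

```python
import base64
from typing import Optional


def is_text(mime_type: str) -> bool:
    # Assumption: only text/* MIME types count as "text".
    return mime_type.lower().startswith("text/")


def option_1_payload(content: bytes, mime_type: str,
                     charset: str = "utf-8") -> Optional[str]:
    # Option (1): send characters as-is and reject anything that is not text.
    # A rejected document would need a text-extraction step (e.g. Tika) upstream.
    if not is_text(mime_type):
        return None
    return content.decode(charset)


def option_2_payload(content: bytes) -> str:
    # Option (2): base64-encode whatever arrives and rely on the receiver
    # to decode it back to the original bytes.
    return base64.b64encode(content).decode("ascii")


if __name__ == "__main__":
    print(option_1_payload(b"hello", "text/plain"))
    print(option_1_payload(b"%PDF-1.4", "application/pdf"))
    print(option_2_payload(b"hello"))
```

Under option (1) the binary document is simply dropped unless something earlier in the pipeline has already turned it into text; under option (2) the receiver gets the original bytes back after decoding, but those bytes are still binary unless it can also extract text from them.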


Please advise.






> Add CLI options to pipeline modules, e.g. allow Tika to export TEXT, not BASE64
> -------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1433
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1433
>             Project: ManifoldCF
>          Issue Type: Wish
>          Components: Tika extractor
>            Reporter: Steph van Schalkwyk
>            Assignee: Karl Wright
>
> Would love to have Tika spout TEXT, not BASE64.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
