Date: Wed, 21 Jun 2017 17:29:00 +0000 (UTC)
From: "Karl Wright (JIRA)"
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-1433) Add CLI options to pipeline modules, e.g. allow Tika to export TEXT, not BASE64

[ https://issues.apache.org/jira/browse/CONNECTORS-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057881#comment-16057881 ]

Karl Wright commented on CONNECTORS-1433:
-----------------------------------------

I've never been clear on whether the ES connector is using the mapper attachment correctly or not. The content is binary (not text) and ES doesn't do its own Tika extraction of the binary, so I can see why this might be difficult. But an assumed ability to convert directly to text isn't going to work either, because we primarily output binary content.

The big question is: what is a better way to view this problem?

(1) If ES can only accept *text* output, then we should reject all content that isn't text, and we should *not* convert to base64. That would generally force people to use the Tika transformer with the ES output connector.
(2) If the mapper attachment can do some kinds of conversions, and it can convert base64 back to characters, then we can leave things as they are.

Please advise.

> Add CLI options to pipeline modules, e.g. allow Tika to export TEXT, not BASE64
> -------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1433
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1433
>             Project: ManifoldCF
>          Issue Type: Wish
>          Components: Tika extractor
>            Reporter: Steph van Schalkwyk
>            Assignee: Karl Wright
>
> Would love to have Tika spout TEXT, not BASE64.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
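
For illustration only, the two options raised in the comment above could be sketched roughly as follows. This is not ManifoldCF or Elasticsearch code: the function names, the MIME-prefix check, and the UTF-8 assumption are all hypothetical and chosen just to make the trade-off concrete.

```python
import base64

# Hypothetical: which MIME types count as "text" is an assumption of this sketch.
TEXT_MIME_PREFIXES = ("text/",)


def prepare_strict(content: bytes, mime_type: str) -> str:
    """Option (1): accept only text content and never base64-encode it."""
    if not mime_type.startswith(TEXT_MIME_PREFIXES):
        # Binary content is rejected; upstream extraction would be required.
        raise ValueError(f"rejected non-text content: {mime_type}")
    # Assumes the text is UTF-8 encoded.
    return content.decode("utf-8")


def prepare_base64(content: bytes) -> str:
    """Option (2): base64-encode everything and rely on the receiver to decode it."""
    return base64.b64encode(content).decode("ascii")
```

Under the first function a PDF or Office document would be rejected unless something upstream (such as the Tika transformer) had already turned it into text. Under the second, any content is accepted, but the receiving side has to be able to decode the base64 and interpret the bytes.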