From: Karl Wright
Date: Tue, 7 Feb 2017 14:36:05 -0500
Subject: Re: extract email attachment
To: user@manifoldcf.apache.org

I've created a ticket and attached a patch to it: CONNECTORS-1375. Please
let me know if it works for you; if not, I'll fix what doesn't work.

Karl

On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright wrote:

> Correction: the only metadata attribute we set is the attachment(s)
> mimetype (as a multivalued field) -- this doesn't currently include the
> attachment data.
>
> Karl
>
>
> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright wrote:
>
>> Hi Cihad,
>>
>> The email connector is providing the attachment data unextracted to the
>> output connector as metadata attribute data. There are no transformation
>> connectors that look at this metadata. Solr Cell also probably does not
>> handle binary in random metadata attributes the proper way.
>>
>> The connector's attachment code therefore seems to be designed only to
>> deal with textual attachments. The right solution is to have individual
>> IDs for each attachment. But that would also require there to be a URL we
>> could construct for each attachment. We could provide an additional URI
>> template for attachments, but I'd wonder if your system has the ability to
>> serve attachments by their own URLs. Please let me know if this would work
>> and if so I can create a ticket and work on making these changes.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel wrote:
>>
>>> Hi,
>>>
>>> I tried the email connector with Gmail. I attached the file [1] to a new
>>> email and sent it to my test email address.
>>>
>>> My mail body is: "this is test mail for mfc"
>>>
>>> Then I ran my email job and the email was indexed to Solr successfully.
>>> But Solr's content field does not contain the text of my attachment. The
>>> Solr content field looks like:
>>>
>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>> --94eb2c1910841bc55f0547f43443\r\n
>>> Content-Type: multipart/alternative; boundary=94eb2c1910841bc5530547f43441\r\n\r\n
>>> --94eb2c1910841bc5530547f43441\r\n
>>> Content-Type: text/plain; charset=UTF-8\r\n\r\n
>>> this is test mail for mfc.\r\n\r\n
>>> --94eb2c1910841bc5530547f43441\r\n
>>> Content-Type: text/html; charset=UTF-8\r\n\r\n
>>> <div dir=\"ltr\">this is test mail for mfc.\r\n</div>\r\n\r\n
>>> --94eb2c1910841bc5530547f43441--\r\n
>>> --94eb2c1910841bc55f0547f43443\r\n
>>> Content-Type: application/pdf; name=\"pdf-test.pdf\"\r\n
>>> Content-Disposition: attachment; filename=\"pdf-test.pdf\"\r\n
>>> Content-Transfer-Encoding: base64\r\n
>>> X-Attachment-Id: f_iyvt78qa0\r\n\r\n
>>> JVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDAvRSAx\r\n
>>> NDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2JqDSAgICAgICAgICAgICAgICAg\r\n
>>> DQp4cmVmDQozNyAzNA0KMDAwMDAwMDAxNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\n
>>> MDAwMDE1MjIgMDAwM ..."
>>>
>>> Does the MFC email connector know that the attachment's file type is
>>> PDF? Does it not extract the contents?
>>>
>>> [1] http://www.orimi.com/pdf-test.pdf
>>> --
>>> Regards
>>> Cihad Güzel
>>>
>>
>>
>
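The symptom reported in this thread (the raw multipart body, base64 PDF included, landing in the Solr content field) is what indexing looks like when a message's MIME parts are not walked separately. Below is a minimal sketch in Python, using only the standard-library email package, of splitting a message into its text body and its attachments, with each attachment given its own ID as suggested in the thread. It is not the ManifoldCF connector code or the CONNECTORS-1375 patch; the function names, the ID scheme ("<message id>/attachment/<n>"), and the sample message are assumptions made purely for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a small multipart message with one binary attachment (made-up data)."""
    msg = EmailMessage()
    msg["Subject"] = "test"
    msg["From"] = "sender@example.com"
    msg["To"] = "rcpt@example.com"
    msg.set_content("this is test mail for mfc.")
    # add_attachment base64-encodes the bytes and turns the message into multipart/mixed.
    msg.add_attachment(
        b"%PDF-1.6 fake bytes",
        maintype="application",
        subtype="pdf",
        filename="pdf-test.pdf",
    )
    return msg.as_bytes()


def split_message(raw: bytes, message_id: str):
    """Return (body_text, attachments) instead of one undifferentiated blob.

    The per-attachment ID format is hypothetical; a real system would need
    an ID (and URL) scheme it can actually serve.
    """
    msg = message_from_bytes(raw, policy=policy.default)

    # Pick the text/plain body part, if there is one.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    attachments = []
    for index, part in enumerate(msg.iter_attachments()):
        attachments.append({
            "id": f"{message_id}/attachment/{index}",
            "filename": part.get_filename(),
            "mimetype": part.get_content_type(),
            # decode=True undoes the base64 transfer encoding, yielding raw bytes.
            "data": part.get_payload(decode=True),
        })
    return body, attachments


if __name__ == "__main__":
    text, files = split_message(build_sample(), "msg-1")
    print(repr(text))
    for f in files:
        print(f["id"], f["mimetype"], f["filename"], len(f["data"]), "bytes")
```

With the parts separated this way, the decoded attachment bytes could be passed to a content extractor as their own document, while only the text body goes into the message's content field.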