manifoldcf-user mailing list archives

From Cihad Guzel <cguz...@gmail.com>
Subject Re: extract email attachment
Date Tue, 07 Feb 2017 20:59:51 GMT
Hi Karl,

I get the following error:

FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
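
As far as I can tell from the trace, Integer.parseInt is being handed the whole document identifier (folder, colon, Message-ID) instead of a number. Below is a minimal sketch that reproduces the same exception outside ManifoldCF. It is not the connector's code: the class DocumentIdSketch and the helper tryParseInt are hypothetical names, and the identifier is simply copied from the log line above.

```java
import java.util.OptionalInt;

public class DocumentIdSketch {

    // Hypothetical helper: returns the parsed value, or an empty
    // OptionalInt when the text is not a valid decimal integer.
    public static OptionalInt tryParseInt(String text) {
        try {
            return OptionalInt.of(Integer.parseInt(text));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Identifier copied from the log line; its layout is an assumption.
        String documentId =
            "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>";

        try {
            // Same call that appears in the stack trace.
            Integer.parseInt(documentId);
        } catch (NumberFormatException e) {
            System.out.println("Reproduced: " + e.getMessage());
        }

        // A guarded parse reports failure instead of throwing.
        System.out.println("Identifier is numeric: " + tryParseInt(documentId).isPresent());
        System.out.println("\"2\" is numeric: " + tryParseInt("2").isPresent());
    }
}
```

So my guess is that the code at EmailConnector.java:705 expects a numeric field at a position where my identifier does not have one, but I have not checked the patched source.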


2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:

> Thanks Karl,
>
> I will try it.
>
> Regards
> Cihad Guzel
>
> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>
>> I've created a ticket and attached a patch to it.  CONNECTORS-1375.
>> Please let me know if it works for you; if not, I'll fix what doesn't work.
>>
>> Karl
>>
>>
>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Correction: the only metadata attribute we set is the attachment(s)
>>> mimetype (as a multivalued field) -- this doesn't currently include the
>>> attachment data.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Cihad,
>>>>
>>>> The email connector is providing the attachment data unextracted to the
>>>> output connector as metadata attribute data.  There are no transformation
>>>> connectors that look at this metadata.  Solr cell also probably does not
>>>> handle binary in random metadata attributes the proper way.
>>>>
>>>> The connector's attachment code therefore seems to be designed only to
>>>> deal with textual attachments.  The right solution is to have individual
>>>> IDs for each attachment.  But that would also require there to be a URL we
>>>> could construct for each attachment.  We could provide an additional URI
>>>> template for attachments, but I'd wonder if your system has the ability to
>>>> serve attachments by their own URLs.  Please let me know if this would work
>>>> and if so I can create a ticket and work on making these changes.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying the email connector with Gmail. I attached the file [1] to a
>>>>> new email and sent it to my test email address.
>>>>>
>>>>> The body of my mail is: "this is test mail for mfc"
>>>>>
>>>>> Then I ran my email job and the email was indexed to Solr successfully.
>>>>> But Solr's content field does not contain the body of my attachment. The
>>>>> Solr content field looks like:
>>>>>
>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>> ..."
>>>>>
>>>>> Does the MFC email connector know that the attachment's file type is
>>>>> PDF? Does it not extract the contents?
>>>>>
>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>> --
>>>>> Regards
>>>>> Cihad Güzel
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Teşekkürler
> Cihad Güzel
>



-- 
Teşekkürler
Cihad Güzel
