manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: extract email attachment
Date Tue, 07 Feb 2017 22:17:02 GMT
Here's the full code for this class:

https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java

Karl
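
Note on the stack trace quoted below: the NumberFormatException shows Integer.parseInt receiving what appears to be a whole document identifier ("myFolder/test:<...>") rather than a number. The following is a minimal sketch of a guard that returns null for an email and an index only for an attachment, matching the null check described in the quoted reply. The identifier layout (a trailing ":<n>" for the n-th attachment), the class name, and the method name are assumptions made for illustration; they are not taken from EmailConnector.java or from the CONNECTORS-1375 patch.

```java
public class AttachmentIdSketch {

    /**
     * Returns the attachment index encoded in the identifier, or null when
     * the identifier refers to the email itself.
     *
     * Assumed (hypothetical) layout:
     *   "<folder>:<message-id>"      for an email
     *   "<folder>:<message-id>:<n>"  for its n-th attachment
     */
    static Integer attachmentIndex(String documentIdentifier) {
        int lastColon = documentIdentifier.lastIndexOf(':');
        if (lastColon < 0 || lastColon == documentIdentifier.length() - 1) {
            return null; // no suffix at all
        }
        String suffix = documentIdentifier.substring(lastColon + 1);
        // Only treat the suffix as an index if it consists purely of digits.
        for (int i = 0; i < suffix.length(); i++) {
            char c = suffix.charAt(i);
            if (c < '0' || c > '9') {
                return null; // suffix is the message-id, not an index
            }
        }
        try {
            return Integer.valueOf(suffix);
        } catch (NumberFormatException e) {
            return null; // all digits, but too large for an int
        }
    }

    public static void main(String[] args) {
        // Hypothetical identifiers, shaped like the one in the stack trace.
        String email = "myFolder/test:<abc123@mail.example.com>";
        String attachment = email + ":1";

        System.out.println(attachmentIndex(email));
        System.out.println(attachmentIndex(attachment));
    }
}
```

With a guard of this shape, the caller can branch on attachmentIndex == null instead of calling Integer.parseInt on the full identifier. A message-id that itself ends in ":<digits>" would be misread, so a real implementation would need an unambiguous separator.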


On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Cihad,
>
> The variable attachmentIndex is *supposed* to be null except when an
> attachment is being processed.  The code should look like this:
>
>         if (attachmentIndex == null) {
>           // It's an email
> ...
>         } else {
>           // It's an attachment
>           attachmentNumber = attachmentIndex;
> ...
>         }
>
>
> Karl
>
>
> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>
>> Hi Karl,
>>
>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>
>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>
>>> I attached a second patch (to apply on top of the first patch).  Please
>>> let me know if that fixes the issue.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I get the following error:
>>>>
>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>
>>>>
>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>
>>>>> Thanks Karl,
>>>>>
>>>>> I will try it.
>>>>>
>>>>> Regards
>>>>> Cihad Guzel
>>>>>
>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>
>>>>>> I've created a ticket and attached a patch to it.  CONNECTORS-1375.
>>>>>> Please let me know if it works for you; if not, I'll fix what doesn't
>>>>>> work.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Correction: the only metadata attribute we set is the attachment(s)
>>>>>>> mimetype (as a multivalued field) -- this doesn't currently include
>>>>>>> the attachment data.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Cihad,
>>>>>>>>
>>>>>>>> The email connector is providing the attachment data unextracted
>>>>>>>> to the output connector as metadata attribute data.  There are no
>>>>>>>> transformation connectors that look at this metadata.  Solr Cell
>>>>>>>> also probably does not handle binary in random metadata attributes
>>>>>>>> the proper way.
>>>>>>>>
>>>>>>>> The connector's attachment code therefore seems to be designed only
>>>>>>>> to deal with textual attachments.  The right solution is to have
>>>>>>>> individual IDs for each attachment.  But that would also require
>>>>>>>> there to be a URL we could construct for each attachment.  We could
>>>>>>>> provide an additional URI template for attachments, but I'd wonder
>>>>>>>> if your system has the ability to serve attachments by their own
>>>>>>>> URLs.  Please let me know if this would work and if so I can create
>>>>>>>> a ticket and work on making these changes.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am trying the email connector with Gmail. I attached the file
>>>>>>>>> [1] to a new email and sent it to my test email address.
>>>>>>>>>
>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>
>>>>>>>>> Then I ran my email job, and the email was indexed into Solr
>>>>>>>>> successfully. But Solr's content field does not contain my
>>>>>>>>> attachment's extracted content. The Solr content field looks like
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>> ..."
>>>>>>>>>
>>>>>>>>> Does the MFC email connector know that the attachment's file type
>>>>>>>>> is PDF? Does it not extract the contents?
>>>>>>>>>
>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>> Cihad Güzel
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Cihad Güzel
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Cihad Güzel
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Cihad Güzel
>>
>
>
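
Note on the Solr content field quoted above: it holds the raw MIME body, including the base64-encoded attachment, so the PDF bytes were indexed undecoded rather than extracted. One way to confirm this is to decode the first few characters of the payload and look for the "%PDF-" file header. The sketch below uses java.util.Base64 from the JDK; the class and method names are hypothetical, and the sample string is the start of the payload from the quoted content with the archive's line wrap removed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64PdfSniff {

    /**
     * Decodes the leading part of a base64 payload and reports whether it
     * starts with the PDF file signature "%PDF-".
     * The input must contain only base64 alphabet characters (no CR/LF).
     */
    static boolean looksLikePdf(String base64Prefix) {
        // Base64 decodes in groups of 4 characters; drop any incomplete tail.
        int usable = base64Prefix.length() - (base64Prefix.length() % 4);
        byte[] decoded = Base64.getDecoder().decode(base64Prefix.substring(0, usable));
        // ISO-8859-1 maps every byte to one char, so binary data is safe here.
        String head = new String(decoded, StandardCharsets.ISO_8859_1);
        return head.startsWith("%PDF-");
    }

    public static void main(String[] args) {
        // Start of the attachment payload quoted in the thread.
        String payloadStart = "JVBERi0xLjYNJeLjz9MNCjM3";
        System.out.println(looksLikePdf(payloadStart)); // prints true
    }
}
```

The first twelve characters, "JVBERi0xLjYN", decode to "%PDF-1.6" followed by a carriage return, which is consistent with the attachment being stored as undecoded binary rather than as extracted text.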
