manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: extract email attachment
Date Tue, 07 Feb 2017 22:14:21 GMT
Hi Cihad,

The variable attachmentIndex is *supposed* to be null except when an
attachment is being processed.  The code should look like this:

        if (attachmentIndex == null) {
          // It's an email
...
        } else {
          // It's an attachment
          attachmentNumber = attachmentIndex;
...
        }
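
A minimal, self-contained sketch of that null check follows. It is not the actual EmailConnector code: the class name `DocumentIdSketch`, the `#attachment=` separator, and the identifier layout are all assumptions made for illustration. The point is only that the attachment index stays null for a plain email identifier instead of being passed to Integer.parseInt.

```java
/** Hypothetical helper for illustration; not part of ManifoldCF. */
final class DocumentIdSketch {
  // Assumed separator between the email identifier and the attachment number.
  static final String ATTACHMENT_SEPARATOR = "#attachment=";

  final String messageKey;       // identifier of the email itself
  final Integer attachmentIndex; // null unless an attachment is being processed

  private DocumentIdSketch(String messageKey, Integer attachmentIndex) {
    this.messageKey = messageKey;
    this.attachmentIndex = attachmentIndex;
  }

  /** Splits an identifier into the email key and an optional attachment index. */
  static DocumentIdSketch parse(String documentIdentifier) {
    int pos = documentIdentifier.lastIndexOf(ATTACHMENT_SEPARATOR);
    if (pos < 0) {
      // No attachment suffix: the whole identifier names an email.
      return new DocumentIdSketch(documentIdentifier, null);
    }
    String suffix = documentIdentifier.substring(pos + ATTACHMENT_SEPARATOR.length());
    try {
      int index = Integer.parseInt(suffix);
      if (index < 0) {
        // A negative index is not meaningful here; treat it as an email.
        return new DocumentIdSketch(documentIdentifier, null);
      }
      return new DocumentIdSketch(documentIdentifier.substring(0, pos), index);
    } catch (NumberFormatException e) {
      // Suffix is not a number: fall back to treating it as an email.
      return new DocumentIdSketch(documentIdentifier, null);
    }
  }

  /** Mirrors the two branches above: email when the index is null, attachment otherwise. */
  String describe() {
    if (attachmentIndex == null) {
      // It's an email
      return "email " + messageKey;
    } else {
      // It's an attachment
      int attachmentNumber = attachmentIndex;
      return "attachment " + attachmentNumber + " of " + messageKey;
    }
  }
}
```

With this layout, an identifier such as "myFolder/test:<abc@mail.gmail.com>" (a made-up message ID) takes the email branch, and the same string with "#attachment=2" appended takes the attachment branch.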


Karl


On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com> wrote:

> Hi Karl,
>
> I added a LOG line for testing. It looks like attachmentIndex is null.
>
> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>
>> I attached a second patch (to apply on top of the first patch).  Please
>> let me know if that fixes the issue.
>>
>> Karl
>>
>>
>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I get the following error:
>>>
>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>
>>>
>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>
>>>> Thanks Karl,
>>>>
>>>> I will try it.
>>>>
>>>> Regards
>>>> Cihad Guzel
>>>>
>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>
>>>>> I've created a ticket and attached a patch to it.  CONNECTORS-1375.
>>>>> Please let me know if it works for you; if not, I'll fix what doesn't
>>>>> work.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Correction: the only metadata attribute we set is the attachment(s)
>>>>>> mimetype (as a multivalued field) -- this doesn't currently include
>>>>>> the attachment data.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Cihad,
>>>>>>>
>>>>>>> The email connector is providing the attachment data unextracted to
>>>>>>> the output connector as metadata attribute data.  There are no
>>>>>>> transformation connectors that look at this metadata.  Solr cell also
>>>>>>> probably does not handle binary in random metadata attributes the
>>>>>>> proper way.
>>>>>>>
>>>>>>> The connector's attachment code therefore seems to be designed only
>>>>>>> to deal with textual attachments.  The right solution is to have
>>>>>>> individual IDs for each attachment.  But that would also require there
>>>>>>> to be a URL we could construct for each attachment.  We could provide
>>>>>>> an additional URI template for attachments, but I'd wonder if your
>>>>>>> system has the ability to serve attachments by their own URLs.  Please
>>>>>>> let me know if this would work and if so I can create a ticket and
>>>>>>> work on making these changes.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying the email connector with Gmail. I attached the file [1]
>>>>>>>> to a new email and sent it to my test email address.
>>>>>>>>
>>>>>>>> The mail body is: "this is test mail for mfc"
>>>>>>>>
>>>>>>>> Then I ran my email job and the email was indexed into Solr
>>>>>>>> successfully. But Solr's content field does not contain the extracted
>>>>>>>> text of my attachment. The Solr content field looks like this:
>>>>>>>>
>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>> ..."
>>>>>>>>
>>>>>>>> Does the MFC email connector know that the attachment's file type
>>>>>>>> is PDF? Does it not extract the contents?
>>>>>>>>
>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>> --
>>>>>>>> Regards
>>>>>>>> Cihad Güzel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Cihad Güzel
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks
>>> Cihad Güzel
>>>
>>
>>
>
>
> --
> Thanks
> Cihad Güzel
>
