manifoldcf-user mailing list archives

From Cihad Guzel <cguz...@gmail.com>
Subject Re: extract email attachment
Date Tue, 07 Feb 2017 23:07:47 GMT
Hi Karl,

Shouldn't the 'else' part be processed when the email has an attachment?

Although the email has an attachment, only the first part was processed.
Also, I don't see the attachment's content in the Solr index.

I edited the code for testing as follows:

        if (attachmentIndex == null) {
          // It's an email
          System.out.println("running if block");
...
        } else {
          System.out.println("running else block");
          // It's an attachment
          attachmentNumber = attachmentIndex;
...
        }

Then I ran my job. It processed 3 times. The log looks like this:

...
running if block
running if block
running if block
...

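As a point of comparison, here is a minimal sketch of how the if/else above could be driven by the document identifier. It is not the EmailConnector code: the class AttachmentIdSketch, its methods, and the trailing ":<number>" identifier format are all assumptions made for illustration. The suffix is checked for digits before parsing, so a message ID is never handed to Integer.parseInt, which is what the NumberFormatException quoted further down in this thread suggests happened.

```java
// Hypothetical sketch, not the actual EmailConnector code.
// Assumption: an attachment's document identifier is the email's identifier
// plus a trailing ":<number>"; a plain email has no such numeric suffix.
final class AttachmentIdSketch {

  /** Returns the attachment index, or null when the identifier names the email itself. */
  static Integer parseAttachmentIndex(String documentIdentifier) {
    int colon = documentIdentifier.lastIndexOf(':');
    if (colon == -1 || colon == documentIdentifier.length() - 1) {
      return null;
    }
    String suffix = documentIdentifier.substring(colon + 1);
    // Accept ASCII digits only, so text such as a message ID is rejected before parsing.
    for (int i = 0; i < suffix.length(); i++) {
      char c = suffix.charAt(i);
      if (c < '0' || c > '9') {
        return null;
      }
    }
    try {
      return Integer.valueOf(suffix);
    } catch (NumberFormatException e) {
      return null; // digits only, but too large for an int
    }
  }

  /** Mirrors the if/else in the message: null index means email, otherwise attachment. */
  static String describe(String documentIdentifier) {
    Integer attachmentIndex = parseAttachmentIndex(documentIdentifier);
    if (attachmentIndex == null) {
      return "email";
    }
    return "attachment " + attachmentIndex;
  }

  public static void main(String[] args) {
    String emailId =
        "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>";
    System.out.println(describe(emailId));        // prints "email"
    System.out.println(describe(emailId + ":0")); // prints "attachment 0"
  }
}
```

Under this assumed format, three "running if block" lines would mean that no identifier carrying an attachment suffix was ever queued, so the else branch had nothing to run on.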

The Solr response:

{
        "subject":["pdf test page"],
        "from":["Cihad Guzel <cguzelg@gmail.com>"],
        "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
        "date":["Tue Feb 07 20:37:35 MSK 2017"],
        "mimetype":["",
          ""],
        "created_date":"2017-02-07T17:37:35.000Z",
        "indexed_date":"2017-02-07T21:18:05.382Z",
        "to":["Cihad Guzel <cguzelg@gmail.com>"],
        "modified_date":"2017-02-07T17:37:35.000Z",
        "encoding":["",
          ""],
        "mime_type":"text/plain",
        "stream_size":["null"],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.txt.TXTParser"],
        "stream_content_type":["text/plain"],
        "content_encoding":["windows-1252"],
        "content_type":["text/plain; charset=windows-1252"],
        "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
 --94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
boundary=94eb2c1910841bc5530547f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
text/plain; charset=UTF-8\r\n\r\nthis is test mail for
mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type: text/html;
charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--94eb2c1910841bc55f0547f43443\r\nContent-Type:
application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
base64\r\nX-Attachment-Id:
f_iyvt78qa0\r\n\r\nJVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9...
",
        "language":"en",
        "_version_":1558710621053124608}]
  }

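The base64 run at the end of the content field can be checked by hand. The sketch below decodes its first eight characters with java.util.Base64 and tests for the "%PDF-" signature. The class and method names are hypothetical, and this is only a diagnostic aid, not something the connector is known to do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical diagnostic helper; not part of ManifoldCF.
final class Base64PdfCheck {

  /** Decodes the first 8 base64 characters (6 bytes) and returns them as ASCII text. */
  static String decodePrefix(String base64) {
    // Drop line breaks and other whitespace that MIME bodies contain.
    String compact = base64.replaceAll("\\s", "");
    if (compact.length() < 8) {
      throw new IllegalArgumentException("need at least 8 base64 characters");
    }
    byte[] head = Base64.getDecoder().decode(compact.substring(0, 8));
    return new String(head, StandardCharsets.US_ASCII);
  }

  /** True when the decoded prefix starts with the PDF signature "%PDF-". */
  static boolean looksLikePdf(String base64) {
    try {
      return decodePrefix(base64).startsWith("%PDF-");
    } catch (IllegalArgumentException e) {
      return false; // too short or not valid base64
    }
  }

  public static void main(String[] args) {
    // Start of the base64 data shown in the Solr content field.
    String start = "JVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8";
    System.out.println(decodePrefix(start)); // prints "%PDF-1"
    System.out.println(looksLikePdf(start)); // prints "true"
  }
}
```

With the prefix from the response, decodePrefix returns %PDF-1, which suggests the attachment bytes reached Solr still base64-encoded inside the message text rather than extracted.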


2017-02-08 1:17 GMT+03:00 Karl Wright <daddywri@gmail.com>:

> Here's the full code for this class:
>
> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>
> Karl
>
>
> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Cihad,
>>
>> The variable attachmentIndex is *supposed* to be null except when an
>> attachment is being processed.  The code should look like this:
>>
>>         if (attachmentIndex == null) {
>>           // It's an email
>> ...
>>         } else {
>>           // It's an attachment
>>           attachmentNumber = attachmentIndex;
>> ...
>>         }
>>
>>
>> Karl
>>
>>
>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>>
>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>
>>>> I attached a second patch (to apply on top of the first patch).  Please
>>>> let me know if that fixes the issue.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I get the following error:
>>>>>
>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>
>>>>>
>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>>
>>>>>> Thanks Karl,
>>>>>>
>>>>>> I will try it.
>>>>>>
>>>>>> Regards
>>>>>> Cihad Guzel
>>>>>>
>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>
>>>>>>> I've created a ticket and attached a patch to it.  CONNECTORS-1375.
>>>>>>> Please let me know if it works for you; if not, I'll fix what
>>>>>>> doesn't work.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Correction: the only metadata attribute we set is the attachment(s)
>>>>>>>> mimetype (as a multivalued field) -- this doesn't currently include
>>>>>>>> the attachment data.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Cihad,
>>>>>>>>>
>>>>>>>>> The email connector is providing the attachment data unextracted
>>>>>>>>> to the output connector as metadata attribute data.  There are no
>>>>>>>>> transformation connectors that look at this metadata.  Solr cell also
>>>>>>>>> probably does not handle binary in random metadata attributes the
>>>>>>>>> proper way.
>>>>>>>>>
>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>> only to deal with textual attachments.  The right solution is to have
>>>>>>>>> individual IDs for each attachment.  But that would also require there
>>>>>>>>> to be a URL we could construct for each attachment.  We could provide an
>>>>>>>>> additional URI template for attachments, but I'd wonder if your system
>>>>>>>>> has the ability to serve attachments by their own URLs.  Please let me
>>>>>>>>> know if this would work and if so I can create a ticket and work on
>>>>>>>>> making these changes.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying the email connector with Gmail. I attached the file [1]
>>>>>>>>>> to a new email and sent it to my test email address.
>>>>>>>>>>
>>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>>
>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>> successfully. However, Solr's content field does not contain my
>>>>>>>>>> attachment's content. The Solr content field looks like this:
>>>>>>>>>>
>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>>>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>> ..."
>>>>>>>>>>
>>>>>>>>>> Does the MFC email connector know that the attachment's file type
>>>>>>>>>> is PDF? Doesn't it extract the contents?
>>>>>>>>>>
>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>> --
>>>>>>>>>> Regards
>>>>>>>>>> Cihad Güzel
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Cihad Güzel
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Cihad Güzel
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Cihad Güzel
>>>
>>
>>
>


-- 
Thanks
Cihad Güzel
