manifoldcf-user mailing list archives

From Cihad Guzel <cguz...@gmail.com>
Subject Re: extract email attachment
Date Thu, 09 Feb 2017 13:29:48 GMT
Thanks Karl.

Regards,
Cihad Guzel

2017-02-09 16:27 GMT+03:00 Karl Wright <daddywri@gmail.com>:

> Hi Cihad,
> The comparison should have been:
>
> mp.getCount() <= attachmentNumber
>
> As for changing ":" to "/", the real problem is that these should all be
> ":"'s, including line 678.  My apologies.  I've committed the changes.
>
> Thanks,
> Karl
>
>
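The effect of the corrected comparison can be checked with the values reported in this thread (mp.getCount() is 2, attachmentNumber is 0 or 1). This is only a minimal sketch: the class and method names are hypothetical, and the real check sits inline in EmailConnector.processDocuments rather than in a helper like this.

```java
// Hypothetical helper for illustration; not part of EmailConnector.
final class AttachmentBoundsCheck {

    /** Original condition: true for every valid index, so every attachment was deleted. */
    static boolean buggyIsMissing(int partCount, int attachmentNumber) {
        return partCount >= attachmentNumber;
    }

    /** Corrected condition: true only when the index points past the last part. */
    static boolean isMissing(int partCount, int attachmentNumber) {
        return partCount <= attachmentNumber;
    }

    public static void main(String[] args) {
        int partCount = 2; // mp.getCount() in the reported case
        for (int attachmentNumber = 0; attachmentNumber < 3; attachmentNumber++) {
            System.out.println(attachmentNumber
                + ": buggy=" + buggyIsMissing(partCount, attachmentNumber)
                + " fixed=" + isMissing(partCount, attachmentNumber));
        }
    }
}
```

With two parts, the corrected check keeps indices 0 and 1 and flags only index 2 and above as missing, whereas the original condition flagged all of them.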
> On Thu, Feb 9, 2017 at 8:15 AM, Cihad Guzel <cguzelg@gmail.com> wrote:
>
>> Hi Karl,
>>
>> mp.getCount() is 2
>> and
>> attachmentNumber is '0' or '1' in my case.
>>
>> Regards,
>> Cihad Guzel
>>
>> 2017-02-09 16:07 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>
>>> Hi Karl,
>>>
>>> I made some changes in the code and then the indexing was done
>>> successfully.
>>>
>>> The changes are as follows:
>>>
>>> I have removed these lines (lines: 772-775):
>>>
>>>               if (mp.getCount() >= attachmentNumber) {
>>>                 activities.deleteDocument(documentIdentifier);
>>>                 continue;
>>>               }
>>>
>>> I updated these lines (lines 1485 and 1586):
>>>       int index2 = di.indexOf("/", index1 + 1);
>>> to:
>>>       int index2 = di.indexOf(":", index1 + 1);
>>>
>>> Regards,
>>> Cihad Guzel
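Why the separator matters can be sketched as follows. The identifier layout used here ("<attachmentIndex or empty>:<folder>:<messageId>") is an assumption made for illustration, not the connector's confirmed format, and the class name is hypothetical. The point is only that a folder such as "myFolder/test" contains "/", so "/" cannot safely act as a field separator, while ":" leaves the folder intact.

```java
// Assumed layout, for illustration only: "<attachmentIndex or empty>:<folder>:<messageId>"
final class DocIdSketch {
    final Integer attachmentIndex; // null when the identifier refers to the email itself
    final String folder;
    final String messageId;

    private DocIdSketch(Integer attachmentIndex, String folder, String messageId) {
        this.attachmentIndex = attachmentIndex;
        this.folder = folder;
        this.messageId = messageId;
    }

    static DocIdSketch parse(String di) {
        int index1 = di.indexOf(":");
        if (index1 < 0) {
            throw new IllegalArgumentException("No separator in: " + di);
        }
        // Both separators are ":"; searching for "/" here would split inside the folder name.
        int index2 = di.indexOf(":", index1 + 1);
        if (index2 < 0) {
            throw new IllegalArgumentException("Missing second separator in: " + di);
        }
        String indexField = di.substring(0, index1);
        // Only parse the numeric field when it is present, to avoid a NumberFormatException.
        Integer attachmentIndex = indexField.isEmpty() ? null : Integer.valueOf(indexField);
        return new DocIdSketch(attachmentIndex,
            di.substring(index1 + 1, index2),
            di.substring(index2 + 1));
    }

    public static void main(String[] args) {
        DocIdSketch id = parse("1:myFolder/test:<abc@mail.example>");
        System.out.println(id.attachmentIndex + " | " + id.folder + " | " + id.messageId);
    }
}
```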
>>>
>>>
>>>
>>>
>>> 2017-02-08 2:10 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>
>>>> Hi Cihad,
>>>>
>>>> You need to set an attachment URL template for the attachments to be
>>>> crawled.  Open your email connection and click the "URL" tab, and you will
>>>> see the new field there.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Shouldn't the 'else' part be processed when the email has an
>>>>> attachment?
>>>>> Although the email has an attachment, only the first part was
>>>>> processed. Also, I don't see the attachment's content in the Solr index.
>>>>>
>>>>> I edited the code for testing as follows:
>>>>>
>>>>>         if (attachmentIndex == null) {
>>>>>           // It's an email
>>>>>           System.out.println("running if block");
>>>>> ...
>>>>>         } else {
>>>>>           System.out.println("running else block");
>>>>>           // It's an attachment
>>>>>           attachmentNumber = attachmentIndex;
>>>>> ...
>>>>>         }
>>>>>
>>>>> Then I ran my job. It was processed 3 times. The log looks like this:
>>>>>
>>>>> ...
>>>>> running if block
>>>>> running if block
>>>>> running if block
>>>>> ...
>>>>>
>>>>>
>>>>> The solr response:
>>>>>
>>>>> {
>>>>>         "subject":["pdf test page"],
>>>>>         "from":["Cihad Guzel <cguzelg@gmail.com>"],
>>>>>         "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
>>>>>         "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>>>>         "mimetype":["",
>>>>>           ""],
>>>>>         "created_date":"2017-02-07T17:37:35.000Z",
>>>>>         "indexed_date":"2017-02-07T21:18:05.382Z",
>>>>>         "to":["Cihad Guzel <cguzelg@gmail.com>"],
>>>>>         "modified_date":"2017-02-07T17:37:35.000Z",
>>>>>         "encoding":["",
>>>>>           ""],
>>>>>         "mime_type":"text/plain",
>>>>>         "stream_size":["null"],
>>>>>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>>>>           "org.apache.tika.parser.txt.TXTParser"],
>>>>>         "stream_content_type":["text/plain"],
>>>>>         "content_encoding":["windows-1252"],
>>>>>         "content_type":["text/plain; charset=windows-1252"],
>>>>>         "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>>>>>         "language":"en",
>>>>>         "_version_":1558710621053124608}]
>>>>>   }
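The run of characters at the end of the content value above is the attachment's base64 body, indexed as plain text. A minimal sketch, assuming only the JDK's java.util.Base64, decodes the first eight characters of that run and shows that they are the start of a PDF header. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical diagnostic helper; not part of ManifoldCF.
final class PdfMagicCheck {

    /** Decodes a base64 fragment whose length is a multiple of four into ASCII text. */
    static String decodePrefix(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    /** True when the decoded fragment starts with the PDF magic bytes. */
    static boolean looksLikePdf(String base64Prefix) {
        return decodePrefix(base64Prefix).startsWith("%PDF-");
    }

    public static void main(String[] args) {
        // First eight characters of the base64 run in the Solr content field.
        String prefix = "JVBERi0x";
        System.out.println(decodePrefix(prefix)); // prints "%PDF-1"
        System.out.println(looksLikePdf(prefix));
    }
}
```

So the PDF reached Solr still base64-encoded inside the message body instead of being extracted as a document of its own.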
>>>>>
>>>>>
>>>>>
>>>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>
>>>>>> Here's the full code for this class:
>>>>>>
>>>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Cihad,
>>>>>>>
>>>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>>>> attachment is being processed.  The code should look like this:
>>>>>>>
>>>>>>>         if (attachmentIndex == null) {
>>>>>>>           // It's an email
>>>>>>> ...
>>>>>>>         } else {
>>>>>>>           // It's an attachment
>>>>>>>           attachmentNumber = attachmentIndex;
>>>>>>> ...
>>>>>>>         }
>>>>>>>
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>>>>>>>
>>>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>
>>>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>>>> Please let me know if that fixes the issue.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> I get the following error:
>>>>>>>>>>
>>>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Thanks Karl,
>>>>>>>>>>>
>>>>>>>>>>> I will try it.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Cihad Guzel
>>>>>>>>>>>
>>>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>>>>>> CONNECTORS-1375.  Please let me know if it works for you; if not, I'll fix
>>>>>>>>>>>> what doesn't work.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't currently
>>>>>>>>>>>>> include the attachment data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>>>> unextracted to the output connector as metadata attribute data.  There are
>>>>>>>>>>>>>> no transformation connectors that look at this metadata.  Solr cell also
>>>>>>>>>>>>>> probably does not handle binary in random metadata attributes the proper
>>>>>>>>>>>>>> way.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The connector's attachment code therefore seems to be
>>>>>>>>>>>>>> designed only to deal with textual attachments.  The right solution is to
>>>>>>>>>>>>>> have individual IDs for each attachment.  But that would also require there
>>>>>>>>>>>>>> to be a URL we could construct for each attachment.  We could provide an
>>>>>>>>>>>>>> additional URI template for attachments, but I'd wonder if your system has
>>>>>>>>>>>>>> the ability to serve attachments by their own URLs.  Please let me know if
>>>>>>>>>>>>>> this would work and if so I can create a ticket and work on making these
>>>>>>>>>>>>>> changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the file [1]
>>>>>>>>>>>>>>> to a new email and sent it to my test email address.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>>>>>>> successfully. But Solr's content field does not contain my attachment's
>>>>>>>>>>>>>>> content body. The Solr content field looks like this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition:
>>>>>>>>>>>>>>> attachment; filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>>>> type is PDF? Does it not extract the contents?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks
>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks
>>>>>>>>>> Cihad Güzel
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks
>>>>>>>> Cihad Güzel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Cihad Güzel
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Cihad Güzel
>>>
>>
>>
>>
>> --
>> Thanks
>> Cihad Güzel
>>
>
>


-- 
Thanks
Cihad Güzel
