manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: extract email attachment
Date Thu, 09 Feb 2017 13:27:06 GMT
Hi Cihad,
The comparison should have been:

mp.getCount() <= attachmentNumber
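
To make the difference concrete, here is a minimal sketch that compares the
original check with the corrected one for the values reported in this thread
(mp.getCount() is 2, attachmentNumber is 0 or 1). The class and method names
are hypothetical and are not part of EmailConnector; only the two comparisons
come from the thread.

```java
public class AttachmentBoundsCheck {
  // Hypothetical helper, not part of EmailConnector: true when the requested
  // zero-based attachment number has no matching part in the multipart.
  static boolean isOutOfRange(int partCount, int attachmentNumber) {
    return partCount <= attachmentNumber;
  }

  // The comparison as originally written, kept only for contrast.
  static boolean originalCheck(int partCount, int attachmentNumber) {
    return partCount >= attachmentNumber;
  }

  public static void main(String[] args) {
    int partCount = 2; // value reported for mp.getCount() in this thread
    for (int attachmentNumber = 0; attachmentNumber <= 2; attachmentNumber++) {
      System.out.println("attachmentNumber=" + attachmentNumber
          + " original=" + originalCheck(partCount, attachmentNumber)
          + " corrected=" + isOutOfRange(partCount, attachmentNumber));
    }
  }
}
```

With the original comparison, the delete branch is taken for both valid
attachment numbers, which matches what Cihad saw. With the corrected one, only
an attachment number at or past the part count is treated as missing.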

As for changing "/" to ":", the real problem is that these should all be
":"'s, including line 678.  My apologies.  I've committed the changes.

Thanks,
Karl
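
For anyone following along, the separator fix can be sketched as below. This
is only an illustration under stated assumptions: it assumes a document
identifier of the form folder:messageId for an email and
folder:messageId:attachmentNumber for an attachment, and it assumes neither
the folder nor the message ID contains a ":". The class, its fields, and the
sample message ID are hypothetical; the real parsing lives in
EmailConnector.java.

```java
public class DocumentIdentifierSketch {
  final String folder;
  final String messageId;
  final Integer attachmentIndex; // null means the identifier names the email itself

  DocumentIdentifierSketch(String folder, String messageId, Integer attachmentIndex) {
    this.folder = folder;
    this.messageId = messageId;
    this.attachmentIndex = attachmentIndex;
  }

  // Splits on ":" only. A folder such as "myFolder/test" contains "/",
  // so "/" cannot double as a field separator.
  static DocumentIdentifierSketch parse(String di) {
    int index1 = di.indexOf(":");
    if (index1 == -1) {
      throw new IllegalArgumentException("No ':' separator in: " + di);
    }
    String folder = di.substring(0, index1);
    int index2 = di.indexOf(":", index1 + 1);
    if (index2 == -1) {
      // No second separator: this is an email, not an attachment.
      return new DocumentIdentifierSketch(folder, di.substring(index1 + 1), null);
    }
    String messageId = di.substring(index1 + 1, index2);
    // Parse only the trailing field as a number, never the whole identifier.
    int attachment = Integer.parseInt(di.substring(index2 + 1));
    return new DocumentIdentifierSketch(folder, messageId, attachment);
  }

  public static void main(String[] args) {
    DocumentIdentifierSketch email = parse("myFolder/test:<abc@example.com>");
    DocumentIdentifierSketch attachment = parse("myFolder/test:<abc@example.com>:1");
    System.out.println(email.folder + " attachmentIndex=" + email.attachmentIndex);
    System.out.println(attachment.folder + " attachmentIndex=" + attachment.attachmentIndex);
  }
}
```

Passing the whole identifier to Integer.parseInt, rather than just the
trailing field, would throw a NumberFormatException like the one reported
earlier in the thread.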


On Thu, Feb 9, 2017 at 8:15 AM, Cihad Guzel <cguzelg@gmail.com> wrote:

> Hi Karl,
>
> mp.getCount() is 2
> and
> attachmentNumber is '0' or '1' in my case.
>
> Regards,
> Cihad Guzel
>
> 2017-02-09 16:07 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>
>> Hi Karl,
>>
>> I made some changes in the code and then the indexing was done
>> successfully.
>>
>> The changes are as follows:
>>
>> I removed these lines (lines 772-775):
>>
>>              if (mp.getCount() >= attachmentNumber) {
>>                activities.deleteDocument(documentIdentifier);
>>                continue;
>>              }
>>
>> I updated these lines (lines 1485 and 1586) from:
>>       int index2 = di.indexOf("/", index1 + 1);
>> to:
>>       int index2 = di.indexOf(":", index1 + 1);
>>
>> Regards,
>> Cihad Guzel
>>
>>
>>
>>
>> 2017-02-08 2:10 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>
>>> Hi Cihad,
>>>
>>> You need to set an attachment URL template for the attachments to be
>>> crawled.  Open your email connection and click the "URL" tab, and you will
>>> see the new field there.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Shouldn't the 'else' branch be processed when the email has an
>>>> attachment?
>>>> Although the email has an attachment, only the first part was
>>>> processed. Also, I don't see the attachment's content in the Solr index.
>>>>
>>>> I edited the code for testing as follows:
>>>>
>>>>  if (attachmentIndex == null) {
>>>>           // It's an email
>>>>           System.out.println("running if block");
>>>> ...
>>>>         } else {
>>>>           System.out.println("running else block");
>>>>           // It's an attachment
>>>>           attachmentNumber = attachmentIndex;
>>>> ...
>>>>         }
>>>>
>>>> Then I ran my job. Processing ran 3 times. The log looks like this:
>>>>
>>>> ...
>>>> running if block
>>>> running if block
>>>> running if block
>>>> ...
>>>>
>>>>
>>>> The Solr response:
>>>>
>>>> {
>>>>         "subject":["pdf test page"],
>>>>         "from":["Cihad Guzel <cguzelg@gmail.com>"],
>>>>         "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
>>>>         "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>>>         "mimetype":["",
>>>>           ""],
>>>>         "created_date":"2017-02-07T17:37:35.000Z",
>>>>         "indexed_date":"2017-02-07T21:18:05.382Z",
>>>>         "to":["Cihad Guzel <cguzelg@gmail.com>"],
>>>>         "modified_date":"2017-02-07T17:37:35.000Z",
>>>>         "encoding":["",
>>>>           ""],
>>>>         "mime_type":"text/plain",
>>>>         "stream_size":["null"],
>>>>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>>>           "org.apache.tika.parser.txt.TXTParser"],
>>>>         "stream_content_type":["text/plain"],
>>>>         "content_encoding":["windows-1252"],
>>>>         "content_type":["text/plain; charset=windows-1252"],
>>>>         "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type: text/html;
>>>> charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>>>>         "language":"en",
>>>>         "_version_":1558710621053124608}]
>>>>   }
>>>>
>>>>
>>>>
>>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>
>>>>> Here's the full code for this class:
>>>>>
>>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Cihad,
>>>>>>
>>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>>> attachment is being processed.  The code should look like this:
>>>>>>
>>>>>>         if (attachmentIndex == null) {
>>>>>>           // It's an email
>>>>>> ...
>>>>>>         } else {
>>>>>>           // It's an attachment
>>>>>>           attachmentNumber = attachmentIndex;
>>>>>> ...
>>>>>>         }
>>>>>>
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>>>>>>
>>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>
>>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>>> Please let me know if that fixes the issue.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> I get the following error:
>>>>>>>>>
>>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed:
>>>>>>>>> For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>>> java.lang.NumberFormatException: For input string:
>>>>>>>>> "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Thanks Karl,
>>>>>>>>>>
>>>>>>>>>> I will try it.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Cihad Guzel
>>>>>>>>>>
>>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>>>>> CONNECTORS-1375.  Please let me know if it works for you; if not,
>>>>>>>>>>> I'll fix what doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>>> unextracted to the output connector as metadata attribute data.
>>>>>>>>>>>>> There are no transformation connectors that look at this
>>>>>>>>>>>>> metadata.  Solr cell also probably does not handle binary in
>>>>>>>>>>>>> random metadata attributes the proper way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>>>>> only to deal with textual attachments.  The right solution is to
>>>>>>>>>>>>> have individual IDs for each attachment.  But that would also
>>>>>>>>>>>>> require there to be a URL we could construct for each
>>>>>>>>>>>>> attachment.  We could provide an additional URI template for
>>>>>>>>>>>>> attachments, but I'd wonder if your system has the ability to
>>>>>>>>>>>>> serve attachments by their own URLs.  Please let me know if this
>>>>>>>>>>>>> would work and if so I can create a ticket and work on making
>>>>>>>>>>>>> these changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the file [1]
>>>>>>>>>>>>>> to a new email and sent it to my test email address.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The mail body is: "this is test mail for mfc"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I ran my email job, and the email was indexed to Solr
>>>>>>>>>>>>>> successfully. But Solr's content field does not contain my
>>>>>>>>>>>>>> attachment's content. The Solr content field looks like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition:
>>>>>>>>>>>>>> attachment; filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>>> type is PDF? Does it not extract the contents?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks
>>>>>>>>>> Cihad Güzel
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks
>>>>>>>>> Cihad Güzel
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Cihad Güzel
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Cihad Güzel
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Cihad Güzel
>>
>
>
>
> --
> Thanks
> Cihad Güzel
>
