manifoldcf-user mailing list archives

From Cihad Guzel <cguz...@gmail.com>
Subject Re: extract email attachment
Date Thu, 09 Feb 2017 13:15:19 GMT
Hi Karl,

mp.getCount() is 2
and
attachmentNumber is '0' or '1' in my case.
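
For reference, a minimal sketch of why the check quoted below (mp.getCount() >= attachmentNumber) fires with these values. It assumes attachmentNumber is a zero-based index into the multipart and that the count is the number of parts. The class and method names are hypothetical and are not taken from EmailConnector, and the corrected comparison is only my reading of the intent, not the committed fix.

```java
// Hypothetical helper for illustration; not part of EmailConnector.
final class AttachmentBoundsSketch {

  // The comparison as quoted in this thread: true whenever count >= index,
  // which includes every valid zero-based index.
  static boolean quotedCheck(int partCount, int attachmentNumber) {
    return partCount >= attachmentNumber;
  }

  // Assumed intent: reject only indexes that do not address an existing part.
  static boolean isOutOfRange(int partCount, int attachmentNumber) {
    return attachmentNumber < 0 || attachmentNumber >= partCount;
  }

  public static void main(String[] args) {
    int partCount = 2; // the value of mp.getCount() reported above
    for (int attachmentNumber = 0; attachmentNumber < partCount; attachmentNumber++) {
      System.out.println("index " + attachmentNumber
          + ": quotedCheck=" + quotedCheck(partCount, attachmentNumber)
          + ", isOutOfRange=" + isOutOfRange(partCount, attachmentNumber));
    }
  }
}
```

With a count of 2, both index 0 and index 1 satisfy the quoted comparison, so both documents would be deleted. That is consistent with indexing succeeding only after those lines were removed.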

Regards,
Cihad Guzel

2017-02-09 16:07 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:

> Hi Karl,
>
> I made some changes in the code and then the indexing was done
> successfully.
>
> The changes are as follows:
>
> I have removed these lines (lines: 772-775):
>
>              if (mp.getCount() >= attachmentNumber) {
>                 activities.deleteDocument(documentIdentifier);
>                 continue;
>               }
>
> I updated these lines (lines 1485 and 1586) from:
>       int index2 = di.indexOf("/", index1 + 1);
> to:
>       int index2 = di.indexOf(":", index1 + 1);
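
A minimal sketch of why the separator matters, assuming a document identifier laid out as <attachmentNumber>:<folder>:<messageId>, with an empty attachment number for the email itself. That layout and every name below are assumptions for illustration only; the real format is whatever EmailConnector.java defines. The folder name myFolder/test is the one that appears in the error message quoted further down.

```java
// Hypothetical sketch; the identifier layout is assumed, not taken from EmailConnector.
final class DocumentIdentifierSketch {

  // Builds "<attachmentNumber>:<folder>:<messageId>"; the prefix is empty for the email itself.
  static String makeIdentifier(String folder, String messageId, Integer attachmentNumber) {
    String prefix = (attachmentNumber == null) ? "" : attachmentNumber.toString();
    return prefix + ":" + folder + ":" + messageId;
  }

  // Returns null for the email itself, otherwise the attachment index.
  static Integer extractAttachmentNumber(String di) {
    int index1 = di.indexOf(":");
    if (index1 == -1) {
      throw new IllegalArgumentException("Bad document identifier: '" + di + "'");
    }
    String prefix = di.substring(0, index1);
    return prefix.isEmpty() ? null : Integer.valueOf(prefix);
  }

  // The folder is the text between the first and second ':'.
  static String extractFolder(String di) {
    int index1 = di.indexOf(":");
    int index2 = di.indexOf(":", index1 + 1);
    if (index1 == -1 || index2 == -1) {
      throw new IllegalArgumentException("Bad document identifier: '" + di + "'");
    }
    return di.substring(index1 + 1, index2);
  }

  // The message id is everything after the second ':'.
  static String extractMessageId(String di) {
    int index1 = di.indexOf(":");
    int index2 = di.indexOf(":", index1 + 1);
    if (index1 == -1 || index2 == -1) {
      throw new IllegalArgumentException("Bad document identifier: '" + di + "'");
    }
    return di.substring(index2 + 1);
  }

  public static void main(String[] args) {
    String di = makeIdentifier("myFolder/test", "<id@mail.example>", 1);
    System.out.println(extractAttachmentNumber(di) + " | "
        + extractFolder(di) + " | " + extractMessageId(di));
  }
}
```

Searching for "/" as the second delimiter would stop inside myFolder/test and cut the folder name short, whereas ":" finds the real field boundary. Under this assumed layout a folder name that itself contains ":" would still be split incorrectly.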
>
> Regards,
> Cihad Guzel
>
>
>
>
> 2017-02-08 2:10 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>
>> Hi Cihad,
>>
>> You need to set an attachment URL template for the attachments to be
>> crawled.  Open your email connection and click the "URL" tab, and you will
>> see the new field there.
>>
>> Karl
>>
>>
>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> Shouldn't the 'else' branch be executed when the email has an
>>> attachment?
>>> Although the email has an attachment, only the first branch was executed.
>>> Also, I don't see the attachment's content in the Solr index.
>>>
>>> I edited the code for testing as follows:
>>>
>>>  if (attachmentIndex == null) {
>>>           // It's an email
>>>           System.out.println("running if block");
>>> ...
>>>         } else {
>>>           System.out.println("running else block");
>>>           // It's an attachment
>>>           attachmentNumber = attachmentIndex;
>>> ...
>>>         }
>>>
>>> Then I ran my job. The block was executed 3 times. The log looks like this:
>>>
>>> ...
>>> running if block
>>> running if block
>>> running if block
>>> ...
>>>
>>>
>>> The Solr response:
>>>
>>> {
>>>         "subject":["pdf test page"],
>>>         "from":["Cihad Guzel <cguzelg@gmail.com>"],
>>>         "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=
>>> %3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mai
>>> l.gmail.com%3E",
>>>         "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>>         "mimetype":["",
>>>           ""],
>>>         "created_date":"2017-02-07T17:37:35.000Z",
>>>         "indexed_date":"2017-02-07T21:18:05.382Z",
>>>         "to":["Cihad Guzel <cguzelg@gmail.com>"],
>>>         "modified_date":"2017-02-07T17:37:35.000Z",
>>>         "encoding":["",
>>>           ""],
>>>         "mime_type":"text/plain",
>>>         "stream_size":["null"],
>>>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>>           "org.apache.tika.parser.txt.TXTParser"],
>>>         "stream_content_type":["text/plain"],
>>>         "content_encoding":["windows-1252"],
>>>         "content_type":["text/plain; charset=windows-1252"],
>>>         "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
>>> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--94eb2c1910841
>>> bc5530547f43441\r\nContent-Type: text/plain; charset=UTF-8\r\n\r\nthis
>>> is test mail for mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>>>         "language":"en",
>>>         "_version_":1558710621053124608}]
>>>   }
>>>
>>>
>>>
>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>
>>>> Here's the full code for this class:
>>>>
>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors
>>>> /email/connector/src/main/java/org/apache/manifoldcf/crawler
>>>> /connectors/email/EmailConnector.java
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Cihad,
>>>>>
>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>> attachment is being processed.  The code should look like this:
>>>>>
>>>>>         if (attachmentIndex == null) {
>>>>>           // It's an email
>>>>> ...
>>>>>         } else {
>>>>>           // It's an attachment
>>>>>           attachmentNumber = attachmentIndex;
>>>>> ...
>>>>>         }
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>>>>>
>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>
>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>> Please let me know if that fixes the issue.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I get the following error:
>>>>>>>>
>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>>>>>
>>>>>>>>> Thanks Karl,
>>>>>>>>>
>>>>>>>>> I will try it.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Cihad Guzel
>>>>>>>>>
>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>>>> CONNECTORS-1375.  Please let me know if it works for you; if not,
>>>>>>>>>> I'll fix what doesn't work.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>
>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>> unextracted to the output connector as metadata attribute data.
>>>>>>>>>>>> There are no transformation connectors that look at this
>>>>>>>>>>>> metadata.  Solr cell also probably does not handle binary in
>>>>>>>>>>>> random metadata attributes the proper way.
>>>>>>>>>>>>
>>>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>>>> only to deal with textual attachments.  The right solution is to
>>>>>>>>>>>> have individual IDs for each attachment.  But that would also
>>>>>>>>>>>> require there to be a URL we could construct for each attachment.
>>>>>>>>>>>> We could provide an additional URI template for attachments, but
>>>>>>>>>>>> I'd wonder if your system has the ability to serve attachments by
>>>>>>>>>>>> their own URLs.  Please let me know if this would work and if so
>>>>>>>>>>>> I can create a ticket and work on making these changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the
>>>>>>>>>>>>> file [1] to a new email and sent it to my test email address.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>>>>> successfully. However, Solr's content field does not contain
>>>>>>>>>>>>> my attachment's content. The Solr content field looks like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition:
>>>>>>>>>>>>> attachment; filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>> type is PDF? Doesn't it extract the contents?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks
>>>>>>>>> Cihad Güzel
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks
>>>>>>>> Cihad Güzel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Cihad Güzel
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Cihad Güzel
>>>
>>
>>
>
>
> --
> Thanks
> Cihad Güzel
>



-- 
Thanks
Cihad Güzel
