manifoldcf-user mailing list archives

From Cihad Guzel <cguz...@gmail.com>
Subject Re: extract email attachment
Date Thu, 09 Feb 2017 13:07:35 GMT
Hi Karl,

I made some changes to the code, and indexing then completed successfully.

The changes are as follows:

I removed these lines (lines 772-775):

              if (mp.getCount() >= attachmentNumber) {
                activities.deleteDocument(documentIdentifier);
                continue;
              }

I updated these lines (lines 1485 and 1586), changing
      int index2 = di.indexOf("/", index1 + 1);
to
      int index2 = di.indexOf(":", index1 + 1);
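To illustrate why the separator matters, here is a minimal sketch. It is not the connector's actual code: the class name, the helper folderOf, the identifier layout "<attachmentIndex>:<folder>:<messageId>", and the sample message ID are all assumptions made for the example. The point is only that a folder path such as "myFolder/test" contains "/" itself, so searching for "/" cuts the folder name short, while searching for ":" finds the real field boundary.

```java
final class DocumentIdSketch {

  /**
   * Returns the text between the first ':' and the next occurrence of
   * the given separator. Hypothetical helper, for illustration only.
   */
  static String folderOf(String di, String separator) {
    int index1 = di.indexOf(":");
    if (index1 == -1) {
      throw new IllegalArgumentException("Bad document identifier: " + di);
    }
    int index2 = di.indexOf(separator, index1 + 1);
    if (index2 == -1) {
      throw new IllegalArgumentException("Bad document identifier: " + di);
    }
    return di.substring(index1 + 1, index2);
  }

  public static void main(String[] args) {
    // Assumed layout: attachment index, folder path, message ID.
    String di = "0:myFolder/test:<id@mail.example.com>";

    // Searching for "/" stops inside the folder path.
    System.out.println(folderOf(di, "/")); // prints "myFolder"

    // Searching for ":" stops at the real field boundary.
    System.out.println(folderOf(di, ":")); // prints "myFolder/test"
  }
}
```

With "/" as the separator, the rest of the identifier is split in the wrong place, which may be how a string like "myFolder/test:<...>" ended up being passed to Integer.parseInt in the earlier stack trace.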

Regards,
Cihad Guzel




2017-02-08 2:10 GMT+03:00 Karl Wright <daddywri@gmail.com>:

> Hi Cihad,
>
> You need to set an attachment URL template for the attachments to be
> crawled.  Open your email connection and click the "URL" tab, and you will
> see the new field there.
>
> Karl
>
>
> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>
>> Hi Karl,
>>
>> Shouldn't the 'else' branch be executed when the email has an
>> attachment? Although the email has an attachment, only the first part
>> was processed. Also, I don't see the attachment's content in the Solr
>> index.
>>
>> I edited the code for testing as follows:
>>
>>  if (attachmentIndex == null) {
>>           // It's an email
>>           System.out.println("running if block");
>> ...
>>         } else {
>>           System.out.println("running else block");
>>           // It's an attachment
>>           attachmentNumber = attachmentIndex;
>> ...
>>         }
>>
>> Then I ran my job. The document was processed 3 times. The log looks like this:
>>
>> ...
>> running if block
>> running if block
>> running if block
>> ...
>>
>>
>> The solr response:
>>
>> {
>>         "subject":["pdf test page"],
>>         "from":["Cihad Guzel <cguzelg@gmail.com>"],
>>         "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
>>         "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>         "mimetype":["",
>>           ""],
>>         "created_date":"2017-02-07T17:37:35.000Z",
>>         "indexed_date":"2017-02-07T21:18:05.382Z",
>>         "to":["Cihad Guzel <cguzelg@gmail.com>"],
>>         "modified_date":"2017-02-07T17:37:35.000Z",
>>         "encoding":["",
>>           ""],
>>         "mime_type":"text/plain",
>>         "stream_size":["null"],
>>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>           "org.apache.tika.parser.txt.TXTParser"],
>>         "stream_content_type":["text/plain"],
>>         "content_encoding":["windows-1252"],
>>         "content_type":["text/plain; charset=windows-1252"],
>>         "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
>> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--94eb2c1910841
>> bc5530547f43441\r\nContent-Type: text/plain; charset=UTF-8\r\n\r\nthis
>> is test mail for mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>>         "language":"en",
>>         "_version_":1558710621053124608}]
>>   }
>>
>>
>>
>> 2017-02-08 1:17 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>
>>> Here's the full code for this class:
>>>
>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Cihad,
>>>>
>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>> attachment is being processed.  The code should look like this:
>>>>
>>>>         if (attachmentIndex == null) {
>>>>           // It's an email
>>>> ...
>>>>         } else {
>>>>           // It's an attachment
>>>>           attachmentNumber = attachmentIndex;
>>>> ...
>>>>         }
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I added a log line for testing. It looks like attachmentIndex is null.
>>>>>
>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>
>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>> Please let me know if that fixes the issue.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> I get the following error:
>>>>>>>
>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>
>>>>>>>
>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>>>>
>>>>>>>> Thanks Karl,
>>>>>>>>
>>>>>>>> I will try it.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Cihad Guzel
>>>>>>>>
>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>
>>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>>> CONNECTORS-1375.  Please let me know if it works for you; if not,
>>>>>>>>> I'll fix what doesn't work.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>
>>>>>>>>>>> The email connector is providing the attachment data unextracted
>>>>>>>>>>> to the output connector as metadata attribute data.  There are no
>>>>>>>>>>> transformation connectors that look at this metadata.  Solr cell
>>>>>>>>>>> also probably does not handle binary in random metadata attributes
>>>>>>>>>>> the proper way.
>>>>>>>>>>>
>>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>>> only to deal with textual attachments.  The right solution is to
>>>>>>>>>>> have individual IDs for each attachment.  But that would also
>>>>>>>>>>> require there to be a URL we could construct for each attachment.
>>>>>>>>>>> We could provide an additional URI template for attachments, but
>>>>>>>>>>> I'd wonder if your system has the ability to serve attachments by
>>>>>>>>>>> their own URLs.  Please let me know if this would work and if so I
>>>>>>>>>>> can create a ticket and work on making these changes.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the file [1]
>>>>>>>>>>>> to a new email and sent it to my test email address.
>>>>>>>>>>>>
>>>>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>>>>
>>>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>>>> successfully. But Solr's content field does not contain my
>>>>>>>>>>>> attachment's extracted content. The Solr content field looks like:
>>>>>>>>>>>>
>>>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
>>>>>>>>>>>> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>>>>>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>> ..."
>>>>>>>>>>>>
>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>> type is PDF? Does it not extract the contents?
>>>>>>>>>>>>
>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks
>>>>>>>> Cihad Güzel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Cihad Güzel
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Cihad Güzel
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks
>> Cihad Güzel
>>
>
>


-- 
Thanks
Cihad Güzel
