manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1410) Binary Attachment Data as Plain Text at Email Content
Date Sat, 15 Apr 2017 18:04:41 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970058#comment-15970058 ]

Karl Wright commented on CONNECTORS-1410:
-----------------------------------------

[~kamaci]: So your claim is that:

{code}
InputStream is = msg.getInputStream();
... ingest ...
{code}

... is exactly the same as:

{code}
Object o = msg.getContent();
if (o instanceof Multipart) {
  Multipart mp = (Multipart) msg.getContent();
  for (int k = 0, n = mp.getCount(); k < n; k++) {
    Part part = mp.getBodyPart(k);
    String disposition = part.getDisposition();
    // Only parts without a disposition are treated as body text
    if (disposition == null) {
      MimeBodyPart mbp = (MimeBodyPart) part;
      if (mbp.isMimeType(EmailConfig.MIMETYPE_TEXT_PLAIN)) {
        rd.addField(EmailConfig.EMAIL_BODY, mbp.getContent().toString());
      } else if (mbp.isMimeType(EmailConfig.MIMETYPE_HTML)) {
        // handle html accordingly. Returns content with html tags
        rd.addField(EmailConfig.EMAIL_BODY, mbp.getContent().toString());
      }
    }
  }
} else if (o instanceof String) {
  rd.addField(EmailConfig.EMAIL_BODY, (String)o);
}
... ingest ...
{code}

If that's the case, then I should have caught this earlier; indexing the BODY twice
is just plain wrong.  I think we should take out all references to the BODY throughout the
connector and just use the InputStream.
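
For reference, the selection rule in the second snippet (keep only parts with no disposition and a text MIME type) can be sketched without the JavaMail dependency. This is only an illustration under assumptions: `BodyOnlySketch`, `SimplePart`, and `extractBody` are hypothetical names that stand in for `javax.mail.Part` handling and are not part of the connector, and joining the parts with a newline is a simplification of the repeated `rd.addField` calls.

```java
import java.util.Arrays;
import java.util.List;

class BodyOnlySketch {

    // Hypothetical stand-in for a JavaMail Part: just the three values the rule needs.
    static final class SimplePart {
        final String disposition; // null for body parts, e.g. "attachment" otherwise
        final String mimeType;    // e.g. "text/plain"
        final String content;     // already-decoded content, for illustration only

        SimplePart(String disposition, String mimeType, String content) {
            this.disposition = disposition;
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    // Collects only the body text: parts with no disposition and a text MIME type.
    static String extractBody(List<SimplePart> parts) {
        StringBuilder body = new StringBuilder();
        for (SimplePart part : parts) {
            if (part.disposition != null) {
                continue; // attachment or other non-body part: skipped here
            }
            boolean isText = "text/plain".equalsIgnoreCase(part.mimeType)
                    || "text/html".equalsIgnoreCase(part.mimeType);
            if (isText) {
                if (body.length() > 0) {
                    body.append('\n');
                }
                body.append(part.content);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<SimplePart> parts = Arrays.asList(
                new SimplePart(null, "text/plain", "Hello, see attached."),
                new SimplePart("attachment", "application/pdf", "%PDF-1.4 ..."));
        // Only the text part is kept; the attachment's data never reaches the body.
        System.out.println(extractBody(parts));
    }
}
```

If the stream returned by msg.getInputStream() already carries this same text, then populating the BODY field as well is where the double indexing would come from.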



> Binary Attachment Data as Plain Text at Email Content
> -----------------------------------------------------
>
>                 Key: CONNECTORS-1410
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1410
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Email connector
>    Affects Versions: ManifoldCF 2.6
>            Reporter: Furkan KAMACI
>            Assignee: Furkan KAMACI
>             Fix For: ManifoldCF 2.8
>
>         Attachments: CONNECTORS-1410.patch, CONNECTORS-1410.patch
>
>
> Previously, we indexed e-mails and their attachments together. CONNECTORS-1375 changed
> this logic so that an e-mail and its attachments are indexed separately.
> However, there is a problem: the content field of an e-mail that has attachments includes
> both the body and the attachments' binary content as plain text.
> Since we index attachments separately, we can index just the body as content instead of
> appending the e-mail body and all attachments' binary data as plain text.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
