james-mime4j-dev mailing list archives

From Wolfgang Fahl ...@bitplan.com>
Subject Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)
Date Mon, 29 Sep 2014 13:59:35 GMT
Hi Eric, Ioan, Oleg and others,

as offered in July:
> I would also like to add more test cases and especially include some
> dummy mboxes. And as mentioned I'd like to check the iterator against
> all my Thunderbird mboxes to check
> whether it will successfully parse them all. 
I started doing this based on the improvements that you kindly checked
in in the meantime.
So I am working with 0.8.0-SNAPSHOT at this time.

I intend to run the iterator against about a quarter of a million emails
in some 850 mailboxes. With 0.7.2 I got as far as roughly message 400.
With 0.8.0-SNAPSHOT the library chokes at roughly message 4000,
which is from the Apple Store!

It contains:

<2220106.13625.3174793.1262860903.627.0@apple.com>
Content-Type: TEXT/HTML; CHARSET=None
Content-Transfer-Encoding: QUOTED-PRINTABLE


And I ran into bug
https://issues.apache.org/jira/browse/MIME4J-218
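
For context, a minimal sketch of why a header like CHARSET=None trips a strict parser: the JDK has no charset of that name, so Charset.forName throws UnsupportedCharsetException. The class name CharsetNoneDemo is hypothetical and only JDK calls are used; how mime4j wraps this failure internally is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical demo class, not part of mime4j.
class CharsetNoneDemo {

    /** Returns true if the JVM can resolve the given charset name. */
    static boolean canResolve(String name) {
        try {
            Charset.forName(name);
            return true;
        } catch (UnsupportedCharsetException ex) {
            // The name is syntactically legal, but no such charset is installed.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("None resolvable:  " + canResolve("None"));
        System.out.println("UTF-8 resolvable: " + canResolve("UTF-8"));
    }
}
```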

I tried:
    /**
     * Lenient BodyFactory that works around the "won't fix" behaviour of
     * https://issues.apache.org/jira/browse/MIME4J-218
     *
     * @author wf
     */
    public static class LenientBodyFactory extends BasicBodyFactory {

        @Override
        public Charset resolveCharset(final String mimeCharset)
                throws UnsupportedEncodingException {
            Charset result = Charset.defaultCharset();
            try {
                result = super.resolveCharset(mimeCharset);
            } catch (UnsupportedEncodingException ex) {
                // ignore and keep the default charset
            }
            return result;
        }
    }

That didn't work, since resolveCharset is private static and so cannot be
overridden ... :-(

I proposed the following fix for
dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java:

    public static boolean lenient = true;

    /**
     * Select the Charset for the given mimeCharset string.
     *
     * If you need support for non-standard or invalid mimeCharset
     * specifications you might want to create your own derived BodyFactory
     * extending BasicBodyFactory and overriding this method as suggested by:
     *   https://issues.apache.org/jira/browse/MIME4J-218
     *
     * The default behaviour is lenient: invalid mimeCharset specs will
     * return the defaultCharset.
     *
     * @param mimeCharset the string specification for a charset, e.g. "UTF-8"
     * @throws UnsupportedEncodingException if the mimeCharset is invalid
     *         and lenient is false
     */
    protected Charset resolveCharset(final String mimeCharset)
            throws UnsupportedEncodingException {
        Charset result = null;
        if (lenient) {
            result = Charset.defaultCharset();
        }
        if (mimeCharset != null) {
            try {
                result = Charset.forName(mimeCharset);
            } catch (UnsupportedCharsetException ex) {
                if (!lenient) {
                    throw new UnsupportedEncodingException(mimeCharset);
                }
            }
        }
        return result;
    }
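
One caveat with the proposal above: Charset.forName can also throw IllegalCharsetNameException (for names containing illegal characters, such as a space), and that exception is not a subclass of UnsupportedCharsetException. Below is a standalone sketch of the same lenient idea that covers both cases. It uses only the JDK; the class LenientCharsets and its resolve method are hypothetical names, not mime4j API, and the lenient flag is passed as a parameter here rather than kept in a static field.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of mime4j.
class LenientCharsets {

    /**
     * Resolves a MIME charset name. In lenient mode a null, unknown or
     * syntactically illegal name yields the platform default charset.
     * In strict mode null yields null and a bad name throws.
     */
    static Charset resolve(String mimeCharset, boolean lenient)
            throws UnsupportedEncodingException {
        Charset fallback = lenient ? Charset.defaultCharset() : null;
        if (mimeCharset == null) {
            return fallback;
        }
        try {
            return Charset.forName(mimeCharset);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException ex) {
            if (!lenient) {
                throw new UnsupportedEncodingException(mimeCharset);
            }
            return fallback;
        }
    }
}
```

With this sketch, resolve("None", true) and resolve("utf 8", true) both fall back to the default charset, while resolve("None", false) throws UnsupportedEncodingException.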

Now I am hoping to test this fix. I assume I have to add a test message
to the core module under:
   src/test/resources/testmsgs

But to really check the new behaviour there would have to be three
different tests:
1. check invalid mimeCharset in lenient mode - will work with the default
   Charset
2. check invalid mimeCharset in non-lenient mode - will throw an exception
3. check invalid mimeCharset in non-lenient mode with overridden
   resolveCharset - will work with the chosen mapped Charset.
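
The three cases above could be sketched roughly as follows. This is a standalone illustration in plain Java, not a mime4j test: Resolver is a hypothetical stand-in for the patched factory, it uses a per-instance lenient flag instead of the static field so the cases stay independent, and the mapping of "None" to ISO-8859-1 in the subclass is an arbitrary choice for the example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical stand-in for the patched factory, not mime4j code.
class Resolver {
    final boolean lenient;

    Resolver(boolean lenient) {
        this.lenient = lenient;
    }

    protected Charset resolveCharset(String mimeCharset)
            throws UnsupportedEncodingException {
        Charset result = lenient ? Charset.defaultCharset() : null;
        if (mimeCharset != null) {
            try {
                result = Charset.forName(mimeCharset);
            } catch (UnsupportedCharsetException ex) {
                if (!lenient) {
                    throw new UnsupportedEncodingException(mimeCharset);
                }
            }
        }
        return result;
    }
}

// Case 3: a strict resolver whose subclass maps a known-bad name.
class MappingResolver extends Resolver {
    MappingResolver() {
        super(false);
    }

    @Override
    protected Charset resolveCharset(String mimeCharset)
            throws UnsupportedEncodingException {
        if ("None".equalsIgnoreCase(mimeCharset)) {
            return StandardCharsets.ISO_8859_1; // arbitrary choice for this sketch
        }
        return super.resolveCharset(mimeCharset);
    }
}

class ResolverChecks {
    static void check(boolean ok, String label) {
        if (!ok) {
            throw new AssertionError(label);
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. lenient mode falls back to the default charset
        check(Charset.defaultCharset().equals(
                new Resolver(true).resolveCharset("None")), "lenient");

        // 2. non-lenient mode throws
        boolean thrown = false;
        try {
            new Resolver(false).resolveCharset("None");
        } catch (UnsupportedEncodingException expected) {
            thrown = true;
        }
        check(thrown, "strict");

        // 3. non-lenient mode with an overriding subclass uses the mapped charset
        check(StandardCharsets.ISO_8859_1.equals(
                new MappingResolver().resolveCharset("None")), "mapped");
    }
}
```

In a real patch these checks would presumably become unit tests driven by a test message with an invalid charset parameter, but that wiring depends on the project's test setup and is not shown.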

Please let me know how I can add these tests and how to get a proper
patchset going. I don't work much with subversion these days -
I prefer to use git.

Cheers

Wolfgang

On 10.08.14 at 10:33, Stan Ioan Eugen wrote:
> Hello Wolfgang,
>
> Sorry for my late reply.  I've created a Jira ticket to track this
> issue. As Eric suggested, it's the right way to get code into the
> project.
> I've looked over the code and it looks good in general. I would keep
> both variants of the regular expression to match FROM lines, with  a
> good  javadoc, so users can use any of them in their code. I would
> also move the 'mbox != null' check inside the constructor - this way
> we make sure we don't create an object in an inconsistent state.
>
> I will be more than happy to push the patch upstream once we have some
> tests for the new behavior. Are you interested in providing the tests?
>
> Please use the issue for patch submission and relevant comments.
> https://issues.apache.org/jira/browse/MIME4J-242
>
> Thanks,
>
>
> 2014-08-03 10:52 GMT+03:00 Eric Charles <eric@apache.org>:
>> Could you open a JIRA issue on https://issues.apache.org/jira/browse/MIME4J
>> and upload there your patch? Thx.
>>
>> On 07/23/2014 09:57 AM, Wolfgang Fahl wrote:
>>> Hi Ioan Eugen,
>>>
>>> please find attached a patch.
>>>
>>> it uses the following fromline pattern:
>>> static final String DEFAULT = "^From \\S+.*\\d{4}$";
>>> so that it matches more lines.
>>> 1. From ieugen@apache.org Fri Sep 09 14:04:52 2011
>>> 2. From MAILER-DAEMON Wed Oct 05 21:54:09 2011
>>> 3. From - Wed Apr 02 06:51:08 2014
>>>
>>> so looking for an "@" sign is not enforced any more.
>>>
>>> The patch fixes a typo:
>>> -    private Matcher fromLineMathcer;
>>> +    private Matcher fromLineMatcher;
>>>
>>> in many places of the source code.
>>>
>>> It adds a reference to the original mbox File so that the error message:
>>> +                 if (mbox!=null)
>>> +                       path=mbox.getPath();
>>> +            throw new IllegalArgumentException("File "+path+" does not
>>> contain From_ lines that match the pattern
>>> '"+MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");
>>>
>>> can be improved.
>>>
>>> Who is going to check this patch and what needs to be done to get it
>>> into the official repo?
>>> I would also like to add more test cases and especially include some
>>> dummy mboxes. And as mentioned I'd like to check the iterator against
>>> all my Thunderbird mboxes to check
>>> whether it will successfully parse them all. Also I am offering to write
>>> a few "tutorial lines". Where would I have to put these?
>>>
>>> Cheers
>>>   Wolfgang
>>>
>>> On 22.07.14 22:23, Ioan Eugen Stan wrote:
>>>> Hello Wolfgang,
>>>>
>>>> I developed MailboxIterator. It's nice to see that it's helpful :)
>>>>
>>>> You get that error because MboxIterator does not know how to split the
>>>> messages. Messages in an mbox file are separated via lines that start
>>>> with '' From:'. They are called (by me at least) 'From lines' :) .
>>>> One problem with the mbox format is that it's a bit 'free-form' in the
>>>> sense that developers abused it and we have some variants [1].
>>>>
>>>> One thing that you could try is to supply a different From line
>>>> regular expression to MboxIterator via regexpPattern argument. It will
>>>> split messages based on this new value.
>>>>
>>>> [1] http://wiki2.dovecot.org/MailboxFormat/mbox
>>>>
>>>> Good luck and please post your results.
>>>>
>>>> Regards,
>>>>
>>>> On Fri, Jul 18, 2014 at 12:53 PM, Wolfgang Fahl <wf@bitplan.com> wrote:
>>>>> Dear mime4j developers,
>>>>>
>>>>> for one of my projects I have been using mime4j successfully to import
>>>>> e-mail into our CRM database for some two years now.
>>>>> Currently I am trying to add a feature which would allow reading Mozilla
>>>>> Thunderbird Mailbox content.
>>>>> As of mime4j 0.8 there seems to be a MboxIterator which could do that.
>>>>> Since I didn't find any publicly available source repository which I
>>>>> could use to access the 0.8-SNAPSHOT I have copied
>>>>> the three source files:
>>>>> * CharBufferWrapper.java
>>>>> * FromLinePatterns.java
>>>>> * MboxIterator.java
>>>>>
>>>>> into my source tree and I am using these together with the following
>>>>> maven dependency:
>>>>>
>>>>> <!-- EMail handling -->
>>>>>         <dependency>
>>>>>             <groupId>org.apache.james</groupId>
>>>>>             <artifactId>apache-mime4j-core</artifactId>
>>>>>             <version>0.7.2</version>
>>>>>         </dependency>
>>>>>         <dependency>
>>>>>             <groupId>org.apache.james</groupId>
>>>>>             <artifactId>apache-mime4j-dom</artifactId>
>>>>>             <version>0.7.2</version>
>>>>>         </dependency>
>>>>>
>>>>> The iterator works somewhat o.k. on some of the Thunderbird mailbox
>>>>> files and loops through the mails in them correctly.
>>>>> The mails can then not be parsed directly with mime4j - there is one
>>>>> newline at the beginning which spoils the show. After
>>>>> working around this it's working as expected in some cases. In other
>>>>> cases there is an error:
>>>>>
>>>>> java.lang.IllegalArgumentException: File does not contain From_ lines!
>>>>> Maybe not be a vaild Mbox.
>>>>>     at
>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator.initMboxIterator(MboxIterator.java:85)
>>>>>     at
>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:75)
>>>>>     at
>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:62)
>>>>>     at
>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator$Builder.build(MboxIterator.java:241)
>>>>>     at
>>>>> com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:386)
>>>>>     at
>>>>> com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:261)
>>>>>     at
>>>>> com.bitplan.clientutils.rest.TestMailAccess.testMailById(TestMailAccess.java:77)
>>>>>
>>>>> By the way - there is a typo in the above error message "vaild" should
>>>>> be "valid".
>>>>>
>>>>> The error is something I'd like to fix or work-around.
>>>>>
>>>>> I have two big user accounts with several hundred mailbox files and some
>>>>> 300.000 mails from the last 15 years which I'd like
>>>>> to use as a testcase against which to run the mime4j implementation.
>>>>>
>>>>> Would you please supply me with some pointers where I get the necessary
>>>>> source code and how I could supply patches and
>>>>> testcases for the project?
>>>>>
>>>>> Also it would be good to know whether others would be interested in the
>>>>> Thunderbird Mailbox reading capability.
>>>>>
>>>>>
>>>>> Cheers
>>>>>   Wolfgang
>>>>>
>>>>> --
>>>>>
>>>>> BITPlan - smart solutions
>>>>> Wolfgang Fahl
>>>>> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
>>>>> Tel. +49 2154 811-480, Fax +49 2154 811-481
>>>>> Web: http://www.bitplan.de
>>>>> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl
>>>>>
>>>>
>
>

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl

