incubator-any23-dev mailing list archives

From Peter Ansell <ansell.pe...@gmail.com>
Subject Re: Upgrade to Tika 1.2 [WAS] Re: [ANNOUNCE] Welcome Peter Ansell as Any23 PPMC member and committer
Date Wed, 08 Aug 2012 02:10:17 GMT
Hi Lewis,

It has been a while since I did the update to Tika 1.1, but the upgrade
would be very easy to do independently of any module reorganisation.

The major component involved updating mimetypes.xml and
tika-config.xml based on the resources extracted from the Tika 1.1 jar
file:
https://github.com/ansell/any23/tree/ansellpatches/mime/src/main/resources/org/apache/any23/mime

I also modified the default MIME types to match the current drafts of
each of the standards and added the previous MIME types as aliases, as
Any23 has so far been using non-standard MIME types:
https://github.com/ansell/any23/commit/8d3162c6510fa76aad0316e9e8be5ea66ee0fe7c

Some of the test failures that I encountered were due to license
headers being added to the test files just before I started making my
changes. The license headers had periods inside comments, which the
MIME detector regexes incorrectly treated as the end of a statement.
This has since been picked up and the license headers were removed,
but I think the MIME type detection code still has a bug if people put
comments at the top of RDF NQuads or RDF NTriples files, as it still
relies on the period as a context-less delimiter:
https://github.com/ansell/any23/blob/trunk/core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java#L96
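
To illustrate the comment problem described above, here is a minimal
sketch in Java. It is not the Any23 TikaMIMETypeDetector code; the class
and method names (StatementBoundary, naiveEnd, contextAwareEnd) are
hypothetical. It contrasts a context-less search for the first period
with a scanner that ignores periods inside comments, IRIs and quoted
literals. It assumes '#' starts a comment only outside IRIs and literals,
and it does not handle periods inside blank node labels.

```java
// Hypothetical illustration only; not taken from Any23.
class StatementBoundary {

    /** Context-less approach: the first '.' anywhere ends the statement. */
    static int naiveEnd(String input) {
        return input.indexOf('.');
    }

    /**
     * Context-aware approach: skips periods that occur inside a '#'
     * comment, an IRI (<...>) or a quoted literal ("...").
     * Returns the index of the terminating '.', or -1 if none is found.
     */
    static int contextAwareEnd(String input) {
        boolean inComment = false, inIri = false, inLiteral = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inComment) {
                if (c == '\n') inComment = false;   // comment runs to end of line
            } else if (inLiteral) {
                if (c == '\\') i++;                 // skip the escaped character
                else if (c == '"') inLiteral = false;
            } else if (inIri) {
                if (c == '>') inIri = false;
            } else if (c == '#') {
                inComment = true;
            } else if (c == '"') {
                inLiteral = true;
            } else if (c == '<') {
                inIri = true;
            } else if (c == '.') {
                return i;                           // genuine statement terminator
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String doc = "# Licensed to the ASF. See NOTICE.\n"
                   + "<http://example.org/s> <http://example.org/p> \"a. b\" .\n";
        System.out.println("naive end:         " + naiveEnd(doc));
        System.out.println("context-aware end: " + contextAwareEnd(doc));
    }
}
```

With a license-style comment on the first line, the naive search stops
inside the comment, while the context-aware scan reaches the period that
actually closes the first statement.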

In terms of the actual detector, I ended up switching off the regex
pattern recognition and switching to an alternative method that uses
more complex character-based boundaries to extract a sample. The
sample is then parsed, and if the parse succeeds the input is
recognised as that MIME type. This may not be the best way to do it,
although it has worked for me so far. This change is the main part
that needs review:
https://github.com/ansell/any23/blob/ansellpatches/mime/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java
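
The parse-a-sample idea described above can be sketched roughly as
follows. This is an assumption-laden illustration in Java, not the code
in the linked branch: TrialParseDetector, SampleParser and
fixedTermCount are hypothetical names, the MIME type strings are only
labels chosen for the example, and the toy line checker stands in for a
real RDF parser (it splits on whitespace, so it would reject literals
containing spaces).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; not taken from Any23.
class TrialParseDetector {

    /** Stand-in for a real parser: throws if the sample is not valid. */
    interface SampleParser {
        void parse(String sample) throws Exception;
    }

    // Insertion order decides which candidate is tried first.
    private final Map<String, SampleParser> candidates = new LinkedHashMap<>();

    void register(String mimeType, SampleParser parser) {
        candidates.put(mimeType, parser);
    }

    /** Returns the first MIME type whose parser accepts the sample. */
    Optional<String> detect(String sample) {
        for (Map.Entry<String, SampleParser> e : candidates.entrySet()) {
            try {
                e.getValue().parse(sample);
                return Optional.of(e.getKey());
            } catch (Exception ex) {
                // Parse failed, so try the next candidate.
            }
        }
        return Optional.empty();
    }

    /**
     * Toy checker: every statement line must hold exactly `terms`
     * whitespace-separated tokens followed by a lone ".". Blank lines
     * and whole-line '#' comments are skipped. At least one statement
     * is required.
     */
    static SampleParser fixedTermCount(int terms) {
        return sample -> {
            int statements = 0;
            for (String line : sample.split("\\r?\\n")) {
                String t = line.trim();
                if (t.isEmpty() || t.startsWith("#")) continue;
                String[] tokens = t.split("\\s+");
                if (tokens.length != terms + 1 || !tokens[terms].equals(".")) {
                    throw new IllegalArgumentException("unexpected line: " + t);
                }
                statements++;
            }
            if (statements == 0) {
                throw new IllegalArgumentException("no statements in sample");
            }
        };
    }
}
```

A caller would register one parser per candidate type, for example
fixedTermCount(3) under "application/n-triples" and fixedTermCount(4)
under "application/n-quads", and then call detect(sample). Because a
failed parse simply moves on to the next candidate, comments containing
periods do not affect the result as long as the parser itself handles
comments.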

Peter

On 7 August 2012 21:34, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
> Hi Peter,
>
> Firstly, thanks for the formal introduction; glad that you're now
> officially on board.
>
> I've changed the thread topic slightly to discuss the work you have
> done on your GitHub branch regarding the Tika upgrade. I see that
> you're using Tika 1.1? Would it be possible to phase this into the
> existing codebase before doing the module restructuring that we are
> currently discussing elsewhere?
>
> I vaguely remember you saying that there were some problems with tests
> or something (further to the Tika dependency upgrade) but I cannot
> confirm this just now and it would be great if you could refresh my
> memory.
>
> If we could review (with the intention to merge back into trunk) some
> of your work more incrementally then I think we can phase it in
> quicker... does this make sense?
>
> Thanks very much
> Lewis
>
> On Tue, Aug 7, 2012 at 1:09 AM, Peter Ansell <ansell.peter@gmail.com> wrote:
>> Hi all,
>>
>> I am a software engineer with a PhD in Computer Science. I have worked
>> on a number of RDF-related projects since the start of my PhD, mainly
>> using Sesame, including integrating Sesame with OWLAPI [1] over
>> the last few months to suit my current projects' needs.
>>
>> I am looking in the short term to restructure the Maven modules inside
>> of Any23 so that the different facets can be reused, tested and
>> maintained easily, particularly with a view to using the RDF related
>> Tika enhancements that the Any23 MIME Detector provides. I made these
>> changes a few months ago in my GitHub fork [2], so feel free to review
>> them closely to suggest enhancements before I actually start. I am not
>> sure when I will next have time to clean up the patches. The first
>> step that I want to take is to split out the test resources into a
>> single module and switch from "src/test/resources/*" File based access
>> in tests to using this.getClass().getResourceAsStream("*"). I have
>> implemented those changes in my git repository but the patches may
>> need cleaning up as I have not gone back to review them yet. After
>> that is done, it will be relatively simple to split out both the
>> packages and tests into separate modules.
>>
>> In the short term I have also been tasked by the Sesame Developers
>> with merging the Any23 and Sesametools NQuads parsers and integrating
>> the resulting module into the Sesame Rio package. Then we can have a
>> rock-solid, standards-based, NQuads parser/writer that everyone can
>> easily reuse in a similar way to the other Rio parsers/writers. This
>> is the culmination of the http://www.openrdf.org/issues/browse/SES-802
>> issue that Michele opened over a year ago.
>>
>> Cheers,
>>
>> Peter
>>
>> [1] https://github.com/ansell/owlapi
>> [2] https://github.com/ansell/any23
>>
>> On 4 August 2012 12:25, Mattmann, Chris A (388J)
>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>> Hi Folks,
>>>
>>> A while back, the Any23 PPMC and the Incubator PMC VOTEd to add Peter Ansell
>>> to our ranks as a PPMC member and committer. Peter, welcome!
>>>
>>> Feel free to say a bit about yourself!
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:   http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>
>
>
> --
> Lewis
