incubator-any23-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Upgrade to Tika 1.2 [WAS] Re: [ANNOUNCE] Welcome Peter Ansell as Any23 PPMC member and committer
Date Wed, 08 Aug 2012 09:44:33 GMT
Hi Peter,
Thanks for the explanation and coverage.
I think we should phase in this upgrade as a single change. As you
mention, it does not get more complex with a modular restructuring,
and it is important to get up to date with the Tika deps, as we are
currently way behind.

On Wed, Aug 8, 2012 at 3:10 AM, Peter Ansell <ansell.peter@gmail.com> wrote:
> Hi Lewis,
>
> It has been a while since I did the update to Tika-1.1, but the upgrade
> would be very easy to do independently of any module reorganisation.
>
> The major component involved updating mimetypes.xml and
> tika-config.xml based on the resources extracted from the tika 1.1 jar
> file. https://github.com/ansell/any23/tree/ansellpatches/mime/src/main/resources/org/apache/any23/mime
>
> I also modified the default mime-type to match the current drafts for
> each of the standards and added the previous mime types as aliases, as
> Any23 has so far been using non-standard mime-types
> https://github.com/ansell/any23/commit/8d3162c6510fa76aad0316e9e8be5ea66ee0fe7c
>
> Some of the test failures that I encountered were due to the addition
> of license headers to the test files just before I started making my
> changes. The license headers had periods inside comments that
> incorrectly signalled the end of a statement to the mime detector
> regexes. This was picked up since then and the license headers were
> removed, but I think the mime type detection code still has a bug if
> people put comments at the top of RDF NQuads or RDF NTriples files, as
> it still relies on the period as a context-less delimiter.
> https://github.com/ansell/any23/blob/trunk/core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java#L96
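
A minimal sketch of the failure mode described above, not the actual Any23 detector code. The class name, the simplified statement pattern, and the sample license text are all assumptions made for illustration. It treats the first end-of-line period as a statement terminator, which misfires when a comment header ends in a period, and then shows one possible mitigation: dropping whole-line `#` comments before applying the same check.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from Any23.
final class PeriodDelimiterSketch {
    // Simplified shape of one N-Triples statement, without its final '.'.
    private static final Pattern TRIPLE = Pattern.compile(
            "\\s*<[^>]*>\\s+<[^>]*>\\s+(<[^>]*>|\"[^\"]*\")\\s*");

    // A '.' at the end of a line, taken as a terminator with no further context.
    private static final Pattern END_OF_STATEMENT =
            Pattern.compile("\\.[ \\t]*(\\r?\\n|$)");

    // A whole-line '#' comment, including its line break.
    private static final Pattern COMMENT_LINE =
            Pattern.compile("(?m)^[ \\t]*#.*(\\r?\\n|$)");

    // Context-less check: everything before the first end-of-line '.'
    // must look like a triple.
    static boolean naiveDetect(String content) {
        Matcher end = END_OF_STATEMENT.matcher(content);
        if (!end.find()) {
            return false;
        }
        return TRIPLE.matcher(content.substring(0, end.start())).matches();
    }

    // Same check, applied after whole-line comments have been removed.
    static boolean commentAwareDetect(String content) {
        return naiveDetect(COMMENT_LINE.matcher(content).replaceAll(""));
    }

    public static void main(String[] args) {
        String triple = "<http://example.org/s> <http://example.org/p> \"o\" .\n";
        String withHeader = "# Licensed under the Apache License.\n" + triple;
        System.out.println("plain, naive:          " + naiveDetect(triple));
        System.out.println("header, naive:         " + naiveDetect(withHeader));
        System.out.println("header, comment-aware: " + commentAwareDetect(withHeader));
    }
}
```

Stripping comments first is only one option; it does not address periods inside multi-line constructs, which a real parser would handle.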
>
> In terms of the actual detector, I ended up switching off the regex
> pattern recognition and switching to an alternative method based on
> more complex character based boundaries to extract a sample, which was
> then parsed and if the parse succeeded then it was recognised as that
> mime type. However, this may not be the best way to do it, although it
> works for me so far. This change is the main part that needs review.
> https://github.com/ansell/any23/blob/ansellpatches/mime/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java
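
The sample-then-parse idea can be sketched roughly as follows. This is an assumption-laden illustration, not the code on the linked branch: the class, the `SampleParser` interface, the line-break boundary rule, and the toy N-Triples check (standing in for a real RDF parser) are all hypothetical, and the `application/n-triples` key is used only as an example label.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Any23 or Tika.
final class ParseBasedDetectionSketch {
    // Stand-in for a real parser: throws if the sample is not valid.
    interface SampleParser {
        void parse(String sample) throws Exception;
    }

    // Cut the sample at the last line break within the limit so that a
    // statement is not truncated mid-line (falls back to a hard cut).
    static String extractSample(String content, int limit) {
        if (content.length() <= limit) {
            return content;
        }
        int cut = content.lastIndexOf('\n', limit - 1);
        return cut < 0 ? content.substring(0, limit) : content.substring(0, cut + 1);
    }

    // Return the first mime type whose parser accepts the sample, else null.
    static String detect(String content, int limit, Map<String, SampleParser> parsers) {
        String sample = extractSample(content, limit);
        for (Map.Entry<String, SampleParser> entry : parsers.entrySet()) {
            try {
                entry.getValue().parse(sample);
                return entry.getKey();
            } catch (Exception parseFailed) {
                // This format did not match; try the next parser.
            }
        }
        return null;
    }

    // Toy line-based check used in place of a real N-Triples parser.
    static void toyNTriples(String sample) {
        boolean sawStatement = false;
        for (String line : sample.split("\\r?\\n")) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue; // blank lines and comments are allowed
            }
            if (!t.matches("<[^>]*>\\s+<[^>]*>\\s+(<[^>]*>|\"[^\"]*\")\\s*\\.")) {
                throw new IllegalArgumentException("not a statement: " + t);
            }
            sawStatement = true;
        }
        if (!sawStatement) {
            throw new IllegalArgumentException("no statements found");
        }
    }

    static Map<String, SampleParser> defaultParsers() {
        Map<String, SampleParser> parsers = new LinkedHashMap<>();
        parsers.put("application/n-triples", ParseBasedDetectionSketch::toyNTriples);
        return parsers;
    }

    public static void main(String[] args) {
        String content = "# a comment with a period.\n"
                + "<http://example.org/s> <http://example.org/p> \"o\" .\n";
        System.out.println(detect(content, 1024, defaultParsers()));
    }
}
```

The trade-off is that a parse attempt costs more than a regex match, but comments and embedded periods are handled by the parser rather than by delimiter guessing.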
>
> Peter
>
> On 7 August 2012 21:34, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
>> Hi Peter,
>>
>> Firstly, thanks for the formal introduction; glad that you're now
>> officially on board.
>>
>> I've changed the thread topic slightly to discuss the work you have
>> done on your GitHub branch regarding the Tika upgrade. I see that you're
>> using Tika 1.1? Would it be possible to phase this into the existing
>> codebase before doing the module restructuring that we are currently
>> discussing elsewhere?
>>
>> I vaguely remember you saying that there were some problems with tests
>> or something (further to the Tika dependency upgrade) but I cannot
>> confirm this just now and it would be great if you could refresh my
>> memory.
>>
>> If we could review (with the intention to merge back into trunk) some
>> of your work more incrementally then I think we can phase it in
>> more quickly... does this make sense?
>>
>> Thanks very much
>> Lewis
>>
>> On Tue, Aug 7, 2012 at 1:09 AM, Peter Ansell <ansell.peter@gmail.com> wrote:
>>> Hi all,
>>>
>>> I am a software engineer with a PhD in Computer Science. I have worked
>>> on a number of RDF related projects since the start of my PhD, mainly
>>> using Sesame, and over the last few months I have also integrated
>>> Sesame with OWLAPI [1] to suit my current project's needs.
>>>
>>> I am looking in the short term to restructure the Maven modules inside
>>> of Any23 so that the different facets can be reused, tested and
>>> maintained easily, particularly with a view to using the RDF related
>>> Tika enhancements that the Any23 MIME Detector provides. I made these
>>> changes a few months ago in my GitHub fork [2], so feel free to review
>>> them closely to suggest enhancements before I actually start. I am not
>>> sure when I will next have time to clean up the patches. The first
>>> step that I want to take is to split out the test resources into a
>>> single module and switch from "src/test/resources/*" File based access
>>> in tests to using this.getClass().getResourceAsStream("*"). I have
>>> implemented those changes in my git repository but the patches may
>>> need cleaning up as I have not gone back to review them yet. After
>>> that is done, it will be relatively simple to split out both the
>>> packages and tests into separate modules.
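
The switch from File-based access to classpath lookup mentioned above might look roughly like this. It is a sketch under assumptions, not code from the fork: the helper class, its method name, and the resource path in the usage note are hypothetical. The point is that the resource is resolved through the class loader, so it can live in a separate test-resources module (or inside a jar) rather than at a fixed "src/test/resources" path.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name is not taken from Any23.
final class ClasspathResourceSketch {
    // Read a classpath resource fully as UTF-8 text.
    // A leading '/' in the name makes it absolute; otherwise it is
    // resolved relative to the package of the owner class.
    static String readResource(Class<?> owner, String name) throws IOException {
        try (InputStream in = owner.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null rather than throwing.
                throw new IOException("resource not found on classpath: " + name);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

In a test this could be called as `ClasspathResourceSketch.readResource(this.getClass(), "/some/test-file.nq")`, where the path is a placeholder for whatever the test-resources module actually ships.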
>>>
>>> In the short term I have also been tasked by the Sesame Developers
>>> with merging the Any23 and Sesametools NQuads parsers and integrating
>>> the resulting module into the Sesame Rio package. Then we can have a
>>> rock-solid, standards-based, NQuads parser/writer that everyone can
>>> easily reuse in a similar way to the other Rio parsers/writers. This
>>> is the culmination of the http://www.openrdf.org/issues/browse/SES-802
>>> issue that Michele opened over a year ago.
>>>
>>> Cheers,
>>>
>>> Peter
>>>
>>> [1] https://github.com/ansell/owlapi
>>> [2] https://github.com/ansell/any23
>>>
>>> On 4 August 2012 12:25, Mattmann, Chris A (388J)
>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>> Hi Folks,
>>>>
>>>> A while back, the Any23 PPMC and the Incubator PMC VOTEd to add Peter Ansell
>>>> to our ranks as a PPMC member and committer. Peter, welcome!
>>>>
>>>> Feel free to say a bit about yourself!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: chris.a.mattmann@nasa.gov
>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>
>>
>>
>> --
>> Lewis



-- 
Lewis
