incubator-any23-dev mailing list archives

From Michele Mostarda <michele.mosta...@gmail.com>
Subject Re: Upgrade to Tika 1.2 [WAS] Re: [ANNOUNCE] Welcome Peter Ansell as Any23 PPMC member and committer
Date Wed, 08 Aug 2012 09:12:35 GMT
Hi guys,

On 7 August 2012 13:34, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:

> Hi Peter,
>
> Firstly, thanks for the formal introduction; glad that you're now
> officially on board.
>
> I've changed the thread topic slightly to discuss what work you have
> done on your GitHub branch regarding the Tika upgrade? I see that you're
> using Tika 1.1? Would it be possible to phase this into the existing
> codebase before doing the module restructuring that we are currently
> discussing elsewhere?
>
> I vaguely remember you saying that there were some problems with tests
> or something (further to the Tika dependency upgrade) but I cannot
> confirm this just now and it would be great if you could refresh my
> memory.
>
> If we could review (with the intention to merge back into trunk) some
> of your work more incrementally then I think we can phase it in
> quicker... does this make sense?
>
> Thanks very much
> Lewis
>
> On Tue, Aug 7, 2012 at 1:09 AM, Peter Ansell <ansell.peter@gmail.com>
> wrote:
> > Hi all,
> >
> > I am a software engineer with a PhD in Computer Science. I have worked
> > on a number of RDF related projects since the start of my PhD, mainly
> > using Sesame, including also integrating Sesame with OWLAPI [1] over
> > the last few months to suit my current project's needs.
> >
> > I am looking in the short term to restructure the Maven modules inside
> > of Any23 so that the different facets can be reused, tested and
> > maintained easily, particularly with a view to using the RDF related
> > Tika enhancements that the Any23 MIME Detector provides. I made these
> > changes a few months ago in my GitHub fork [2], so feel free to review
> > them closely to suggest enhancements before I actually start. I am not
> > sure when I will next have time to clean up the patches. The first
> > step that I want to take is to split out the test resources into a
> > single module and switch from "src/test/resources/*" File based access
> > in tests to using this.getClass().getResourceAsStream("*"). I have
> > implemented those changes in my git repository but the patches may
> > need cleaning up as I have not gone back to review them yet. After
> > that is done, it will be relatively simple to split out both the
> > packages and tests into separate modules.
> >
> > In the short term I have also been tasked by the Sesame Developers
> > with merging the Any23 and Sesametools NQuads parsers and integrating
> > the resulting module into the Sesame Rio package. Then we can have a
> > rock-solid, standards-based, NQuads parser/writer that everyone can
> > easily reuse in a similar way to the other Rio parsers/writers. This
> > is the culmination of the http://www.openrdf.org/issues/browse/SES-802
> > issue that Michele opened over a year ago.
> >
>
Really good initiatives. The only thing I would stress is to avoid breaking
the support for IRIs in N-Quads [0] that is present in the current Any23
version of the parser.

I know it is not compliant with the N-Quads standard, but we introduced this
feature because Sindice [1] (which uses Any23 to distill RDF content from
collected pages) is constantly crawling a lot of N-Quads documents written
with IRI encoding.

What I suggest as a general approach is to add flags that either enforce
validation or just produce warnings when non-standard data is detected,
rather than not supporting non-standard data at all.
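
A rough sketch of what such a flag could look like. All names here
(LenientIriCheck, Mode, accept) are hypothetical and are not part of the
current Any23 or Sesame Rio API, and the ASCII-only test is only a stand-in
for real URI validation:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a strict/lenient switch; not the Any23 or Rio API. */
public class LenientIriCheck {

    /** STRICT rejects non-standard identifiers, WARN accepts them and records a warning. */
    public enum Mode { STRICT, WARN }

    private final Mode mode;
    private final List<String> warnings = new ArrayList<>();

    public LenientIriCheck(Mode mode) {
        this.mode = mode;
    }

    /** Simplified assumption: an identifier with any non-ASCII character is treated as an IRI. */
    public static boolean isAsciiOnly(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** Returns the identifier unchanged, or throws in STRICT mode when it is not ASCII-only. */
    public String accept(String identifier, int lineNumber) {
        if (!isAsciiOnly(identifier)) {
            String message = "line " + lineNumber
                    + ": non-ASCII character in <" + identifier + ">";
            if (mode == Mode.STRICT) {
                throw new IllegalArgumentException(message);
            }
            warnings.add(message);
        }
        return identifier;
    }

    public List<String> getWarnings() {
        return warnings;
    }

    public static void main(String[] args) {
        LenientIriCheck lenient = new LenientIriCheck(Mode.WARN);
        lenient.accept("http://example.org/caf\u00e9", 1);
        // The identifier was accepted; the warning is kept for the caller to report.
        System.out.println(lenient.getWarnings());
    }
}
```

With something like this, a crawler such as Sindice could run in WARN mode and
keep the data, while a strict consumer could select STRICT and fail on the
first non-standard identifier.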

I would also suggest pushing for an update of the standard that moves
N-Quads from URI to IRI support.
Richard, any advice about this?

All the best,
Mic

[0] http://sw.deri.org/2008/07/n-quads/
[1] http://sindice.com/


> > Cheers,
> >
> > Peter
> >
> > [1] https://github.com/ansell/owlapi
> > [2] https://github.com/ansell/any23
> >
> > On 4 August 2012 12:25, Mattmann, Chris A (388J)
> > <chris.a.mattmann@jpl.nasa.gov> wrote:
> >> Hi Folks,
> >>
> >> A while back, the Any23 PPMC and the Incubator PMC VOTEd to add Peter Ansell
> >> to our ranks as a PPMC member and committer. Peter, welcome!
> >>
> >> Feel free to say a bit about yourself!
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.a.mattmann@nasa.gov
> >> WWW:   http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
>
>
>
> --
> Lewis
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com
