incubator-any23-dev mailing list archives

From "Mattmann, Chris A (388J)"
Subject Re: Splitting up Any23 into a more modular format
Date Sat, 12 May 2012 16:34:15 GMT
Hi Peter,

Thanks for your help and for a detailed explanation of what you did!

I, for one, would be super supportive if you had time to figure out a way
to get it into Apache Any23. I'm sure the rest of the PPMC would be happy
and willing to work with you to develop JIRA issues, patches, etc. to
facilitate this.

Thank you again for your work!


On May 10, 2012, at 8:41 PM, Peter Ansell wrote:

> Hi all,
> Over the past two days I have split up Any23 into a variety of modules
> to make it easier to use different parts of the Any23 API. You can see
> the code at [1]. The current module list in the parent pom reactor
> looks like:
>  <modules>
>    <module>api</module>
>    <module>csvutils</module>
>    <module>encoding</module>
>    <module>mime</module>
>    <module>core</module>
>    <module>test-resources</module>
>    <module>extractor</module>
>    <module>cli</module>
>    <module>test</module>
>    <module>service</module>
>    <module>plugins/basic-crawler</module>
>    <module>plugins/html-scraper</module>
>    <module>plugins/office-scraper</module>
>    <module>plugins/integration-test</module>
>    <module>sources-dist</module>
>  </modules>
> None of the modules listed above core depend on core, and the core
> module depends only on the api module.
> The api module mostly contains interfaces, but it also contains the
> factory registries that are fully Service Provider Interface (SPI)
> driven: Any23PluginManager, and WriterFactoryRegistry, which I created
> to replace the hardcoded dependencies in WriterRegistry and the
> reflection/annotation code that isn't easy to extend outside of the
> core library. The ExtractorRegistry was too difficult to convert to
> SPI just yet, so I split it into an interface and an implementation
> (ExtractorRegistryImpl), with the interface in the api module and used
> in some APIs where the singleton was previously used. These
> registries, together with Rio RDFFormat for referencing RDF format
> information, seemed to be enough to remove the hardcoding that I have
> been discussing at
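
For readers unfamiliar with the SPI approach described in the quoted paragraph above, here is a minimal sketch using only java.util.ServiceLoader. It is not the Any23 code: WriterFactory, its two methods, and WriterFactoryRegistrySketch are hypothetical names chosen for illustration, and the real WriterFactoryRegistry may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical SPI contract (an assumption, not the actual Any23 interface).
interface WriterFactory {
    String getIdentifier();
    String getMimeType();
}

// Registry that discovers implementations instead of hardcoding them.
final class WriterFactoryRegistrySketch {
    private final Map<String, WriterFactory> factories = new LinkedHashMap<>();

    WriterFactoryRegistrySketch() {
        // ServiceLoader reads provider class names from
        // META-INF/services/<fully qualified interface name> on the classpath.
        // With no such file present, the loop body simply never runs.
        for (WriterFactory factory : ServiceLoader.load(WriterFactory.class)) {
            register(factory);
        }
    }

    // Manual registration, useful in tests or when no provider file exists.
    void register(WriterFactory factory) {
        factories.put(factory.getIdentifier(), factory);
    }

    // Returns null when no factory with that identifier was discovered.
    WriterFactory get(String identifier) {
        return factories.get(identifier);
    }
}
```

With this shape, a new writer module only needs to ship an implementation plus a provider file; the registry itself never names the implementation class.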
> The changes fit my purposes, as I can easily slot in the encoding and
> mime detection code without pulling in the core or extractor modules.
> The supported types for the mime detection include any formats I
> register with OpenRDF Rio, so it is extensible and modular.
> However, most of the changes are too large for easy patching and I
> didn't arrange the changes into nice patches throughout as I was not
> sure what was going to happen in the end. I have submitted two very
> small patches to that issue, but there could be many more eventually
> if the redesigned code is acceptable.
> Note, I also removed the Any23 NQuads implementation as it was missing
> Factory implementations for the writer and parser classes so it wasn't
> being picked up by Rio.createParser or any of the other static Rio
> methods. I replaced it with the NQuads implementation from Sesametools
> which includes these factories and so is recognised. When
> gets implemented both of
> these implementations will likely be deprecated anyway so it wasn't a
> major issue for me. I would suggest in either case splitting out the
> NQuads classes into a separate module and implementing a Factory for
> both the parser and writer so they are picked up by SPI.
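
The failure mode described in the quoted paragraph above (a parser class with no factory is never found by a format-based lookup) can be sketched as follows. All types here are hypothetical stand-ins, not the Sesame Rio or Sesametools classes, and the "parser" only counts non-blank, non-comment lines.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins (assumptions for illustration only).
interface QuadParser {
    int countStatements(String document);
}

interface QuadParserFactory {
    String getFormatName();
    QuadParser getParser();
}

// Toy parser: counts lines that are neither blank nor comments.
final class LineCountingParser implements QuadParser {
    @Override
    public int countStatements(String document) {
        int count = 0;
        for (String line : document.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                count++;
            }
        }
        return count;
    }
}

// The factory is what makes the parser discoverable by format name.
final class NQuadsParserFactorySketch implements QuadParserFactory {
    @Override public String getFormatName() { return "N-Quads"; }
    @Override public QuadParser getParser() { return new LineCountingParser(); }
}

final class ParserLookupSketch {
    private final Map<String, QuadParserFactory> byFormat = new HashMap<>();

    void add(QuadParserFactory factory) {
        byFormat.put(factory.getFormatName(), factory);
    }

    // No registered factory means no parser, even if a parser class exists.
    Optional<QuadParser> createParser(String formatName) {
        return Optional.ofNullable(byFormat.get(formatName))
                       .map(QuadParserFactory::getParser);
    }
}
```

A matching writer factory would follow the same pattern, which is why the suggestion above asks for both.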
> Some tests were already broken when I started, and a small number
> more broke along the way, including one that broke when I updated to
> Tika-1.1. They are temporarily ignored, but can be found easily by
> checking the ignored tests when running the test suite.
> I hope the changes are useful to others.
> If you want to suggest changes to my version on GitHub, feel free to
> open an issue or fork the repository and send a pull request back.
> Cheers,
> Peter
> [1]

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
