incubator-any23-dev mailing list archives

From Peter Ansell <ansell.pe...@gmail.com>
Subject Splitting up Any23 into a more modular format
Date Fri, 11 May 2012 06:41:40 GMT
Hi all,

Over the past two days I have split up Any23 into a variety of modules
to make it easier to use different parts of the Any23 API. You can see
the code at [1]. The current module list in the parent pom reactor
looks like:

  <modules>
    <module>api</module>
    <module>csvutils</module>
    <module>encoding</module>
    <module>mime</module>
    <module>core</module>
    <module>test-resources</module>
    <module>extractor</module>
    <module>cli</module>
    <module>test</module>
    <module>service</module>
    <module>plugins/basic-crawler</module>
    <module>plugins/html-scraper</module>
    <module>plugins/office-scraper</module>
    <module>plugins/integration-test</module>
    <module>sources-dist</module>
  </modules>

None of the modules listed above core depend on core, and the core
module depends only on the api module.

The api module mostly contains interfaces, but it also contains the
factory registries that are fully Service Provider Interface (SPI)
driven: Any23PluginManager and WriterFactoryRegistry, which I created
to replace the hardcoded dependencies in WriterRegistry and its
reflection/annotation code, which isn't easy to extend outside of the
core library. The ExtractorRegistry was too difficult to convert to
SPI just yet, so I split it into an interface and an implementation
(ExtractorRegistryImpl), with the interface in the api module and used
in some APIs where the singleton was previously used. These
registries, together with Rio RDFFormat for referencing RDF format
information, seemed to be enough to remove the hardcoding that I have
been discussing at https://issues.apache.org/jira/browse/ANY23-83 .

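As a rough illustration of what "SPI driven" means here, the following
is a minimal sketch using java.util.ServiceLoader. The names
WriterRegistrySketch, DemoWriterFactory and Registry are hypothetical
and are not the Any23 classes; the real WriterFactoryRegistry may look
quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class WriterRegistrySketch {

    /** Hypothetical SPI interface; something like it would live in the api module. */
    public interface DemoWriterFactory {
        String getIdentifier(); // short name used for lookup, e.g. "ntriples"
        String getMimeType();   // MIME type produced by the writer
    }

    /** Hypothetical registry that discovers factories instead of hardcoding them. */
    public static class Registry {
        private final List<DemoWriterFactory> factories = new ArrayList<>();

        /** Registers every provider that the classpath declares under META-INF/services. */
        public Registry loadFromClasspath() {
            for (DemoWriterFactory factory : ServiceLoader.load(DemoWriterFactory.class)) {
                register(factory);
            }
            return this;
        }

        /** Manual registration, useful in tests or when no provider file exists. */
        public void register(DemoWriterFactory factory) {
            factories.add(factory);
        }

        /** Returns the first factory with the given identifier, or null if none matches. */
        public DemoWriterFactory byIdentifier(String identifier) {
            for (DemoWriterFactory factory : factories) {
                if (factory.getIdentifier().equals(identifier)) {
                    return factory;
                }
            }
            return null;
        }

        public int size() {
            return factories.size();
        }
    }
}
```

With this shape, a plugin jar would only need to ship an
implementation plus a provider-configuration file under
META-INF/services named after the interface; the registry itself never
has to reference the implementing class.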
The changes fit my purposes, as I can easily slot in the encoding and
mime detection code without pulling in the core or extractor modules.
The supported types for the mime detection include any formats I
register with OpenRDF Rio, so the result is extensible and modular
enough for what I need.

However, most of the changes are too large for easy patching and I
didn't arrange the changes into nice patches throughout as I was not
sure what was going to happen in the end. I have submitted two very
small patches to that issue, but there could be many more eventually
if the redesigned code is acceptable.

Note, I also removed the Any23 NQuads implementation, as it was missing
Factory implementations for the writer and parser classes, so it wasn't
being picked up by Rio.createParser or any of the other static Rio
methods. I replaced it with the NQuads implementation from Sesametools,
which includes these factories and so is recognised. When
http://www.openrdf.org/issues/browse/SES-802 gets implemented, both of
these implementations will likely be deprecated anyway, so it wasn't a
major issue for me. In either case, I would suggest splitting out the
NQuads classes into a separate module and implementing a Factory for
both the parser and the writer so they are picked up by SPI.
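To show why a parser without a factory is invisible to a factory-based
lookup, here is a small self-contained sketch. ParserLookupSketch and
its nested interfaces are made up for illustration; this is not the
Rio API, only the general pattern of resolving a parser through a
registered factory.

```java
import java.util.HashMap;
import java.util.Map;

public class ParserLookupSketch {

    /** Hypothetical parser type; a real one would expose parse methods. */
    public interface DemoParser {
        String formatName();
    }

    /** Hypothetical factory; one of these must exist for a format to be discoverable. */
    public interface DemoParserFactory {
        String formatName();
        DemoParser newParser();
    }

    private final Map<String, DemoParserFactory> factories = new HashMap<>();

    /** Makes a format known to the lookup below. */
    public void register(DemoParserFactory factory) {
        factories.put(factory.formatName(), factory);
    }

    /** Creates a parser for the format, failing if no factory was ever registered. */
    public DemoParser createParser(String formatName) {
        DemoParserFactory factory = factories.get(formatName);
        if (factory == null) {
            throw new IllegalArgumentException(
                    "No parser factory registered for format: " + formatName);
        }
        return factory.newParser();
    }
}
```

A parser class that exists on the classpath but has no registered
factory simply never appears in the map, which matches the symptom
described above.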

Some tests were already broken when I started, and a small number more
broke along the way, including one that broke when I updated to
Tika-1.1. They are temporarily ignored, but can be found easily by
checking the ignored tests when running the test suite.

I hope the changes are useful to others.

If you want to suggest changes to my version on GitHub feel free to
open an issue or fork the repository and send a pull request back.

Cheers,

Peter

[1] https://github.com/ansell/any23
