any23-dev mailing list archives

From Lewis John Mcgibbney <>
Subject Re: Extraction of structure from non-XML based formats
Date Wed, 03 Jul 2013 19:43:32 GMT
Hi Peter,

On Wed, Jul 3, 2013 at 12:34 AM, <> wrote:

> I don't think there is much value in creating a pipeline structure inside
> of Any23 that doesn't use triples for interchange between pipeline stages.
> You may be able to come up with some reusable abstract classes to work with
> Tika more smoothly, but when it gets to emitting results from an extractor
> in Any23 I would recommend that you form the results into triples.

Yeah, I see. The power Any23 has here is that it does its job
(extracting vocabularies/structure) well. Unfortunately I am trending toward
the predictable, inherent human weakness of always wanting more! Yesterday I
was able to export tables from my PDFs into Excel and then run the
office-scraper plugin. This is really ad hoc, doesn't scale, and quite
frankly is a hellish workaround. There are, however, some improvements to
that plugin as a result, which I am sure you will have seen, so it's not all bad.
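
For what it's worth, here is a minimal sketch of what "forming the results
into triples" could look like for a table that has already been pulled out
of a PDF or spreadsheet. It is plain Java and deliberately uses no Any23 or
Sesame types: the Triple class, the rowsToTriples helper, and the
example.org IRIs are all made up for illustration, and it assumes the column
headers are already safe to use inside an IRI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableToTriples {

    /** Hypothetical subject/predicate/object holder; not an Any23 or Sesame class. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        /** Display-only, N-Triples-like rendering; literal values are not escaped. */
        @Override
        public String toString() {
            return "<" + subject + "> <" + predicate + "> \"" + object + "\" .";
        }
    }

    /**
     * Emits one triple per cell: the row index becomes the subject,
     * the column header becomes the predicate, the cell text becomes the object.
     * Assumes header values are IRI-safe (no spaces or reserved characters).
     */
    static List<Triple> rowsToTriples(String baseIri, String[] header, List<String[]> rows) {
        List<Triple> out = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            String subject = baseIri + "row/" + r;
            String[] cells = rows.get(r);
            // Ignore trailing cells that have no header, and headers with no cell.
            for (int c = 0; c < header.length && c < cells.length; c++) {
                out.add(new Triple(subject, baseIri + "column/" + header[c], cells[c]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] header = {"name", "qty"};
        List<String[]> rows = Arrays.asList(
                new String[] {"bolt", "4"},
                new String[] {"nut", "7"});
        for (Triple t : rowsToTriples("http://example.org/", header, rows)) {
            System.out.println(t);
        }
    }
}
```

A real extractor would hand these statements to whatever triple handler
Any23 provides rather than printing them, and would pick a proper
vocabulary for the predicates instead of inventing one from column names.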

> Also, Sesame-2.7 deprecated "stopAtFirstError" and "verifyDataType" in
> favour of ParserConfig.addNonFatalError(... <setting that should not fail
> parsing> ...) and BasicParserSettings.VERIFY_DATATYPE_VALUES (and other
> similar settings), respectively. Not sure how tightly they are linked into
> Any23 as it has been a while since I went in and looked, but I noticed them
> in the patch so I thought I should mention that.

Honestly, the patch I posted a while back was nothing more than the first
stage of this. It has occurred to me for some time that extending the Any23
model would be really useful for unstructured data. However, it is not
exactly clear to me how we would make this work, or whether it would just
confuse things here.
