any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility
Date Tue, 31 Oct 2017 10:03:00 GMT


ASF GitHub Bot commented on ANY23-280:

Github user jgrzebyta commented on the issue:
    @lewismc Regarding the new low-level interface: is a higher-level interface planned as well?
I mean something that could be used to build an RDF graph conforming to a custom ontology from
the raw RDF graph. For example, there is the csv -> rdf extractor, but in practice its low-level
RDF has to be converted into the final graph using at least one SPARQL CONSTRUCT query. I
thought it might be possible to do that through a programmable API. Unfortunately, the RDF4J QueryBuilder
API supports only simple queries.
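
To make the two-step idea concrete, here is a minimal plain-Java sketch of the second step only: rewriting "raw" triples (one predicate per CSV column) into a target vocabulary. It deliberately uses neither Any23 nor RDF4J. The class `RawToOntologyMapper`, its `Triple` type, and the `http://example.org/...` IRIs are all hypothetical names chosen for illustration, and the predicate map is only a stand-in for what a CONSTRUCT query would express.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Any23 or RDF4J.
class RawToOntologyMapper {

    /** A triple of plain strings, standing in for an RDF statement. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    /**
     * Rewrites raw triples into the target vocabulary.
     * Each map entry plays the role of one simple pattern such as
     *   CONSTRUCT { ?s <target> ?o } WHERE { ?s <raw> ?o }
     * Triples whose predicate has no mapping are dropped.
     */
    static List<Triple> map(List<Triple> raw, Map<String, String> predicateMap) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : raw) {
            String target = predicateMap.get(t.predicate);
            if (target != null) {
                out.add(new Triple(t.subject, target, t.object));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed shape of the low-level output of a csv -> rdf step.
        List<Triple> raw = Arrays.asList(
            new Triple("http://example.org/row/0", "http://example.org/column/name", "Alice"),
            new Triple("http://example.org/row/0", "http://example.org/column/internal_id", "17"));

        Map<String, String> predicateMap = new HashMap<>();
        predicateMap.put("http://example.org/column/name", "http://xmlns.com/foaf/0.1/name");

        for (Triple t : map(raw, predicateMap)) {
            System.out.println(t);
        }
    }
}
```

A real second stage would need more than predicate renaming (new nodes, joins across rows, typed literals), which is why a full CONSTRUCT query or a richer programmable API is being asked about here.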

> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>                 Key: ANY23-280
>                 URL:
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 2.2
> As discussed on ANY23-247, the [ContentExtractor|]
is simply not fit for purpose. This issue was discovered and the cause has plagued our builds
ever since. Any extractors which implement [BaseRDFExtractor|]
are based on the Extractor.ContentExtractor and hence work off an 'unfixed' raw data stream
as opposed to a more flexible model such as the [TagSoupDOMExtractor|].
> This issue should refactor RDF extractors to enable more flexibility and to avoid issues
we encounter with the strict SAX parsing logic.

