any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility
Date Tue, 31 Oct 2017 10:03:00 GMT

    [ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226543#comment-16226543
] 

ASF GitHub Bot commented on ANY23-280:
--------------------------------------

Github user jgrzebyta commented on the issue:

    https://github.com/apache/any23/pull/24
  
    @lewismc Regarding the new low-level interface: is a higher-level interface planned as well?
I mean something that could be used to build an RDF graph conforming to a custom ontology from
the raw RDF graph. For example, there is a CSV -> RDF extractor, but in practice that low-level
RDF has to be converted to the final graph using at least one SPARQL CONSTRUCT query. I
thought it might be possible to do that conversion through a programmable API. Unfortunately, the
RDF4J QueryBuilder API supports only simple queries.
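
To make the idea of a programmable raw-to-ontology conversion concrete, here is a minimal sketch in plain Java. It is not part of Any23 or RDF4J: the class `RawGraphMapper`, its methods, and the `http://example.org/...` IRIs are all hypothetical, and triples are modelled as simple three-element string lists rather than real RDF statements. It mimics what a query such as `CONSTRUCT { ?s foaf:name ?o } WHERE { ?s csv:name ?o }` would do: matched predicates are rewritten, everything else is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an Any23 or RDF4J class.
public class RawGraphMapper {

    // Maps raw (e.g. column-derived) predicate IRIs to target ontology predicate IRIs.
    private final Map<String, String> predicateMap = new HashMap<>();

    // Registers one predicate rewrite rule and returns this mapper for chaining.
    public RawGraphMapper map(String rawPredicate, String targetPredicate) {
        predicateMap.put(rawPredicate, targetPredicate);
        return this;
    }

    // Builds a new triple list. Each triple is [subject, predicate, object].
    // Triples whose predicate has no rule are omitted, much like a CONSTRUCT
    // template that only emits what its WHERE clause matched.
    public List<List<String>> apply(List<List<String>> rawTriples) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> t : rawTriples) {
            String target = predicateMap.get(t.get(1));
            if (target != null) {
                out.add(Arrays.asList(t.get(0), target, t.get(2)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed shape of raw extractor output; the IRIs are placeholders.
        List<List<String>> raw = Arrays.asList(
            Arrays.asList("http://example.org/row/0", "http://example.org/csv/name", "Alice"),
            Arrays.asList("http://example.org/row/0", "http://example.org/csv/internalId", "42"));

        List<List<String>> mapped = new RawGraphMapper()
            .map("http://example.org/csv/name", "http://xmlns.com/foaf/0.1/name")
            .apply(raw);

        for (List<String> t : mapped) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

A real higher-level interface would need far more than predicate renaming (joins, value transformations, new subjects), which is where a CONSTRUCT query or a richer builder API comes in.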


> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
>                 Key: ANY23-280
>                 URL: https://issues.apache.org/jira/browse/ANY23-280
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 2.2
>
>
> As discussed on ANY23-247, the [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
is simply not fit for purpose. This issue was discovered and the cause has plagued our builds
ever since. Any extractors which implement [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
are based on the Extractor.ContentExtractor and hence work off of an 'unfixed' raw data stream
as opposed to a more flexible model such as the [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to avoid issues
we encounter with the strict SAX parsing logic.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
