any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility
Date Wed, 23 Aug 2017 12:35:00 GMT


ASF GitHub Bot commented on ANY23-280:

Github user jgrzebyta commented on a diff in the pull request:
    --- Diff: api/src/main/java/org/apache/any23/extractor/ ---
    @@ -39,22 +38,6 @@
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor<InputStream> {
    --- End diff --
    @lewismc Why are you removing `ContentExtractor`? I assume that if the content is neither
HTML nor XML, the developer should create a new extractor extending `Extractor<Input>`
directly. Am I right?
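
To make the question concrete, here is a minimal sketch of an extractor that binds a generic `Extractor<Input>` directly to `InputStream` for content that is neither HTML nor XML. The `Extractor` interface below is a simplified stand-in written only for this illustration, not Any23's actual interface, and `PlainTextLineExtractor`, `extract`, and `ExtractorDemo` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a generic extractor contract.
// Assumption for illustration only: this is NOT the real Any23 interface.
interface Extractor<Input> {
    List<String> extract(Input in);
}

// Hypothetical extractor for plain-text content: it binds the type
// parameter to InputStream directly, with no intermediate ContentExtractor.
class PlainTextLineExtractor implements Extractor<InputStream> {
    @Override
    public List<String> extract(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed); // keep only non-blank lines
                }
            }
        } catch (IOException e) {
            // Wrap the checked exception to keep the sketch's interface small.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}

class ExtractorDemo {
    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
            "first line\n\n  second line  \n".getBytes(StandardCharsets.UTF_8));
        System.out.println(new PlainTextLineExtractor().extract(in));
    }
}
```

Whether the refactored API expects exactly this pattern is the open question for @lewismc; the sketch only shows what "extending directly" could look like.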

> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>                 Key: ANY23-280
>                 URL:
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 2.1
> As discussed on ANY23-247, the ContentExtractor is simply not fit for purpose. This issue was
> discovered and the cause has plagued our builds ever since. Any extractors which implement
> BaseRDFExtractor are based on Extractor.ContentExtractor and hence work off an 'unfixed' raw
> data stream, as opposed to a more flexible model such as the TagSoupDOMExtractor.
> This issue should refactor RDF extractors to enable more flexibility and to avoid the issues
> we encounter with the strict SAX parsing logic.
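
For readers unfamiliar with the strictness referred to above, the following sketch uses only the JDK's built-in SAX parser to show how markup that is not well-formed is rejected outright. It is an illustration of strict SAX behaviour in general, not of Any23's extractor code; `StrictSaxDemo` and `parsesStrictly` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictSaxDemo {
    // Returns true if the markup is accepted by a strict SAX parse,
    // false if the parser reports it as not well-formed.
    static boolean parsesStrictly(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Raised for input that is not well-formed, e.g. an unclosed tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesStrictly("<p>ok</p>"));
        // <br> is never closed, which is common in real-world HTML.
        System.out.println(parsesStrictly("<p>unclosed<br></p>"));
    }
}
```

A tolerant parser that repairs the input before building a DOM would accept the second string, which is the flexibility the issue description asks for.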

This message was sent by Atlassian JIRA
