any23-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ANY23-280) Restructure ContentExtractor to improve extraction flexibility
Date Sat, 02 Apr 2016 20:01:25 GMT
Lewis John McGibbney created ANY23-280:
------------------------------------------

             Summary: Restructure ContentExtractor to improve extraction flexibility
                 Key: ANY23-280
                 URL: https://issues.apache.org/jira/browse/ANY23-280
             Project: Apache Any23
          Issue Type: Improvement
          Components: core, extractors
    Affects Versions: 1.1
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
            Priority: Critical
             Fix For: 1.2


As discussed on ANY23-247, the [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
is simply not fit for purpose. The problem was discovered there, and its cause has plagued our
builds ever since. Any extractor that extends [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
is based on Extractor.ContentExtractor and hence works off an 'unfixed' raw data stream,
as opposed to a more flexible model such as the [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
This issue should restructure the RDF extractors to enable more flexibility and to avoid the
issues we encounter with the strict SAX parsing logic.
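
To illustrate the kind of failure meant by strict SAX parsing of an 'unfixed' raw stream, here is a
minimal sketch using only the JDK's own SAX parser. It is not Any23 code: the class name
StrictSaxSketch and the helpers parsesStrictly and stream are hypothetical, and the sample markup is
made up for the illustration. It only shows that a strict parser rejects markup that a
tag-repairing front end would typically tolerate.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Any23.
public class StrictSaxSketch {

    /** Returns true if a strict SAX parse of the raw stream succeeds, false on a parse error. */
    public static boolean parsesStrictly(InputStream raw) throws Exception {
        try {
            // DefaultHandler rethrows fatal errors, so malformed input surfaces as SAXException.
            SAXParserFactory.newInstance().newSAXParser().parse(raw, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    /** Wraps a string as a raw UTF-8 byte stream, standing in for fetched content. */
    public static InputStream stream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Well-formed markup passes the strict parse.
        System.out.println(parsesStrictly(stream("<p>ok</p>")));
        // An unclosed <br> is common in real-world HTML but is fatal for a strict parser.
        System.out.println(parsesStrictly(stream("<p>unclosed<br></p>")));
    }
}
```

An extractor that receives an already repaired DOM never sees the second kind of failure, which is
the flexibility this issue is asking for.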



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
