any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-324) Replace net.sourceforge.nekohtml with jsoup
Date Wed, 24 Jan 2018 11:09:00 GMT

    [ https://issues.apache.org/jira/browse/ANY23-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337404#comment-16337404 ]

ASF GitHub Bot commented on ANY23-324:
--------------------------------------

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/58
  
    @lewismc Yeah, I just realized that this PR fixes none of the issues we thought it would...
because the TagSoupParser is not what was causing the problem... the semargl parser is causing
the problem. Don't worry, I've got another PR coming shortly!


> Replace net.sourceforge.nekohtml with jsoup 
> --------------------------------------------
>
>                 Key: ANY23-324
>                 URL: https://issues.apache.org/jira/browse/ANY23-324
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>            Reporter: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.2
>
>
> A long-standing issue relates to the performance of the existing default [TagSoupParser.java|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java].
> There are a number of issues which now relate to limitations in the way nekohtml parses HTML5,
> for example [ANY23-317|https://issues.apache.org/jira/browse/ANY23-317], [ANY23-273|https://issues.apache.org/jira/browse/ANY23-273],
> [ANY23-267|https://issues.apache.org/jira/browse/ANY23-267]... there are several others.
> I propose to @Deprecate the TagSoupParser.java implementation for the next release (possibly
> making it configurable via default-configuration.properties). I also propose to replace it
> with https://jsoup.org/. AFAIK, Apache Tika also did this several years ago.
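
For illustration only, here is a minimal sketch of what "configurable via default-configuration.properties" could look like: a properties lookup that chooses between the legacy and the new HTML parser by name. The property key `any23.html.parser`, the class `HtmlParserSelection`, and the default value are all hypothetical and are not taken from the Any23 code base; the sketch uses only java.util.Properties and does not call nekohtml or jsoup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper: NOT part of Any23. It only shows the selection logic.
class HtmlParserSelection {

    // Assumed property key; the real Any23 key (if any) may differ.
    static final String PARSER_KEY = "any23.html.parser";

    // Assumed default when the key is absent from the configuration.
    static final String DEFAULT_PARSER = "jsoup";

    // Returns the configured parser name, trimmed and lower-cased.
    // Rejects anything other than the two names discussed in the issue.
    static String selectParser(Properties config) {
        String value = config.getProperty(PARSER_KEY, DEFAULT_PARSER)
                .trim()
                .toLowerCase(Locale.ROOT);
        switch (value) {
            case "jsoup":
            case "nekohtml":
                return value;
            default:
                throw new IllegalArgumentException("Unknown HTML parser: " + value);
        }
    }

    // Loads properties from text, standing in for reading a .properties file.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Explicitly opting back in to the deprecated parser.
        System.out.println(selectParser(load("any23.html.parser=nekohtml\n")));
        // No key set: falls back to the assumed default.
        System.out.println(selectParser(new Properties()));
    }
}
```

Keeping the old parser selectable by name would let users fall back during the deprecation period while the default moves to the replacement.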



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
