Date: Sun, 22 Apr 2012 07:19:29 +0000 (UTC)
From: "Hudson (JIRA)"
To: any23-dev@incubator.apache.org
Reply-To: any23-dev@incubator.apache.org
Message-ID: <524010868.2384.1335079170024.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1933402351.28153.1334572696897.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

    [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259032#comment-13259032 ]

Hudson commented on ANY23-76:
-----------------------------

Integrated in Any23-trunk #178 (See [https://builds.apache.org/job/Any23-trunk/178/])
    Improved HCardExtractor performances. Related to issue #ANY23-76 . (Revision 1328663)

     Result = UNSTABLE
mostarda :
Files :
* /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/DomUtils.java
* /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/HCardExtractor.java
* /incubator/any23/trunk/core/src/test/java/org/apache/any23/extractor/html/HCardExtractorTest.java
* /incubator/any23/trunk/core/src/test/resources/microformats/hcard/performance.html


> Improve runtime of the Microformat extractor on documents with many relations.
> ------------------------------------------------------------------------------
>
>                 Key: ANY23-76
>                 URL: https://issues.apache.org/jira/browse/ANY23-76
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Timothy Potter
>            Assignee: Michele Mostarda
>            Priority: Trivial
>         Attachments: MicroformatSpeed.patch
>
>
> For some large documents with many Microformat tuples, the extensive use of XPath in the DomUtils class causes Microformat extraction to be slow. I've marked this as trivial as it's a corner case.
> To reproduce the problem the patch addresses, run the Microformat extractor on the following URL:
> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations
> I include a patch that improves performance at the cost of code simplicity. I hope someone who is more involved in the project can decide if it's a good idea to use the patch or not, or maybe address this issue in another way.
> The patch replaces commonly used XPath queries with DOM tree traversals, e.g. getting all nodes with 'class' attributes. On my machine, the time to parse the given document is reduced from around 105 seconds to 14 seconds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
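
For readers who want to see the idea from the issue description in code, here is a minimal sketch of swapping an XPath query for a plain DOM tree traversal when collecting all nodes that carry a 'class' attribute. It is not the code from MicroformatSpeed.patch or from DomUtils: the class name ClassAttributeScan and the methods findByXPath, findByTraversal and parse are hypothetical, and only standard JDK XML APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not taken from Any23's DomUtils.
public class ClassAttributeScan {

    /** XPath variant: one query evaluated against the whole document. */
    public static List<Node> findByXPath(Document doc) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[@class]", doc, XPathConstants.NODESET);
        List<Node> result = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            result.add(hits.item(i));
        }
        return result;
    }

    /** Traversal variant: depth-first walk that visits each node once. */
    public static List<Node> findByTraversal(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(Node node, List<Node> out) {
        // Only element nodes can carry attributes.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && ((Element) node).hasAttribute("class")) {
            out.add(node);
        }
        // Recurse into children in document order.
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    /** Helper for the demo: parses a well-formed XML/XHTML string. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<div class='vcard'><span class='fn'>A</span><p>B</p></div>");
        System.out.println("XPath matches:     " + findByXPath(doc).size());
        System.out.println("Traversal matches: " + findByTraversal(doc).size());
    }
}
```

Both methods should return the same nodes. Whether the traversal is faster depends on the XPath engine and on how often the query is repeated, so the 105 to 14 second figure above should be read as the reporter's measurement, not something this sketch reproduces. The recursive walk also assumes the document is not nested deeply enough to exhaust the call stack.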