incubator-any23-dev mailing list archives

From "Michele Mostarda (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.
Date Sat, 21 Apr 2012 14:08:34 GMT

     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michele Mostarda resolved ANY23-76.
-----------------------------------

    Resolution: Fixed

Fixed @ r1328663.
                
> Improve runtime of the Microformat extractor on documents with many relations.
> ------------------------------------------------------------------------------
>
>                 Key: ANY23-76
>                 URL: https://issues.apache.org/jira/browse/ANY23-76
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Timothy Potter
>            Assignee: Michele Mostarda
>            Priority: Trivial
>         Attachments: MicroformatSpeed.patch
>
>
> For some large documents with many Microformat tuples, the extensive use of XPath in the
> DomUtils class causes Microformat extraction to be slow. I've marked this as trivial as it's
> a corner case.
> To reproduce the problem the patch addresses, run the Microformat extractor on the following
> URL:
> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations
> I include a patch that improves performance at the cost of code simplicity. I hope someone
> who is more involved in the project can decide whether it's a good idea to use the patch,
> or maybe address this issue in another way. The patch replaces commonly used XPath queries
> with DOM tree traversals, e.g. getting all nodes with 'class' attributes. On my machine the
> time to parse the given document is reduced from around 105 seconds to 14 seconds.
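
For readers without the patch at hand, the idea of swapping an XPath query for a DOM tree traversal can be sketched roughly as below. This is not the code from MicroformatSpeed.patch or from DomUtils; the class name ClassAttributeScan and its methods are hypothetical, and the sketch uses only the standard JAXP and org.w3c.dom APIs. Both methods return the elements that carry a 'class' attribute, in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration; not the actual Any23 DomUtils code.
class ClassAttributeScan {

    // Parses a small well-formed XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // XPath variant: every call compiles and evaluates an expression.
    static List<Node> findByXPath(Node root) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@class]", root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Traversal variant: a single pre-order walk over the tree.
    static List<Node> findByTraversal(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(Node node, List<Node> out) {
        // Only element nodes can carry attributes.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getAttributes().getNamedItem("class") != null) {
            out.add(node);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><a class='x'/><b><c class='y z'/></b><d/></root>");
        System.out.println("xpath=" + findByXPath(doc).size()
                + " traversal=" + findByTraversal(doc).size());
    }
}
```

The traversal is more code than the one-line XPath expression, which matches the trade-off described in the report: performance at the cost of code simplicity. The recursive walk assumes documents of ordinary nesting depth; an explicit stack would be the safer choice for very deep trees.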

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
