poi-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 51524] New: PapBinTable constructor is slow
Date Mon, 18 Jul 2011 14:52:43 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=51524

             Bug #: 51524
           Summary: PapBinTable constructor is slow
           Product: POI
           Version: 3.8-dev
          Platform: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: antoni.mylka@aduna-software.com
    Classification: Unclassified


The current (r1147828) constructor of the PapBinTable class does something
like:

List<PAPX> newPapxs = new LinkedList<PAPX>();
foreach character in docText
   do something
   List<PAPX> papxs = new LinkedList<PAPX>();
   foreach paragraph in paragraphs
      do something with papxs
   do something with papxs and newPapxs
set this.paragraphs to newPapxs

The problem is that the running time grows quadratically with the document
size. For instance, I have a document with 341742 paragraphs whose docText at
this point is 653186 characters long. I didn't have the patience to wait for
the constructor to finish.
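
To put rough numbers on this, here is a minimal Java sketch. It assumes the
inner loop really does visit every paragraph for every character, as the
pseudocode above suggests. The class and method names are hypothetical and are
not POI code; only the two sizes come from the document described above.

```java
// Hypothetical illustration, not POI code: compares the number of
// inner-loop steps for the nested loop shape and for a single pass.
public class LoopCost {

    // Nested shape from the pseudocode: each character scans all paragraphs.
    static long nestedSteps(long characters, long paragraphs) {
        return characters * paragraphs;
    }

    // Single-pass shape: each character and each paragraph is visited once.
    static long singlePassSteps(long characters, long paragraphs) {
        return characters + paragraphs;
    }

    public static void main(String[] args) {
        long characters = 653186L; // docText length from the report
        long paragraphs = 341742L; // paragraph count from the report
        System.out.println("nested:      " + nestedSteps(characters, paragraphs));
        System.out.println("single pass: " + singlePassSteps(characters, paragraphs));
    }
}
```

Under that assumption the nested shape needs on the order of 10^11 steps for
this document, against roughly 10^6 for a single pass.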

In 3.7 this constructor was much simpler: this.paragraphs was not transformed.
The new transformation introduced a performance regression. In an experiment
where we processed and indexed the content of some doc files, the time rose
from 9 to 54 minutes between 3.7 and 3.8.beta3.

The document mentioned above comes from the public govdocs dataset:

http://domex.nps.edu/corp/files/govdocs1/007/007488.doc

There is probably a good reason for the transformation, but the performance
regression is significant and the previous version seems to have worked well
enough. Maybe the transformation could be disabled with some switch or a
system property.
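
As a rough sketch of what such a switch might look like: the property name,
class and methods below are all invented for illustration and do not exist in
POI. The only real API used is Boolean.getBoolean, which returns true when the
named system property is set to "true".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not POI code: gates an expensive transformation
// behind a system property so callers can opt out of it.
public class RebuildSwitch {

    // Assumed property name, invented for this example only.
    static final String SKIP_PROPERTY = "example.hwpf.skipPapxRebuild";

    // The transformation stays enabled unless the property is set to "true".
    static boolean rebuildEnabled() {
        return !Boolean.getBoolean(SKIP_PROPERTY);
    }

    static List<String> load(List<String> rawParagraphs) {
        if (!rebuildEnabled()) {
            return rawParagraphs; // 3.7-style behaviour: keep the list as read
        }
        return rebuild(rawParagraphs);
    }

    // Stand-in for the real transformation; here it only copies the list.
    static List<String> rebuild(List<String> raw) {
        return new ArrayList<String>(raw);
    }
}
```

With something like this in place, the transformation could be skipped by
starting the JVM with -Dexample.hwpf.skipPapxRebuild=true.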


