poi-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 47742] New: The text extracted by WordExtractor is broken
Date Wed, 26 Aug 2009 14:42:13 GMT

           Summary: The text extracted by WordExtractor is broken
           Product: POI
           Version: 3.5-dev
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: r.w.online@gmx.org

--- Comment #0 from r.w.online@gmx.org 2009-08-26 07:42:12 PDT ---
Created an attachment (id=24169)
this JUnit3 test reproduces the bug, i.e. this test fails

We used the WordExtractor class to extract text from the attached Word
document.

Unfortunately, the extracted text differs from the text seen in Word.

More precisely, some paragraphs appear twice and some text appears in the
wrong position.

We tried to trace the error to a specific part of the document but could not
identify the part that causes it. It looks as if the length of the text or
certain Unicode characters trigger the error, but this is only a guess.

We attach a JUnit test case that reproduces the bug.

  ExtractTextFromWordDocumentTest.java - the JUnit3 test case
  test.doc - the MS Word document from which we cannot extract the text properly
  test-EXTRACTED-BY-POI-WordExtractor.txt - the text extracted by POI
  test-SAVED-BY-MS-WORD.txt - the text as it is recognized by MS Word
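
To narrow down which paragraphs are affected, the two text files can be
compared paragraph by paragraph. The following is only a minimal sketch and
not part of the attached test case: the class ParagraphDiff and its methods
are hypothetical names, it uses only the Java standard library (no POI calls),
and it assumes each paragraph occupies one line in both files. It reports
every paragraph that occurs more often in the extracted text than in the
reference text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of POI or of the attached JUnit test.
class ParagraphDiff {

    /** Counts non-empty paragraphs (assumed one per line) in first-seen order. */
    static Map<String, Integer> countParagraphs(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String paragraph = line.trim();
            if (!paragraph.isEmpty()) {
                counts.merge(paragraph, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns paragraphs that occur more often in 'extracted' than in 'reference'. */
    static List<String> extraParagraphs(String extracted, String reference) {
        Map<String, Integer> referenceCounts = countParagraphs(reference);
        List<String> extra = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countParagraphs(extracted).entrySet()) {
            if (e.getValue() > referenceCounts.getOrDefault(e.getKey(), 0)) {
                extra.add(e.getKey());
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Made-up sample data; real input would be the contents of the two .txt files.
        String reference = "Alpha\nBeta\nGamma\n";
        String extracted = "Alpha\nBeta\nAlpha\nGamma\n";
        System.out.println(extraParagraphs(extracted, reference)); // prints [Alpha]
    }
}
```

In practice the two strings would be read from
test-EXTRACTED-BY-POI-WordExtractor.txt and test-SAVED-BY-MS-WORD.txt. The
sketch only detects duplicated or extra paragraphs; text that merely moved to
a different position would need an order-aware comparison.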

