poi-dev mailing list archives

From bugzi...@apache.org
Subject [Bug 57463] OutOfMemeoryError while extracting text from DOCX files
Date Mon, 19 Jan 2015 13:21:06 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=57463

Nick Burch <apache@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Nick Burch <apache@gagravarr.org> ---
For XSSF, we have a low-level, SAX-plus-helper based way to extract text. It
takes more work to code against, but it keeps memory use low.

Currently, we haven't had any volunteers to work on one for XWPF / .docx.
Because the basic structure of a .docx file is more flexible than .xlsx, I
suspect it'll be a bit more work to do, but shouldn't be impossible. Please
head over to the dev list if you're interested in working on this!
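
To give a feel for what a streaming approach could look like for .docx, here
is a minimal sketch that uses only the JDK (java.util.zip plus SAX). It is not
POI's API and not an existing XWPF extractor; the class and method names are
made up for illustration. It assumes the main body lives in word/document.xml
and uses the transitional WordprocessingML namespace, and it ignores headers,
footers, footnotes, tabs and line breaks.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: streams text out of a .docx
// without building an in-memory object model of the document.
final class DocxSaxTextSketch {
    // Transitional WordprocessingML namespace (Strict files use a different one).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder out = new StringBuilder();

        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Stop the parser from closing the underlying zip stream.
                InputStream shielded = new FilterInputStream(zip) {
                    @Override public void close() { }
                };
                parser.parse(shielded, new DefaultHandler() {
                    private boolean inText;

                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) {
                            inText = true;   // entering <w:t>
                        }
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (!W_NS.equals(uri)) {
                            return;
                        }
                        if ("t".equals(local)) {
                            inText = false;  // leaving <w:t>
                        } else if ("p".equals(local)) {
                            out.append('\n'); // one line per paragraph
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) {
                            out.append(ch, start, length);
                        }
                    }
                });
                break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

Only the current SAX event and the output buffer are held in memory, so the
cost should grow with the amount of text rather than with the size of the XML.
A real extractor would need to handle much more of the format than this.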

Otherwise, I wonder if it might be possible to lazy-load some parts of files
like that one, to help keep the memory footprint down. Are you able to profile
it to work out which XML elements are taking the most space? (We'll need to
know which part they come from, e.g. word/styles.xml, and which XML element
within that part, e.g. w:rPr.)
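
As a cheap first pass before reaching for a heap profiler, one could count how
often each element occurs in each XML part of the package. The sketch below
does that with the JDK alone; the class and method names are hypothetical.
Element counts are only a proxy for memory use, since the real cost depends on
the objects built per element, but they can point at the parts and elements
worth a closer look.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: counts element occurrences
// per XML part of an OOXML package (e.g. a .docx file).
final class DocxPartProfileSketch {

    static Map<String, Map<String, Integer>> countElements(InputStream docx)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Map<String, Map<String, Integer>> byPart = new TreeMap<>();

        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Only look at XML parts and relationship files.
                if (!name.endsWith(".xml") && !name.endsWith(".rels")) {
                    continue;
                }
                final Map<String, Integer> counts = new TreeMap<>();
                // Keep the parser from closing the zip stream between entries.
                InputStream shielded = new FilterInputStream(zip) {
                    @Override public void close() { }
                };
                factory.newSAXParser().parse(shielded, new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        // Prefer the prefixed name (e.g. w:rPr) when available.
                        String key = qName.isEmpty() ? local : qName;
                        counts.merge(key, 1, Integer::sum);
                    }
                });
                byPart.put(name, counts);
            }
        }
        return byPart;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            countElements(in).forEach((part, counts) ->
                counts.forEach((element, n) ->
                    System.out.println(part + "\t" + element + "\t" + n)));
        }
    }
}
```

Running it against the problem file prints one tab-separated line per part and
element, which can be sorted by the count column to find the heaviest ones.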
