poi-dev mailing list archives

From Andreas Beeker <kiwiwi...@apache.org>
Subject Re: Test document Tika-792
Date Sun, 03 Dec 2017 21:15:20 GMT
There's also a third option (and probably more):
You could set an option (an enum or a boolean) in a ThreadLocal to decide how to output the data.
The default could be not to return deleted text, and Tika could change it when calling the
extractor.
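
A rough sketch of what I mean, just to make the idea concrete. None of these names exist in POI; DeletedTextOption, DeletedTextMode and the getText(live, deleted) stand-in are made up for illustration, and a real version would live inside the extractor classes:

```java
import java.util.function.Supplier;

public class DeletedTextOption {
    /** Hypothetical switch, not part of POI. */
    public enum DeletedTextMode { SKIP, INCLUDE }

    // Each thread sees its own value; the default is to skip deleted text.
    private static final ThreadLocal<DeletedTextMode> MODE =
            ThreadLocal.withInitial(() -> DeletedTextMode.SKIP);

    public static DeletedTextMode get() {
        return MODE.get();
    }

    /** Runs the action with the given mode, then restores the previous mode. */
    public static <T> T with(DeletedTextMode mode, Supplier<T> action) {
        DeletedTextMode previous = MODE.get();
        MODE.set(mode);
        try {
            return action.get();
        } finally {
            MODE.set(previous);
        }
    }

    /** Stand-in for an extractor's getText(); the two arguments are assumptions. */
    public static String getText(String live, String deleted) {
        return get() == DeletedTextMode.INCLUDE ? live + deleted : live;
    }

    public static void main(String[] args) {
        // Default behaviour: deleted text is left out.
        System.out.println(getText("kept", " removed"));
        // A caller such as Tika could opt in for the duration of one call.
        System.out.println(with(DeletedTextMode.INCLUDE, () -> getText("kept", " removed")));
    }
}
```

Restoring the previous value in the finally block matters when threads are pooled, otherwise the setting would leak into the next task on the same thread.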

Since the library would otherwise fill up with ThreadLocals, I wonder if we could have a
central configuration ThreadLocal (or similar) instance?
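
Something along these lines, perhaps. This is only a sketch under my own assumptions: ExtractionConfig and the "includeDeletedText" key are hypothetical names, and a real design would probably want typed keys rather than strings:

```java
import java.util.HashMap;
import java.util.Map;

public class ExtractionConfig {
    // One ThreadLocal for the whole library instead of one per option.
    private static final ThreadLocal<Map<String, Object>> SETTINGS =
            ThreadLocal.withInitial(HashMap::new);

    public static void set(String key, Object value) {
        SETTINGS.get().put(key, value);
    }

    /** Returns the stored boolean, or defaultValue if the key is unset or not a Boolean. */
    public static boolean getBoolean(String key, boolean defaultValue) {
        Object value = SETTINGS.get().get(key);
        return (value instanceof Boolean) ? (Boolean) value : defaultValue;
    }

    /** Drops all settings for the current thread. */
    public static void clear() {
        SETTINGS.remove();
    }

    public static void main(String[] args) {
        set("includeDeletedText", true);
        System.out.println(getBoolean("includeDeletedText", false));
        clear();
        System.out.println(getBoolean("includeDeletedText", false));
    }
}
```

Callers would be expected to clear() when they are done, so that settings do not survive on reused threads.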

Andi

On 12/3/17 1:34 AM, Mark Murphy wrote:
> I am working on Bug 61787, where documents containing rsidDel=000000 do not
> extract the correct text. The issue is that rsidxxx attributes are just
> there to indicate which revision a particular change belongs to; they do
> not necessarily indicate that a particular revision actually occurred. Bug
> 58067 corrected an issue where deleted text was being returned by the
> XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
> keying on the rsidDel attribute rather than the delText tag, which
> specifically marks deleted text.
>
> So I corrected this in XWPFParagraph and XWPFRun. Now a test on document
> Tika-792 is failing because it is expecting getText to return deleted text.
> So what do you want to do?
>
> The options as I see them are
>
>    1. Allow getText() to return all text, even deleted text, then add
>    another method to return only undeleted text.
>    2. Change getText() to return undeleted text, and add another method to
>    retrieve all text.
>
> I prefer the second option, and I suspect that the Tika test is not
> particularly valid, as its comment says its purpose is to include
> CTBookmark classes in ooxmlLite. Tim, do you have a preference here, as my
> change will likely affect you the most?
>
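
For comparison, the second option from the quoted list could look roughly like this. It is a simplified model, not the XWPF code: the Run class, its deleted flag (standing in for "this text came from a delText element") and getTextIncludingDeleted() are all hypothetical names:

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphTextSketch {
    /** Hypothetical run model; deleted is true when the text came from a delText element. */
    public static final class Run {
        final String text;
        final boolean deleted;

        public Run(String text, boolean deleted) {
            this.text = text;
            this.deleted = deleted;
        }
    }

    private final List<Run> runs;

    public ParagraphTextSketch(List<Run> runs) {
        this.runs = runs;
    }

    /** Option 2: getText() returns only undeleted text. */
    public String getText() {
        return join(false);
    }

    /** Hypothetical companion method that returns all text, deleted or not. */
    public String getTextIncludingDeleted() {
        return join(true);
    }

    private String join(boolean includeDeleted) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            if (includeDeleted || !run.deleted) {
                sb.append(run.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ParagraphTextSketch paragraph = new ParagraphTextSketch(Arrays.asList(
                new Run("Hello", false), new Run(" old", true), new Run(" world", false)));
        System.out.println(paragraph.getText());
        System.out.println(paragraph.getTextIncludingDeleted());
    }
}
```

The point of the sketch is that the decision keys on how the text is marked up (deleted or not), not on any revision identifier attached to the run.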


