pdfbox-dev mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Release 2.0.15 ?
Date Sat, 06 Apr 2019 15:19:09 GMT
I looked at about 10 files, and all of them are rotated. I suspect this is a 
result of PDFBOX-4480: previously, some rotated words came out as a single 
token. But this doesn't matter, because the overall extraction of rotated 
pages would still look bad.

For example, the file you mention extracted this in 2.0.14:

...
R
E
R
M
H
IV
-1
infection
hum
an(B
8)
[G
oulder97c]
...

So it had "infection" but the rest was still worthless. The same file 
extracts nicely with the "rotationMagic" option of ExtractText.
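
To illustrate why angle detection helps on such pages, here is a rough sketch of the general idea: derive a rotation angle for each glyph from its text matrix and collect the glyphs per angle, so that each angle can be extracted as if it were upright. This is not the ExtractText code. The Glyph class, its fields, and the method names are hypothetical, and the sketch assumes the a and b entries of each glyph's text matrix are already known.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AngleGrouping {
    /** Hypothetical glyph: its text plus the a and b entries of its text matrix. */
    static final class Glyph {
        final String text;
        final double a;
        final double b;

        Glyph(String text, double a, double b) {
            this.text = text;
            this.a = a;
            this.b = b;
        }
    }

    /** Rotation angle in whole degrees, normalized to the range 0..359. */
    static int angleDegrees(double a, double b) {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(b, a)));
        return (angle + 360) % 360;
    }

    /** Concatenates glyph texts per angle, keeping the input order within each angle. */
    static Map<Integer, String> groupByAngle(List<Glyph> glyphs) {
        Map<Integer, StringBuilder> buckets = new TreeMap<>();
        for (Glyph g : glyphs) {
            buckets.computeIfAbsent(angleDegrees(g.a, g.b), k -> new StringBuilder())
                   .append(g.text);
        }
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, StringBuilder> e : buckets.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input: two glyphs rotated by 90 degrees and one upright glyph.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("R", 0, 1),
                new Glyph("E", 0, 1),
                new Glyph("x", 1, 0));
        System.out.println(groupByAngle(glyphs));
    }
}
```

Once the glyphs are separated by angle, each group can be handled with the usual line and word logic, which otherwise assumes horizontal text.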

Tilman

Am 06.04.2019 um 15:50 schrieb Tim Allison:
> http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz
>
> This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
> there were no content differences btwn 2.0.13 and 2.0.14.  I did not
> apply angle detection.
>
> No new exceptions; 2 fixed exceptions.  We're getting higher page
> counts in a few documents, because we overrode processPages() to
> process.  Some changes in content, but overall, better, I think, based
> on contents/common_token_comparisons_by_mime.xlsx.
>
> To see where content appears to degrade, open
> contents/content_diffs_(no|with)_exceptions, and sort column M
> ('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order.  Also, look at
> columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
> (TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
> frequent tokens that are unique to A or unique to B; from this, it
> looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
> but, generally (hand waving), it appears that there were word
> segmentation problems in both A and B as I look at the results.
>
> Cheers,
>
>               Tim
>
> On Fri, Apr 5, 2019 at 10:53 AM Tim Allison <tallison@apache.org> wrote:
>> +1 I should have regression results by tomorrow
>>
>> On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
>>> +1
>>>
>>>> Am 05.04.2019 um 06:31 schrieb Andreas Lehmkuehler <andreas@lehmi.de>:
>>>>
>>>> Hi,
>>>>
>>>> looks like it's time for the next release. How about cutting 2.0.15 next
>>>> Monday?
>>>>
>>>> WDYT?
>>>>
>>>> Andreas
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
>>>>
>>>
>>>
>
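
The "top 10 most frequent tokens that are unique to A or unique to B" columns described in the quoted report can be approximated with a few lines of code. The sketch below is only an illustration of that comparison, not the code that produced the report: it assumes plain whitespace tokenization, and the class name, method names, and sample strings are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class UniqueTokenDiff {
    /** Counts whitespace-separated tokens (an assumed, simplistic tokenizer). */
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Returns up to 'limit' tokens that occur in a but not in b,
     * most frequent first, with ties broken alphabetically.
     */
    static List<String> topUnique(String a, String b, int limit) {
        Map<String, Integer> countsA = countTokens(a);
        Set<String> tokensB = countTokens(b).keySet();
        return countsA.entrySet().stream()
                .filter(e -> !tokensB.contains(e.getKey()))
                .sorted((x, y) -> {
                    int byCount = Integer.compare(y.getValue(), x.getValue());
                    return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
                })
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up extraction results for the same page from two versions.
        String a = "infection hum an(B 8) infection";
        String b = "infection human (B8)";
        System.out.println("unique to A: " + topUnique(a, b, 10));
        System.out.println("unique to B: " + topUnique(b, a, 10));
    }
}
```

Tokens such as "hum" and "an(B" showing up as unique to one side are the kind of word segmentation problem the comparison is meant to surface.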



