poi-dev mailing list archives

From Dominik Stadler <dominik.stad...@gmx.at>
Subject Re: Using CommonCrawl for POI regression-mass-testing
Date Thu, 14 Jan 2016 20:57:40 GMT
Hi,

wow, nice slides! My approach is not as sophisticated as yours; for now I am
focusing on finding regressions and catastrophic failures in POI, because
the large number of failures is hard to sort into actual failures and other
things. I think one of the next steps will be to filter out the obvious
cases, i.e. wrong mime-types and HTML pages, which seem to be quite common,
to see if I can get the list of actual failures down to a more manageable
size.
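
A minimal sketch of such a filter, assuming it is enough to look at the
first bytes of each file. The function name `sniff` and the returned labels
are made up for illustration; the two signatures are the well-known OLE2
and ZIP headers used by the binary and OOXML Office formats.

```python
# Hypothetical pre-filter: classify a file by its leading bytes so that
# HTML pages and other mislabeled downloads can be excluded from the stats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                       # .docx/.xlsx/.pptx container


def sniff(head: bytes) -> str:
    """Return 'ole2', 'zip', 'html' or 'unknown' for the start of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # HTML pages served with an Office mime-type: ignore leading whitespace
    # and compare case-insensitively.
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

Files reported as "html" or "unknown" could then be kept out of the failure
list before looking at the remaining exceptions.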

I did not know about the 1MB limit in CommonCrawl, but for the current
regression testing this is not a big issue: the files will likely simply
fail in both versions of POI. It might become interesting later on, but one
could try to re-download the file from the original source if it is
possible to detect that it was cut and it is still available at the
original URL...
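
One possible way to detect cut files, as a sketch only. It assumes that
"1MB" means exactly 1024 * 1024 bytes of payload, and that the record
headers, if available, may carry the `WARC-Truncated` field defined by the
WARC format. Whether a given crawl sets that field is not something this
thread confirms, and `looks_truncated` is a hypothetical helper name.

```python
from typing import Mapping, Optional

# Assumption: the limit mentioned in the thread is 1 MiB of payload.
ASSUMED_LIMIT = 1024 * 1024


def looks_truncated(payload_size: int,
                    warc_headers: Optional[Mapping[str, str]] = None) -> bool:
    """Heuristic: True if the record was probably cut by the crawler."""
    headers = {k.lower(): v for k, v in (warc_headers or {}).items()}
    # An explicit truncation marker wins over the size heuristic.
    if "warc-truncated" in headers:
        return True
    # Otherwise treat anything that reaches the assumed limit as suspicious.
    return payload_size >= ASSUMED_LIMIT
```

Files flagged this way would be the candidates for a re-download from the
original URL.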

I thought about possible code pieces to share. Part of my code is located in
the project https://github.com/centic9/CommonCrawlDocumentDownload, which I
enhanced to also download from newer crawls, not only the "old index" from
a few years back.

It's split into a multi-step process: first retrieving the list of URLs and
their positions in the crawl as a large JSON file, then using that
information to actually download the files.
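
To illustrate the second step, here is a sketch of turning one stored
position into an HTTP range request. It assumes each entry in the JSON file
has `filename`, `offset` and `length` fields; the actual layout used by
CommonCrawlDocumentDownload may differ, and `range_request` is a
hypothetical name.

```python
import json
from typing import Dict, Tuple


def range_request(entry_json: str) -> Tuple[str, Dict[str, str]]:
    """Return (archive file name, HTTP headers) for one index entry."""
    entry = json.loads(entry_json)
    # Positions are often stored as strings, so convert explicitly.
    offset = int(entry["offset"])
    length = int(entry["length"])
    # HTTP byte ranges are inclusive on both ends.
    last_byte = offset + length - 1
    return entry["filename"], {"Range": f"bytes={offset}-{last_byte}"}
```

With this, only the bytes of the single record need to be fetched instead
of the whole archive file.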

The processing with POI and the populating of the database are done in a
separate project which I have not published (yet). Again the handling is
done in multiple steps: first actually running the files against POI and
writing the results to a JSON file, then writing the results to the
database in a second step. This makes it possible to "fix" the database
writing without repeating the very lengthy processing (more than 12 hours
for the 180G worth of POI-relevant files on my laptop).
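
The second step could look roughly like the sketch below. Everything
specific in it is an assumption made for illustration: one JSON object per
line with `file`, `version`, `ok` and `error` fields, SQLite as the
database, and the helper names `load_results` and `regressions`. The
unpublished project may be structured quite differently.

```python
import json
import sqlite3
from typing import Iterable, List


def load_results(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Load JSON result lines into the database; safe to re-run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "file TEXT, version TEXT, ok INTEGER, error TEXT, "
        "PRIMARY KEY (file, version))"
    )
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the results file
        rec = json.loads(line)
        rows.append((rec["file"], rec["version"], int(rec["ok"]), rec.get("error")))
    # INSERT OR REPLACE lets the load be repeated after a fix without
    # re-running the expensive POI processing.
    conn.executemany("INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def regressions(conn: sqlite3.Connection, old: str, new: str) -> List[str]:
    """Files that processed fine in `old` but fail in `new`."""
    cur = conn.execute(
        "SELECT o.file FROM results o JOIN results n ON o.file = n.file "
        "WHERE o.version = ? AND n.version = ? AND o.ok = 1 AND n.ok = 0 "
        "ORDER BY o.file",
        (old, new),
    )
    return [row[0] for row in cur]
```

Because the JSON file stays on disk, the table layout or the queries can be
changed and the load repeated in minutes rather than hours.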

Dominik.


On Thu, Jan 14, 2016 at 4:57 PM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> Sweet!  Please feel free to make any use that you can out of [0].
>
> Y, I’m storing results in a db as well (h2) and using that to dump reports
> along the lines of [1]…note I’m using POI to generate xlsx files now ☺.
>
> Is there any way we could collaborate on the eval code?  My active dev
> (when I have a chance) is on the TIKA-1302 branch of my Tika github fork.
> The goal is to eventually contribute that as a tika-eval module.
>
> If you wanted access to our vm, I’d be more than happy to grant access so
> we can collaborate on the corpus and the eval stuff.
>
> Oh, as for Common Crawl, as you already know, in addition to the incorrect
> mime types, etc…one of the big things that’s been something to be aware of
> is that they truncate their files at 1MB, which is a big problem for file
> formats that tend to be bigger than that.  Are you pulling only
> non-truncated files?
>
> Again, this is fantastic!  What can we share/collaborate on?
>
> Cheers,
>
>            Tim
>
>
> [0]
> http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
> [1]
> https://issues.apache.org/jira/secure/attachment/12782054/reports_pdfbox_1_8_11-rc1.zip
>
> From: Dominik Stadler [mailto:dominik.stadler@gmx.at]
> Sent: Wednesday, January 13, 2016 2:09 PM
> To: POI Developers List <dev@poi.apache.org>
> Subject: Using CommonCrawl for POI regression-mass-testing
>
> Hi,
> FYI, I am playing with CommonCrawl data for some talk that I plan to do in
> 2016. As part of this I built a small framework to let me run the POI
> integrationtest-framework on a large number of documents that I extracted
> from a number of CommonCrawl-runs. This is somewhat similar to what Tim is
> doing for Tika, but it focuses on POI-related documents.
> I tried to use this as a huge regression-check, in this case I compared
> release 3.13 and 3.14-beta1. In the future I can fairly easily run this
> against newer versions to check for any new regressions.
>
>
> Some statistics:
> * Overall I processed 829356 POI-related documents
>
> * 687506 documents did process fine in both versions!
> * 140699 documents caused parsing errors in both versions. Many of these
> are actually invalid documents, wrong file-types, incorrect mime-types, ...
> so the actual error rate would be much lower, but it is currently not
> overly useful to look at these errors without first sorting out all the
> false-positives.
>
> * 845 documents failed in POI 3.13 and now work in 3.14-beta1, so we made
> more documents succeed now, yay!
>
> * And finally 306 documents did fail in POI-3.14-beta1 while they
> processed fine with POI-3.13.
>
>
> However these potential regressions have the following causes:
>
> ** approx. 280 of these were caused because we do more checks for HSLF now
> ** 19 were OOMs that happen in my framework with large documents due to
> parallel processing
> ** One document fails Date-parsing where I don't see how it did work
> before, maybe this is also caused by more testing now
> ** 5 documents failed due to the new support for multi-part formats and
> locale id
> ** One document showed an NPE in HSLFTextParagraph
>
> So only the last two look like actual regressions, I will commit fixes
> together with reproducing files for these two shortly.
>
> I store the results into a database, so I can query on the results in
> various ways:
>
> E.g. attached is the list of top 100 exception-messages for the failed
> files.
>
> Let me know if you would like to get a full stacktrace and document for
> any of those or if you have suggestions for additional queries/checks that
> we could add here!
>
> Dominik.
>
