commons-dev mailing list archives

From "Allison, Timothy B." <>
Subject [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?
Date Tue, 07 Apr 2015 12:48:55 GMT

  We just heard back from a very active member of Common Crawl.  I don’t want to clog up
our dev lists with this discussion (more than I have!), but I do want to invite all to participate
in the discussion, planning and potential patches.

  If you’d like to participate, please join us here:!topic/common-crawl/Cv21VRQjGN0

  I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the Subject
line.  Please invite others who might have an interest in this work.



From: Allison, Timothy B.
Sent: Tuesday, April 07, 2015 8:39 AM
To: 'Stephen Merity';
Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?


  Thank you very much for responding so quickly and for all of your work on Common Crawl.
 I don’t want to speak for all of us, but given the feedback I’ve gotten so far from some
of the dev communities, I think we would very much appreciate the chance to be tested on a
monthly basis as part of the regular Common Crawl process.

   I think we’ll still want to run more often in our own sandbox(es) on the slice of CommonCrawl
we have, but the monthly testing against new data, from my perspective at least, would be
a huge win for all of us.

   In addition to parsing binaries and extracting text, Tika (via PDFBox, POI and many others)
can also offer metadata (e.g. EXIF from images), which users of CommonCrawl might find of

  I’ll forward this to some of the relevant dev lists to invite others to participate in
the discussion on the common-crawl list.

  Thank you, again.  I very much look forward to collaborating.



From: Stephen Merity
Sent: Tuesday, April 07, 2015 3:57 AM
Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?

Hi Tika team!

We'd certainly be interested in working with Apache Tika on such an undertaking. At the very
least, we're glad that Julien has provided you with content to battle test Tika with!

As you've noted, the text extraction performed to produce WET files is focused primarily
on HTML files, leaving many other file types not covered. The existing text extraction is
quite efficient and part of the same process that generates the WAT file, meaning there's
next to no overhead. Performing extraction with Tika at the scale of Common Crawl would be
an interesting challenge. Running it as a once off wouldn't likely be too much of a challenge
and would also give Tika the benefit of a wider variety of documents (both well formed and
malformed) to test against. Running it on a frequent basis or as part of the crawl pipeline
would be more challenging but something we can certainly discuss, especially if there's strong
community desire for it!

On Fri, Apr 3, 2015 at 5:23 AM, <> wrote:
CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess
is that this is text stripping from text-y formats.  Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or supplementing the
current WET by using Tika to extract content from binary formats too (PDF, MS Word, etc.)?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302
on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
run Tika as part of its regular process and make the output available in one of your standard

CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies,
PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.



Stephen Merity
Data Scientist @ Common Crawl