poi-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject FW: Any interest in running Apache Tika as part of CommonCrawl?
Date Fri, 03 Apr 2015 14:28:49 GMT

What do you think?


On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web pages.  My guess
is that this is text stripping from text-y formats.  Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or in supplementing the
current WET by using Tika to extract content from binary formats too (PDF, MSWord, etc.)?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302)
on a Rackspace vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
run Tika as part of its regular process and make the output available in one of your standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies,
PDFBox, POI, etc.) could get the stacktraces to help prioritize bug fixes.


