any23-dev mailing list archives

From lewismc <...@git.apache.org>
Subject [GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector
Date Thu, 08 Nov 2018 17:55:27 GMT
Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/131
  
    You've brought up an excellent topic for conversation. Tika currently has a batch regression
job which lets them run over large numbers of documents and analyze the output.
As a result, they know how changes to the source affect Tika's ability to
do what it claims to do over time. We do not have that in Any23, but I think we should
make an effort to build bridges with the Tika community in this regard, with the aim of
sharing resources (both available computing to run large batch parse jobs and dataset(s)
we can run Any23 over).
    
    I have been thinking for the longest time about implementing a ```tika.triplify```
API which would encapsulate Any23 and run it on the Tika data streams, but I just never got around
to it. Maybe now is a good time to bring that idea back to life.
    
    I was thinking we could possibly use Common Crawl, but AFAIK they do not publish the raw data;
what is available is the Nutch segments or some alternative, e.g. the WebArchive files.

