tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject HUG talk on Public Terabyte Dataset project
Date Fri, 23 Apr 2010 15:43:54 GMT
Hi all,

Just wrote a blog post about the talk I gave on Wed night at the  
Hadoop Bay Area user group meetup:


Key points for Tika are:

1. Tika worked well for processing the resulting HTML.

2. The sample analysis we did, on the quality of Tika charset  
detection, showed there was considerable room for improvement.

Though it's possible the analysis itself has bugs - I'll be doing more  
work on this in the next week or so.

A few additional notes about the accuracy figures:

- As I mentioned in my talk, they assume that if the HTML page  
contains a meta tag with the charset specified, this is accurate.  
Which isn't always the case, though from past experience it's been  
pretty good.

- The original percentages for UTF-8 and 8859-1 were lower, because  
Tika sometimes reports us-ascii for pages declared as 8859-1 or UTF-8.  
Since that's actually correct if the page only contains 7-bit ASCII, I  
treated us-ascii as an alias for UTF-8/ISO-8859-1.
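
That alias handling can be sketched roughly like this (a hypothetical illustration, not the actual analysis code; the function and set names are mine):

```python
# Treat a detected charset of us-ascii as a match for UTF-8 or
# ISO-8859-1, since pure 7-bit ASCII text is valid in both encodings.
# Hypothetical sketch of the comparison described above.
ASCII_SUPERSETS = {"utf-8", "iso-8859-1"}

def charsets_match(detected, declared):
    d = detected.strip().lower()
    c = declared.strip().lower()
    if d == c:
        return True
    # us-ascii is a subset of these encodings, so count it as correct
    return d == "us-ascii" and c in ASCII_SUPERSETS
```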

- For some reason Tika often reports GB18030 as the charset for UTF-8  
pages. Given we were crawling sites with top US-based traffic, this  
seemed unlikely to be a valid result. I should try generating flipped  
results, where the charset is what Tika reports versus what the HTML  
page declares.

- We ignored all non-HTML files, and any HTML that didn't contain a  
meta tag with a valid charset name.
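
For the filtering step above, pulling the declared charset out of a meta tag looks roughly like this (a hypothetical sketch; pages where this returns nothing would be skipped):

```python
import re

# Hypothetical illustration of extracting a declared charset from an
# HTML meta tag, covering both the http-equiv Content-Type form and
# the bare charset attribute. Not the actual crawl code.
META_CHARSET = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(html):
    m = META_CHARSET.search(html)
    return m.group(1).lower() if m else None
```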

- We were using some not-yet-into-Tika code that cleans up some  
commonly broken charset names. This was based on data we generated a  
few months back, from a 5M page crawl.
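
The kind of cleanup that code does can be sketched as a simple normalization table (the example entries below are hypothetical stand-ins, not the actual table we generated from the crawl):

```python
# Map commonly broken or non-canonical charset names, of the sort seen
# in real crawl data, onto canonical names. Example entries only.
CHARSET_FIXES = {
    "utf8": "utf-8",
    "iso8859-1": "iso-8859-1",
    "latin-1": "iso-8859-1",
    "win-1252": "windows-1252",
}

def clean_charset_name(name):
    key = name.strip().lower()
    # Fall back to the (lowercased) input if we have no fix for it
    return CHARSET_FIXES.get(key, key)
```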

Feel free to comment on the post if you have any questions or  
comments. Thanks!

-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
