tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: A problem in the right-to-left languages
Date Tue, 01 Nov 2011 21:43:36 GMT
On Tue, 1 Nov 2011, Robert Muir wrote:
> Well as an alternative for them committing the ebcdic detection, perhaps 
> we could look at the Charset detection apis and propose some API 
> additions so that users (like Tika) can plug in custom detectors?

In theory it should be pluggable, but I seem to recall we needed to tweak a
few core bits to get the detector working (around negative matches for
control characters).

Looking at the svn version history, the ICU4J team don't appear to have
done any work on their charset detectors in several years. From the lack
of responses when I asked on their list about extending them, I fear there
may not be anyone left in their project who's interested in charset
detectors any more. I'd love to be proved wrong, though. Does anyone have
any personal contacts on the project they could prod about it?

Nick
