Date: Tue, 1 Nov 2011 21:43:36 +0000 (GMT)
From: Nick Burch
To: dev@tika.apache.org
Subject: Re: A problem in the right-to-left languages

On Tue, 1 Nov 2011, Robert Muir wrote:
> Well as an alternative for them committing the ebcdic detection, perhaps
> we could look at the Charset detection apis and propose some API
> additions so that users (like Tika) can plug in custom detectors?

In theory it should be pluggable, but I seem to recall we needed to tweak
a few core bits to get the detector working (around negative matches for
control characters).

Looking at the svn version history, the ICU4J team don't appear to have
done any work on their character detectors in several years. From the
lack of responses when I asked on their list about extending them, I fear
there may not be anyone left in their project who's interested in charset
detectors any more.

I'd love to be proved wrong though, if anyone has any personal contacts
on the project they could prod about it?

Nick