In-Reply-To: <8CE23A9B-4EE5-44B9-B06B-E9FA4046B393@transpac.com>
References: <8CE23A9B-4EE5-44B9-B06B-E9FA4046B393@transpac.com>
Date: Sun, 6 Nov 2011 20:30:19 +0330
Subject: Re: A problem in the right-to-left languages
From: Ahmad Ajiloo
To: dev@tika.apache.org

Hi

Did your inquiry produce a result?

On Wed, Nov 2, 2011 at 4:40 AM, Ken Krugler wrote:

> I know some of the original team members - I could ask.
>
> Are there specific questions, or just "is anybody still minding the fire"?
>
> -- Ken
>
> On Nov 1, 2011, at 2:43pm, Nick Burch wrote:
>
> > On Tue, 1 Nov 2011, Robert Muir wrote:
> >> Well, as an alternative to them committing the EBCDIC detection,
> >> perhaps we could look at the charset detection APIs and propose some
> >> API additions so that users (like Tika) can plug in custom detectors?
> >
> > In theory it should be pluggable, but I seem to recall we needed to
> > tweak a few core bits to get the detector working (around negative
> > matches for control characters).
> >
> > Looking at the svn version history, the ICU4J team don't appear to have
> > done any work on their character detectors in several years. From the
> > lack of responses when I asked on their list about extending them, I
> > fear there may not be anyone left in their project who's interested in
> > charset detectors any more. I'd love to be proved wrong though, if
> > anyone has any personal contacts on the project they could prod about it?
> >
> > Nick
>
> --------------------------
> Ken Krugler
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr