Date: Thu, 29 Apr 2010 20:26:19 +0500
Subject: HTML Text Extraction fails even with Jackrabbit 2.1
From: Jawad Bokhari
To: users@jackrabbit.apache.org

Hi All,

Jackrabbit fails to extract text from HTML files. HTML is listed as a supported format at http://jackrabbit.apache.org/jackrabbit-text-extractors.html, but extraction still does not work. Is there anything I can do to fix this? In fact, I have seen the same failure with every XML document I tried to add to the repository.

29.04.2010 20:21:51 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.html.HtmlParser@c81672
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:122)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.nio.charset.IllegalCharsetNameException:
    at java.nio.charset.Charset.checkName(Charset.java:273)
    at java.nio.charset.Charset.lookup2(Charset.java:458)
    at java.nio.charset.Charset.lookup(Charset.java:437)
    at java.nio.charset.Charset.forName(Charset.java:502)
    at org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
    at org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
    at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:110)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:166)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    ... 11 more

Thanks,
Bokhari
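
P.S. Reading the trace, the blank message after IllegalCharsetNameException suggests that an empty charset name reached Charset.forName, perhaps taken from an encoding declaration in the document. That is only a guess and I have not confirmed it. The sketch below reproduces that kind of failure in plain Java and shows one defensive way to look up a declared charset. The class CharsetLookupSketch and the method resolve are made-up names for illustration; they are not part of Tika or Jackrabbit.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetLookupSketch {

    // Hypothetical helper (not Tika or Jackrabbit API): resolve a declared
    // charset name, returning the fallback when the name is missing,
    // syntactically illegal, or not supported by this JVM.
    static Charset resolve(String declared, Charset fallback) {
        if (declared == null || declared.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(declared.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // An empty name is rejected by Charset.forName itself.
        try {
            Charset.forName("");
        } catch (IllegalCharsetNameException e) {
            System.out.println("empty name rejected, message='" + e.getMessage() + "'");
        }

        // The defensive lookup falls back instead of throwing.
        System.out.println(resolve("", StandardCharsets.ISO_8859_1));
        System.out.println(resolve("utf-8", StandardCharsets.ISO_8859_1));
    }
}
```

If the guess is right, the files that fail should be the ones whose declared encoding is empty or malformed, so checking the charset declaration in one of the failing HTML or XML files would be a quick way to test it.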