jackrabbit-users mailing list archives

From Jawad Bokhari <jawad.bokh...@gmail.com>
Subject HTML Text Extraction fails even with Jackrabbit 2.1
Date Thu, 29 Apr 2010 15:26:19 GMT
Hi All,

Jackrabbit 2.1 fails to extract text from HTML files, even though HTML is
listed as a supported format at
http://jackrabbit.apache.org/jackrabbit-text-extractors.html.
Is there anything I can do to fix this? I see the same failure for every XML
document I have tried to add to the repository.

29.04.2010 20:21:51 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.html.HtmlParser@c81672
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:122)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
        at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.nio.charset.IllegalCharsetNameException:
        at java.nio.charset.Charset.checkName(Charset.java:273)
        at java.nio.charset.Charset.lookup2(Charset.java:458)
        at java.nio.charset.Charset.lookup(Charset.java:437)
        at java.nio.charset.Charset.forName(Charset.java:502)
        at org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
        at org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:110)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:166)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
        ... 11 more
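
The "Caused by" line carries no charset name after the colon, which may indicate that the encoding handed to Charset.forName was an empty string. This is an inference from the trace, not something confirmed against the Tika source. The following is a minimal sketch of that failure mode and of a defensive lookup; the class CharsetLookupSketch and the helper safeCharset are hypothetical names for illustration and are not part of Tika or Jackrabbit.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookupSketch {

    // Hypothetical helper: returns the declared charset, or the fallback
    // when the declared name is missing, blank, malformed, or unsupported.
    static Charset safeCharset(String declared, Charset fallback) {
        if (declared == null || declared.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(declared.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // An empty name is rejected by the JDK itself, before any lookup.
        try {
            Charset.forName("");
        } catch (IllegalCharsetNameException e) {
            System.out.println("empty name rejected: " + e.getClass().getName());
        }

        Charset utf8 = Charset.forName("UTF-8");
        // Blank declaration falls back; a valid declaration is honored.
        System.out.println(safeCharset("", utf8));
        System.out.println(safeCharset("ISO-8859-1", utf8));
    }
}
```

If the affected documents do declare a blank or malformed encoding, for example in an HTML meta tag or in the jcr:encoding property, correcting that declaration would be one thing to try. Whether that applies here is an assumption.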


Thanks,
Bokhari
