Message-ID: <7639845.161301288519523242.JavaMail.jira@thor>
Date: Sun, 31 Oct 2010 06:05:23 -0400 (EDT)
From: "Uwe Schindler (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2731) HyphenationCompoundWordTokenFilter fails to load DTD in Crimson parser (JDK 1.4)
In-Reply-To: <10001301.161291288519312393.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/LUCENE-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926689#action_12926689 ]

Uwe Schindler commented on
LUCENE-2731:
---------------------------------------

By the way, the whole loading in the TokenFilter is broken! XML files should never be loaded by a Reader, always by an InputStream. Charset detection is part of the XML spec, and the default is UTF-8 unless the XML file declares otherwise. The HyphenationCompoundWordTokenFilter hard-codes ISO-8859-1 for the reader, so it is never possible to load XML files that use a different charset. This is a separate issue, which I will open for 3.x and trunk (as it needs an API change).

> HyphenationCompoundWordTokenFilter fails to load DTD in Crimson parser (JDK 1.4)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-2731
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2731
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9.4
>
>         Attachments: LUCENE-2731.patch
>
>
> HyphenationCompoundWordTokenFilter loads the DTD in its XML parser from memory by supplying an EntityResolver. In Java 1.4 (this affects Lucene 2.9, but also later versions if Apache Xerces is not used as the XML parser) this does not work, because Crimson does not even ask the entity resolver if no base URI is known. As the hyphenation file is loaded from a Reader/InputStream, no base URI is known. Crimson needs at least a non-null systemId to proceed.
> This patch (Lucene 2.9 only) fakes this by supplying a fake systemId to the InputSource.
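
For illustration only, here is a minimal sketch of the two points above. It is not the attached patch. It assumes a standard JAXP parser; the class name `XmlStreamLoadingSketch`, the one-line in-memory DTD, the `root` element, and the `urn:java:...` systemId value are all hypothetical. The parser is given raw bytes, so it can read the charset from the XML declaration, and the InputSource carries a non-null systemId, so a parser that needs a base URI still has one before it consults the EntityResolver.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlStreamLoadingSketch {

    // Hypothetical stand-in for the real hyphenation DTD, held in memory.
    static final String DTD = "<!ELEMENT root (#PCDATA)>";

    // Parse from raw bytes so the parser itself detects the charset
    // from the XML declaration instead of a fixed Reader encoding.
    public static Document parse(InputStream in) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Serve every requested external entity (here: the DTD) from memory.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader(DTD)));

        InputSource source = new InputSource(in);
        // Assumed fake value: any non-null systemId gives the parser a base URI.
        source.setSystemId("urn:java:" + XmlStreamLoadingSketch.class.getName());
        return builder.parse(source);
    }

    // Build a small document that declares the given charset and encode it that way.
    static byte[] sample(Charset cs) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + cs.name() + "\"?>\n"
            + "<!DOCTYPE root SYSTEM \"hyphenation.dtd\">\n"
            + "<root>Fu\u00DFball</root>";
        return xml.getBytes(cs);
    }

    public static void main(String[] args) throws Exception {
        // The same code path handles both encodings; no charset is hard-coded.
        for (Charset cs : new Charset[] {
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1}) {
            Document doc = parse(new ByteArrayInputStream(sample(cs)));
            String text = doc.getDocumentElement().getTextContent();
            System.out.println(cs.name() + " parsed correctly: "
                + "Fu\u00DFball".equals(text));
        }
    }
}
```

Wrapping the same bytes in a Reader with a fixed ISO-8859-1 charset would decode the UTF-8 variant incorrectly, which is the problem described in the comment.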