Message-ID: <21869752.1166666482342.JavaMail.jira@brutus>
Date: Wed, 20 Dec 2006 18:01:22 -0800 (PST)
From: "Grant Ingersoll (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-589) Demo HTML parser doesn't work for international documents
In-Reply-To: <7463835.1149693930727.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ http://issues.apache.org/jira/browse/LUCENE-589?page=all ]

Grant Ingersoll updated LUCENE-589:
-----------------------------------

     Issue Type: Improvement  (was: Bug)
    Description:
JavaCC assumes ASCII, so it won't work with, say, Japanese
documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:

Add the following line marked with a + to HTMLParser.jj:

options {
  STATIC = false;
  OPTIMIZE_TOKEN_MANAGER = true;
  //DEBUG_LOOKAHEAD = true;
  //DEBUG_TOKEN_MANAGER = true;
+ UNICODE_INPUT = true;
}

  was:
JavaCC assumes ASCII, so it won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:

Add the following line marked with a + to HTMLParser.jj:

options {
  STATIC = false;
  OPTIMIZE_TOKEN_MANAGER = true;
  //DEBUG_LOOKAHEAD = true;
  //DEBUG_TOKEN_MANAGER = true;
+ UNICODE_INPUT = true;
}

       Priority: Minor  (was: Major)

Decrease priority and mark as an improvement, since it only affects the demo. Also, I'm not sure we need to support other languages, as this code should not be used in production anyway.

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: http://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Priority: Minor
>
> JavaCC assumes ASCII, so it won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> + UNICODE_INPUT = true;
> }

-- 
This message is automatically generated by JIRA.
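For illustration only, here is a rough sketch of the "read the charset from the HTML markup" idea that the description calls tricky. It is not part of the proposed patch or of the Lucene demo. The class name `CharsetSniffer`, the method `sniff`, the 4096-byte scan limit, and the regular expression are all assumptions made for this example. It only looks for a `charset=` value inside a `<meta>` tag near the start of the document and ignores BOMs, HTTP headers, and malformed markup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: guesses an HTML document's charset.
public class CharsetSniffer {
    // Matches charset=... inside a <meta> tag. Covers both
    // <meta charset="x"> and <meta http-equiv=... content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Only the head of the document is scanned; 4096 is an arbitrary choice.
    private static final int SNIFF_LIMIT = 4096;

    public static Charset sniff(byte[] html, Charset fallback) {
        int len = Math.min(html.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // cannot fail and ASCII markup stays readable for the regex.
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page, StandardCharsets.ISO_8859_1));
    }
}
```

The detected charset could then be used to build a `java.io.InputStreamReader` over the same bytes before parsing, assuming the parser is fed from a `Reader`. Falling back to a fixed default when no declaration is found matches the "assume Unicode for now" approach in the description.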