Return-Path:
Delivered-To: apmail-lucene-dev-archive@www.apache.org
Received: (qmail 65980 invoked from network); 5 Nov 2010 07:45:35 -0000
Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 5 Nov 2010 07:45:35 -0000
Received: (qmail 1823 invoked by uid 500); 5 Nov 2010 07:46:05 -0000
Delivered-To: apmail-lucene-dev-archive@lucene.apache.org
Received: (qmail 1415 invoked by uid 500); 5 Nov 2010 07:46:03 -0000
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@lucene.apache.org
Delivered-To: mailing list dev@lucene.apache.org
Received: (qmail 1405 invoked by uid 99); 5 Nov 2010 07:46:02 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Nov 2010 07:46:02 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.22] (HELO thor.apache.org) (140.211.11.22) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Nov 2010 07:46:01 +0000
Received: from thor (localhost [127.0.0.1]) by thor.apache.org (8.13.8+Sun/8.13.8) with ESMTP id oA57jfXB001290 for ; Fri, 5 Nov 2010 07:45:41 GMT
Message-ID: <24363715.26951288943141359.JavaMail.jira@thor>
Date: Fri, 5 Nov 2010 03:45:41 -0400 (EDT)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Resolved: (LUCENE-589) Demo HTML parser doesn't work for international documents
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/LUCENE-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-589.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.1

Committed revision 1031460, 1031462 (3x)

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-589.patch
>
>
> JavaCC assumes ASCII input, so the generated parser won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
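
For readers of the archive, here is a minimal sketch of why the quoted issue matters. It is not code from the attached patch or from the demo: the class name `CharsetDemo` and both helper methods are hypothetical, and the sample text is an assumption chosen for illustration. It shows that Japanese text, once decoded with the right charset, consists of characters above the 8-bit range, which is the input a grammar built without `UNICODE_INPUT = true` is not set up to handle. It also shows that decoding the same bytes with a single-byte charset silently produces different text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Lucene or the LUCENE-589 patch.
class CharsetDemo {

    /** Decodes raw bytes through a Reader, the way a parser would receive its input. */
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Returns true if any char in the string lies outside the 8-bit range (above 0xFF). */
    static boolean hasCharsAbove8Bit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample text (an assumption for this sketch): the three characters of "Japanese language".
        String original = "\u65E5\u672C\u8A9E";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = decode(utf8, StandardCharsets.UTF_8);
        String decodedAsLatin1 = decode(utf8, StandardCharsets.ISO_8859_1);

        // Correct decoding restores the text, and every char is beyond 8 bits.
        System.out.println(decodedAsUtf8.equals(original));        // true
        System.out.println(hasCharsAbove8Bit(decodedAsUtf8));      // true

        // Single-byte decoding yields one char per byte: 9 chars, none above 0xFF.
        System.out.println(decodedAsLatin1.length());              // 9
        System.out.println(hasCharsAbove8Bit(decodedAsLatin1));    // false
    }
}
```

The sketch only covers the decoding side. As the issue description notes, picking the right charset for a given HTML document (for example from its markup) is a separate and trickier problem that the one-line grammar change does not address.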