lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (LUCENE-589) Demo HTML parser doesn't work for international documents
Date Fri, 05 Nov 2010 07:27:42 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-589:
----------------------------------

    Assignee: Robert Muir

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Assignee: Robert Muir
>            Priority: Minor
>
> JavaCC assumes ASCII, so the demo HTML parser won't work with, say, Japanese documents. Ideally it would
> read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would
> do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }
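
To illustrate why an ASCII assumption breaks international documents, here is a minimal Java sketch of the underlying decoding problem. It is not part of the patch and does not exercise JavaCC or HTMLParser.jj; the class name CharsetDecodeSketch and the helper decode are hypothetical names chosen for illustration. It only shows that the same bytes round-trip when decoded as UTF-8 but not when decoded as US-ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of the Lucene demo.
public class CharsetDecodeSketch {

    // Decode raw bytes into a String through a Reader that uses the given charset.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A tiny HTML fragment containing three Japanese characters.
        String original = "<title>\u65E5\u672C\u8A9E</title>";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in preserves the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original));

        // Decoding as US-ASCII replaces every non-ASCII byte, so the text is lost.
        System.out.println(decode(utf8, StandardCharsets.US_ASCII).equals(original));
    }
}
```

The Reader handed to a parser has to be built with the document's actual charset; the UNICODE_INPUT option proposed above concerns how the generated parser treats the characters it then receives.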

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

