lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-589) Demo HTML parser doesn't work for international documents
Date Thu, 21 Dec 2006 02:01:22 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-589?page=all ]

Grant Ingersoll updated LUCENE-589:
-----------------------------------

     Issue Type: Improvement  (was: Bug)
    Description: 
JavaCC assumes ASCII input, so the demo parser won't work with, say, Japanese documents. Ideally
it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode
would do the trick:

Add the following line marked with a + to HTMLParser.jj:

options {
  STATIC = false;
  OPTIMIZE_TOKEN_MANAGER = true;
  //DEBUG_LOOKAHEAD = true;
  //DEBUG_TOKEN_MANAGER = true;
+  UNICODE_INPUT = true;
}


       Priority: Minor  (was: Major)

Decreased the priority and marked this as an improvement, since it only affects the demo. Also,
I'm not sure we need to support other languages, as this code should not be used in production anyway.

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: http://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Priority: Minor
>
> JavaCC assumes ASCII input, so the demo parser won't work with, say, Japanese documents. Ideally
> it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode
> would do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }
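
The description notes that reading the charset from the HTML markup would be the ideal fix but is tricky. As a rough illustration of that idea only, here is a minimal sketch that looks for a charset declaration in a <meta> tag and falls back to a default when none is found. CharsetSniffer, sniff and open are hypothetical names, not part of the Lucene demo, and the sketch ignores byte order marks, HTTP headers, and UTF-16 documents (whose markup is not byte-wise ASCII). The resulting Reader could then presumably be handed to the regenerated parser, assuming it exposes a Reader-based constructor as JavaCC-generated parsers normally do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Lucene demo. */
public class CharsetSniffer {
    // Matches charset=NAME inside a <meta> tag, for example
    // <meta charset="utf-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the first 4096 bytes, or the fallback. */
    public static Charset sniff(byte[] html, Charset fallback) {
        int n = Math.min(html.length, 4096);
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup
        // survives this provisional decoding whatever the real charset is.
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    /** Wraps the bytes in a Reader that decodes with the sniffed charset. */
    public static Reader open(byte[] html) {
        return new InputStreamReader(
                new ByteArrayInputStream(html),
                sniff(html, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // The body is the word "Japanese" written in Japanese (three characters).
        String doc = "<html><head><meta charset=\"utf-8\"></head>"
                + "<body>\u65e5\u672c\u8a9e</body></html>";
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);

        StringBuilder decoded = new StringBuilder();
        try (Reader reader = open(bytes)) {
            for (int c; (c = reader.read()) != -1; ) {
                decoded.append((char) c);
            }
        }
        // The non-ASCII characters round-trip intact.
        System.out.println(decoded.toString().equals(doc)); // prints "true"
    }
}
```

Note that a Reader with the right charset is only half of the fix: the UNICODE_INPUT option still has to be set and the parser regenerated from HTMLParser.jj for the change to take effect.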

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

