lucene-dev mailing list archives

From Murray Altheim <m.alth...@open.ac.uk>
Subject Re: encoding of german analyzer source files
Date Fri, 26 Nov 2004 22:40:00 GMT
I wrote:
[...]
> You can sniff the first few bytes (which is what is recommended
> in the XML 1.0 spec, you can see how they do it there), but making
> such an assumption may lead to program failure if the assumption
> is incorrect.
> 
>    Extensible Markup Language (XML) 1.0 (Third Edition)
>    Appendix F Autodetection of Character Encodings
>    http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
> 
> The suggestions there are pretty usable for files that have nothing
> to do with XML.

I neglected to mention that the XML method relies on the file
beginning with "<?xml". In the case of source files for the Lucene
project, the beginning of a file is likely to be one of these five:

    "package..."
    "..."     (some form of whitespace)
    "/*"      (the beginning of an Apache License)
    "<html"   (beginning of HTML file)
    "<!DOCTYPE" (beginning of HTML file)

It wouldn't be too hard to write a sniffer for this. I think nearly
all of the Lucene source starts with "package", and any file that
doesn't certainly could.
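
A minimal sketch of such a sniffer, in Java, under some loud assumptions:
the class name EncodingSniffer, the method sniff(), and the "ASCII-compatible"
return label are all made up for illustration, and the prefix list is just
the one from this message (leading whitespace is not handled). It checks
for a byte order mark first, then looks for the expected prefixes as they
would appear in UTF-16 or in an ASCII-compatible encoding, much as
Appendix F does for "<?xml".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class EncodingSniffer {
    // Expected file beginnings, taken from the list in this message.
    private static final String[] PREFIXES = {"package", "/*", "<html", "<!DOCTYPE"};

    /** Returns a guessed encoding name, or null if nothing matched. */
    static String sniff(byte[] head) {
        // 1. Byte order marks.
        if (startsWith(head, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF})) return "UTF-8";
        if (startsWith(head, new byte[] {(byte) 0xFE, (byte) 0xFF})) return "UTF-16BE";
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xFE})) return "UTF-16LE";

        // 2. No BOM: look for a known prefix in each candidate encoding.
        for (String p : PREFIXES) {
            if (startsWith(head, p.getBytes(StandardCharsets.UTF_16BE))) return "UTF-16BE";
            if (startsWith(head, p.getBytes(StandardCharsets.UTF_16LE))) return "UTF-16LE";
            if (startsWith(head, p.getBytes(StandardCharsets.US_ASCII))) return "ASCII-compatible";
        }
        return null; // unknown: the caller has to fall back to some default
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that an ASCII prefix without a BOM only tells you the file is in some
ASCII-compatible encoding. It cannot distinguish ISO-8859-1 from UTF-8,
since "package" is the same bytes in both, so the guess can still be wrong
for the non-ASCII characters later in the file.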

In grepping through the source I noted nine instances of a lowercase
use of "<!doctype", which isn't valid. This should probably be registered
as a bug. Kinda makes me wonder what's generating that, because when
I run javadoc on my own stuff this doesn't happen.

org/apache/lucene/util/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/index/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/store/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/queryParser/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/search/spans/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/search/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/document/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/standard/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
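
For anyone who would rather check this from Java than with grep, here is a
rough sketch. The class and method names are hypothetical, and it only
looks at the DOCTYPE keyword at the start of the text, not at the case of
the public identifier that follows it.

```java
// Hypothetical helper, not part of Lucene.
final class DoctypeCheck {
    private static final String KEYWORD = "<!DOCTYPE";

    /**
     * True if the text (ignoring leading whitespace) starts with a DOCTYPE
     * declaration whose keyword is not written in uppercase.
     */
    static boolean hasNonUppercaseDoctype(String text) {
        String t = text.trim();
        // Case-insensitive match on the keyword, then reject the uppercase form.
        boolean isDoctype = t.regionMatches(true, 0, KEYWORD, 0, KEYWORD.length());
        return isDoctype && !t.startsWith(KEYWORD);
    }
}
```

Feeding it the first line of each package.html would flag the nine files
listed above.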

Murray

......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .

   [International Committee of the Red Cross director] Kraehenbuhl
   pointed out that complying with international humanitarian law
   was "an obligation, not an option", for all sides of the conflict.
   "If these rules or any other applicable rules of international
   humanitarian law are violated, the persons responsible must be
   held accountable for their actions," he said. -- BBC News
   http://news.bbc.co.uk/1/hi/world/middle_east/4027163.stm

  "In my judgment, this new paradigm [the War on Terror] renders
   obsolete Geneva's strict limitations on questioning of enemy
   prisoners and renders quaint some of its provisions [...]
   Your determination [that the Geneva Conventions] does not apply
   would create a reasonable basis in law that [the War Crimes Act]
   does not apply, which would provide a solid defense to any future
   prosecution." -- Alberto Gonzales, appointed US Attorney General,
   and likely Supreme Court nominee, in a memo to George W. Bush
   http://www.adamhodges.com/WORLD/docs/gonzales_memo.pdf
