lucene-dev mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: "Advanced" query language
Date Tue, 06 Dec 2005 16:54:33 GMT
Yonik Seeley wrote:
> On 12/6/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>>Also I'd be curious to see a problem with Unicode code points in XML,
>>if you have one handy.
> 
> The definition of valid XML 1.0 characters:
> #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> 
> The simplest example is code-point 0.  It's a valid unicode character,
> but it's not a valid XML character (even when you replace it with an
> entity).
> Example: <tag>NullTerminated&#0;</tag>  is not valid XML
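
The character ranges quoted above can be turned into a small check that tells whether a term's text can be written as literal XML 1.0 character data or needs some escape mechanism. This is only a sketch: the class and method names (XmlChars, isXmlChar, isXmlSafe) are made up for illustration and are not part of Lucene.

```java
public final class XmlChars {
    private XmlChars() {}

    /** True if the code point falls in one of the XML 1.0 character ranges quoted above. */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if every code point of s may appear in XML 1.0 character data. */
    public static boolean isXmlSafe(String s) {
        // codePoints() joins surrogate pairs; an unpaired surrogate is
        // reported as a value in D800-DFFF and is therefore rejected.
        return s.codePoints().allMatch(XmlChars::isXmlChar);
    }

    public static void main(String[] args) {
        System.out.println(isXmlSafe("field"));   // true
        System.out.println(isXmlSafe("\u0000"));  // false: code point 0 is excluded
    }
}
```

A serializer could use a check like this to decide, per term, whether literal text is enough.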

Are you aware, though, of an existing Unicode serialization/markup 
mechanism without XML's gaps?

>>I'm confident that XML can accommodate our needs just fine, and any
>>other text transmission would have to re-solve many issues that XML
>>has already solved.
> 
> Agreed.  It wasn't a blocker, but it was something I wanted to see
> tackled up front.  It means adding a little more application logic to
> handle escaping/unescaping.
> 
> The bottom line is I want to be able to represent the perfectly valid
> lucene query new TermQuery(new Term("field","\u0000")).

Base64 is frequently used as an escape mechanism for binary data in XML.
It has the nice property that it can be used directly as XML character
data, since its standard representation does not use any XML metacharacters.

One possible solution to the escaping issue is a standard optional 
attribute named "encoding", the value of which could be extensible, with 
value "base64" built into the initial implementation.  Then, unless the 
attribute is present, all data is taken literally.  E.g. (taking Yonik's 
example 'TermQuery(new Term("field","\u0000"))'):

<TermQuery>
   <Term field="field" encoding="base64">AA==</Term>
</TermQuery>
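
For reference, here is a rough sketch of how the "AA==" value above can be produced and read back. It assumes the term text is converted to bytes as UTF-8 before Base64 encoding (the proposal does not pin down a charset) and uses java.util.Base64, which needs Java 8 or later. TermTextCodec is a hypothetical name, not an existing Lucene class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class TermTextCodec {
    private TermTextCodec() {}

    /** Encodes term text as Base64 over its UTF-8 bytes (the charset is an assumption). */
    public static String encode(String termText) {
        return Base64.getEncoder()
                     .encodeToString(termText.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(): Base64 text back to the original term text. */
    public static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("\u0000");
        System.out.println(encoded);                           // AA==
        System.out.println(decode(encoded).equals("\u0000"));  // true
    }
}
```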

Note that this solution would limit the serialization syntax, though, 
because unless there is a single attribute name for possibly-escaped 
data (very unlikely, methinks), escapable text would only be 
representable as text node children of elements, and *not* as attribute 
values.
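
On the reading side, keeping escapable text in element content means a parser only has to look at one place: the element's text plus its optional "encoding" attribute. A minimal sketch using the JDK's DOM parser follows; the element and attribute names come from the example above, while TermElementReader, its methods, and the UTF-8 assumption are mine.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public final class TermElementReader {
    private TermElementReader() {}

    /** Returns the term text, honoring the optional "encoding" attribute. */
    public static String termText(Element term) {
        String raw = term.getTextContent();
        String encoding = term.getAttribute("encoding"); // "" when the attribute is absent
        if (encoding.isEmpty()) {
            return raw; // no attribute: data is taken literally
        }
        if ("base64".equals(encoding)) {
            // Assumption: the encoded bytes are the UTF-8 form of the term text.
            return new String(Base64.getDecoder().decode(raw.trim()),
                              StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("Unsupported encoding: " + encoding);
    }

    /** Parses a document and decodes its first Term element. */
    public static String termTextFromXml(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            return termText((Element) root.getElementsByTagName("Term").item(0));
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse query XML", e);
        }
    }
}
```

An attribute value such as field="..." has no sibling attribute to carry its own encoding flag in this scheme, which is the limitation described above.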

Steve
