lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Indexing XML files
Date Tue, 05 Dec 2006 19:23:14 GMT

Since XML is the transport for sending data to Solr, you need to make sure
all field values are XML escaped.

If you wanted to index a plain text "title" and that title contained an
ampersand character....

	Sense & Sensibility

...you would need to XML escape that as...

	Sense &amp; Sensibility

...Solr internally will treat that consistently as the Java string "Sense
& Sensibility" and when it comes time to return that string back to your
query clients, will output it in whatever form is appropriate for your
ResponseWriter -- if that's XML, then it will be XML escaped again; if
it's JSON or something like it, it can probably be left alone.
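
As a rough illustration of the client-side step, here is a minimal Python
sketch that escapes a field value before placing it in the update message.
It assumes the client is written in Python and uses only the standard
library; the variable names are made up for the example and nothing here
is Solr's own code.

```python
from xml.sax.saxutils import escape, unescape

# The plain text value you actually want indexed.
title = "Sense & Sensibility"

# escape() replaces &, < and > with their XML entities.
escaped_title = escape(title)

# Hypothetical hand-built field element for the <add><doc> message.
field_xml = '<field name="title">%s</field>' % escaped_title
print(field_xml)

# unescape() reverses the step, which is roughly what any XML parser
# does when it reads the message on the receiving side.
assert unescape(escaped_title) == title
```

The escaping only exists on the wire: the value before escape() and after
unescape() is the same string.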

The same holds true for any other characters you want to include in your
field values: Solr doesn't care that the *value* itself is an XML string,
just that you properly escape the value in your XML <add><doc> message to
Solr...

 <add>
  <doc>
   <field name="title">As You Like it</field>
   <field name="author">Shakespeare, William</field>
   <field name="record">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</field>
  </doc>
 </add>

...does that make sense?
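
If you would rather not escape by hand, one option is to let an XML
library build the message, since it escapes text content when it
serializes. The sketch below uses Python's standard xml.etree.ElementTree;
build_add_message is a hypothetical helper name, not part of Solr or any
Solr client.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build an <add><doc> message from (name, value) pairs.

    ElementTree escapes &, < and > in text content on serialization,
    so the values are passed in as plain, unescaped strings.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


message = build_add_message([
    ("title", "As You Like it"),
    ("author", "Shakespeare, William"),
    # The record value is itself an XML string; it goes in unescaped.
    ("record", "<myxml>here goes the xml...</myxml>"),
])
print(message)
```

The serialized record field comes out with &lt; and &gt; in place of the
angle brackets, matching the hand-written message above.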

: Ideally, I would like to store the xml as is, and index only the content
: removing the xml-tags (I believe there is HTMLStripWhitespaceAnalyzer for
: that).
: And output the result as an xml (so, simple escaping does not work for me).

The escaping is just to send the data to Solr -- once sent, Solr will
process the unescaped string when dealing with analyzers, etc. exactly as
you'd expect.
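
To see why, here is a small Python sketch of what any XML parser hands
back when it reads such a message. This is only an illustration with the
standard library parser, not Solr's actual update handler.

```python
import xml.etree.ElementTree as ET

# An update message as it would travel over the wire, with the record
# value XML escaped.
message = (
    "<add><doc>"
    '<field name="record">'
    "&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;"
    "</field>"
    "</doc></add>"
)

root = ET.fromstring(message)

# The parser undoes the escaping: the field text is the original string.
record = root.find("doc/field[@name='record']").text
print(record)
```

The value that comes out of the parser is the raw <myxml> string, so that
unescaped string is what analyzers and stored fields would work with.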


-Hoss

