lucene-solr-user mailing list archives

From "Nick Snels" <nick.sn...@gmail.com>
Subject How to best index user-generated content
Date Wed, 20 Sep 2006 09:44:36 GMT
Hi,

I want users to add content to my site using tinyMCE, which generates HTML.
When I tried adding that HTML to Solr, Solr refused to add it (or at least
logged an error):

SEVERE: org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...<field name="text"><p>... @4:39)
    at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1071)
    at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:910)
    at org.apache.solr.core.SolrCore.update(SolrCore.java:685)
    at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
    at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:275)
    at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:80)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)

So I searched the archives to resolve this issue, since I didn't want to
strip out the HTML entirely. The solution proved to be to wrap the HTML
text in a CDATA section (<![CDATA[ ... ]]>), like so:

<add><doc>
   <field name="text"><![CDATA[#{field.text}]]></field>
</doc></add>
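
One caveat with the CDATA wrapper above: the wrapped text must not itself contain the sequence ]]>, because that ends the section early. Below is a minimal sketch in Python of how the wrapping could be done safely. The helper names wrap_cdata and field_xml are hypothetical and not part of Solr; the sketch only uses the Python standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>'."""
    # ']]>' would terminate the section early, so close the section after
    # ']]' and open a new one before '>'.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


def field_xml(name: str, text: str) -> str:
    """Build one <field> element whose content is CDATA-wrapped (hypothetical helper)."""
    return "<field name=%s>%s</field>" % (quoteattr(name), wrap_cdata(text))


# Round-trip check: an XML parser should hand back the original text unchanged.
sample = "<p>a ]]> b & c</p>"
parsed = ET.fromstring(field_xml("text", sample))
assert parsed.text == sample
```

The round-trip check at the end parses the generated element with a standard XML parser and confirms that the text survives unchanged, including the embedded ]]> sequence.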

This also drew my attention to another problem: characters like <, > and &
are all 'invalid' between XML tags. Does that mean I have to put
<![CDATA[ ... ]]> around every field I want to index? I don't know and
can't control what my users will input. Is this the only solution, or is
there a way for Solr to handle these 'invalid' characters in the indexed
text by itself, without generating errors?
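
For comparison, the usual XML alternative to CDATA is to escape the special characters on the client side before posting, so that & becomes &amp;amp; and < becomes &amp;lt;. Any conforming XML parser decodes these back to the original characters, so the receiving side should see the text as the user typed it. A minimal sketch in Python, assuming the document is built as a string before being sent; the function name build_add_doc is hypothetical and not a Solr API:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_add_doc(fields: dict) -> str:
    """Build an <add><doc>...</doc></add> message with escaped field values."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # escape() replaces &, < and > with their entity references;
        # quoteattr() quotes and escapes the attribute value.
        parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    parts.append("</doc></add>")
    return "".join(parts)


user_html = "<p>Fish & chips</p>"
message = build_add_doc({"text": user_html})

# Parsing the message should recover exactly what the user entered.
assert ET.fromstring(message).find("doc/field").text == user_html
```

Escaping every field value this way removes the need to decide per field whether a CDATA wrapper is required.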

Kind regards,

Nick
