lucene-solr-user mailing list archives

From Michael Belenki <v...@belenki.name>
Subject Problem while indexing XML file with special characters represented &uuml
Date Fri, 06 Jul 2012 08:58:59 GMT
Dear community,

I am experiencing a strange problem while trying to import an XML document
into Solr via DataImportHandler. The XML document contains some special
characters (e.g. the German ü) that are represented as XML entities such as
&uuml; or &auml;. There is also a DTD file that defines these entities
(<!ENTITY uuml    "&#252;" >). I tried both referencing the external DTD
file and including the DTD definitions in the XML itself. After I start a
full-import, the import process throws an exception as soon as it tries to
parse &uuml;: Undeclared general entity "uuml". Has anyone already faced
such a problem?

best regards,

Michael


My data-config for importing is:


<dataConfig>
    <dataSource type="FileDataSource" encoding="ISO-8859-1" />
    <document>
        <!-- stream should be true since a huge XML document is being parsed -->
        <entity name="article"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/dblp/article"
                url="documents/dblp.xml">
            <field column="key"   xpath="/dblp/article/@key" />
            <field column="title" xpath="/dblp/article/title" />
        </entity>
    </document>
</dataConfig>

The XML file looks like this, for example:

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE dblp [

    <!ENTITY uuml    "&#252;" ><!-- small u, dieresis or umlaut mark -->
]>
<dblp>

<article key="journals/fm/Riccardi09" mdate="2011-10-27">
<author>Marco Riccardi</author>
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
<pages>117-122</pages>
<year>2009</year>
<volume>17</volume>

<journal>Formalized Mathematics</journal>

<number>1-4</number>
<ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee><url>db/journals/fm/fm17.html#Riccardi09</url>
</article></dblp>
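
As a sanity check outside Solr, a sketch like the following (Python standard library only) can confirm that a DTD-aware parser resolves &uuml; from the internal subset of a trimmed copy of the sample above. If it does, that would suggest the file itself is well-formed and that the parser used during the import is not reading the DTD. The second helper is a possible fallback, not a fix for DataImportHandler itself: it rewrites named entities into numeric character references, which need no DTD. The names `title_with_dtd` and `to_numeric_refs` are hypothetical, and using the HTML entity table as a stand-in for the DBLP DTD is an assumption.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Trimmed copy of the sample document; the entity is declared
# in the internal DTD subset, exactly as in the original file.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
    <!ENTITY uuml    "&#252;" >
]>
<dblp>
<article key="journals/fm/Riccardi09" mdate="2011-10-27">
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
</article></dblp>
"""

# The five entities every XML parser knows without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def title_with_dtd(xml_bytes):
    """Parse with expat, which expands entities declared in the internal subset."""
    root = ET.fromstring(xml_bytes)
    return root.find("article/title").text


def to_numeric_refs(text):
    """Rewrite named entities (e.g. &uuml;) as numeric references (e.g. &#252;).

    Assumption: the names follow the HTML entity table. Predefined XML
    entities and unknown names are left untouched.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return ENTITY_REF.sub(repl, text)


if __name__ == "__main__":
    print(title_with_dtd(SAMPLE))
    print(to_numeric_refs("<title>K&uuml;hn &amp; M&auml;rz</title>"))
```

For a file as large as dblp.xml, the rewrite would have to be applied line by line rather than on the whole file in memory, and any entity names that are not in the HTML table would still need the DTD.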

The stack-trace is:

05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
05.07.2012 17:37:19 org.apache.solr.common.SolrException log
SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
        ... 5 more
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
 at [row,col {unknown-source}]: [26,42]
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$2.run(XPathEntityProcessor.java:427)
Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
 at [row,col {unknown-source}]: [26,42]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:467)
        at com.ctc.wstx.sr.BasicStreamReader.handleUndeclaredEntity(BasicStreamReader.java:5431)
        at com.ctc.wstx.sr.StreamScanner.expandUnresolvedEntity(StreamScanner.java:1661)
        at com.ctc.wstx.sr.StreamScanner.expandEntity(StreamScanner.java:1555)
        at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1523)
        at com.ctc.wstx.sr.BasicStreamReader.skipTokenText(BasicStreamReader.java:3568)
        at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3342)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2622)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:376)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$200(XPathRecordReader.java:202)
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:184)
        ... 1 more

05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback

