lucene-solr-user mailing list archives

From Mike Sokolov <soko...@ifactory.com>
Subject Re: Problem while indexing XML file with special characters represented &uuml
Date Tue, 10 Jul 2012 16:42:18 GMT
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't 
use a true XML parser?
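
One way to test that guess outside Solr is to feed a cut-down sample to a parser that does read the DOCTYPE, and then feed it the same sample with the DOCTYPE stripped. The sketch below uses Python's xml.etree.ElementTree purely for illustration; it says nothing about what XPathEntityProcessor does internally. The sample is abbreviated from the quoted message, and the entity value "&#252;" is an assumption about how the DTD declares uuml.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample; the "&#252;" replacement text is an assumption.
WITH_DTD = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
  <!ENTITY uuml "&#252;">
]>
<dblp><article key="journals/fm/Riccardi09">
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
</article></dblp>"""

# Same document without the DOCTYPE, imitating a parser that ignores the DTD.
WITHOUT_DTD = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp><article key="journals/fm/Riccardi09">
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
</article></dblp>"""


def title_or_error(xml_bytes):
    """Return the first article title, or a short message if parsing fails."""
    try:
        return ET.fromstring(xml_bytes).find("article/title").text
    except ET.ParseError as exc:
        return "parse error: %s" % exc


print(title_or_error(WITH_DTD))     # entity is expanded from the internal subset
print(title_or_error(WITHOUT_DTD))  # entity is unknown, so parsing fails
```

If the first call returns the title with the umlaut and the second fails on the entity, the document itself is fine and the question is only whether the importing parser reads the DTD.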

You might want to try passing your documents through "xmllint -noent" 
(basically parse and reserialize) - that should inline the characters as 
UTF-8?
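
For illustration, the same parse-and-reserialize idea can be sketched in Python instead of xmllint. This is a minimal sketch under assumptions, not a drop-in replacement: the function name inline_entities is made up for the example, it only handles entities declared in the internal DTD subset, and it loads the whole document into memory, which may not be workable for a very large file.

```python
import xml.etree.ElementTree as ET


def inline_entities(xml_bytes: bytes) -> bytes:
    """Parse a document and reserialize it as UTF-8 with entities expanded."""
    # Entities declared in the internal DTD subset are expanded while parsing.
    root = ET.fromstring(xml_bytes)
    # The DOCTYPE is not carried over; the output holds literal characters.
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample; the entity value "&#252;" is an assumption.
    sample = (
        b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
        b'<!DOCTYPE dblp [ <!ENTITY uuml "&#252;"> ]>\n'
        b'<dblp><article key="k1"><title>M&uuml;ller</title></article></dblp>'
    )
    print(inline_entities(sample).decode("utf-8"))
```

The reserialized output no longer contains &uuml; or a DOCTYPE, so the importer would not need to resolve any entity. If the output is written as UTF-8, the encoding attribute of the dataSource in the data-config would presumably have to change to match.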

On 07/09/2012 03:18 PM, Michael Belenki wrote:
> Does anybody have an idea? Solr seems to ignore the DTD definition and
> therefore does not understand entities like &uuml; or &auml; that are
> defined in the DTD. Is that the problem? If so, how can I tell Solr to
> take the DTD definition into account?
>
> On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki <vawi@belenki.name>
> wrote:
>
>> Dear community,
>>
>> I am experiencing a strange problem while trying to import an XML
>> document into Solr via the DataImportHandler. The XML document contains
>> some special characters (e.g. German ü) that are represented as XML
>> entities such as &uuml; or &auml;. There is also a DTD file that defines
>> these entities (<!ENTITY uuml    "ü">); I tried using the DTD file as
>> well as including the DTD definition in the XML itself. After I start
>> the full-import command, the import process throws an exception as soon
>> as it tries to parse &uuml;: Undeclared general entity "uuml". Has
>> anyone already faced such a problem?
>>
>> best regards,
>>
>> Michael
>>
>>
>> My data-config for importing is:
>>
>>
>> <dataConfig>
>>     <dataSource type="FileDataSource" encoding="ISO-8859-1" />
>>     <document>
>>         <!-- stream should be true since a huge XML document is being parsed -->
>>         <entity name="article"
>>                 processor="XPathEntityProcessor"
>>                 stream="true"
>>                 forEach="/dblp/article"
>>                 url="documents/dblp.xml">
>>             <field column="key"   xpath="/dblp/article/@key" />
>>             <field column="title" xpath="/dblp/article/title" />
>>         </entity>
>>     </document>
>> </dataConfig>
>>
>> The XML file looks e.g. like this:
>>
>> <?xml version="1.0" encoding="ISO-8859-1"?>
>>
>> <!DOCTYPE dblp [
>>
>>      <!ENTITY uuml    "ü"><!-- small u, dieresis or umlaut mark -->
>> ]>
>> <dblp>
>>
>> <article key="journals/fm/Riccardi09" mdate="2011-10-27">
>> <author>Marco Riccardi</author>
>> <title>Solution of Cubic and Quartic Equations.&uuml;</title>
>> <pages>117-122</pages>
>> <year>2009</year>
>> <volume>17</volume>
>>
>> <journal>Formalized Mathematics</journal>
>>
>> <number>1-4</number>
>>
>> <ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee>
>> <url>db/journals/fm/fm17.html#Riccardi09</url>
>> </article></dblp>
>>
>> The stack-trace is:
>>
>> 05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
>> INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
>> 05.07.2012 17:37:19 org.apache.solr.common.SolrException log
>> SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
>>         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
>>         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
>>         at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
>> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
>>         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
>>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
>>         ... 3 more
>> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
>>         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
>>         at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
>>         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
>>         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
>>         ... 5 more
>> Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
>>  at [row,col {unknown-source}]: [26,42]
>>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor$2.run(XPathEntityProcessor.java:427)
>> Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml"
>>  at [row,col {unknown-source}]: [26,42]
>>         at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
>>         at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:467)
>>         at com.ctc.wstx.sr.BasicStreamReader.handleUndeclaredEntity(BasicStreamReader.java:5431)
>>         at com.ctc.wstx.sr.StreamScanner.expandUnresolvedEntity(StreamScanner.java:1661)
>>         at com.ctc.wstx.sr.StreamScanner.expandEntity(StreamScanner.java:1555)
>>         at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1523)
>>         at com.ctc.wstx.sr.BasicStreamReader.skipTokenText(BasicStreamReader.java:3568)
>>         at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3342)
>>         at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2622)
>>         at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:376)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$200(XPathRecordReader.java:202)
>>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:184)
>>         ... 1 more
>>
>> 05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: start rollback
>> 05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: end_rollback
>>      
