lucene-solr-user mailing list archives

From Shawn Heisey <>
Subject Re: Oddity with importing documents...
Date Fri, 06 May 2016 14:20:46 GMT
On 5/6/2016 6:38 AM, Betsey Benagh wrote:
> Since it appears that using a recent version of Tika with Solr is not really feasible,
> I'm trying to run Grobid on my files, and then import the
> corresponding XML into Solr.
> I don't see any errors on the post:
> bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
> /Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java
> -classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar
> -Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool
> /Users/bba0124/software/grobid/out/021002_1.tei.xml
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file 021002_1.tei.xml (application/xml) to [base]
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/lrdtest/update...
> Time spent: 0:00:00.027
> But the documents don't seem to show up in the index, either.
> Additionally, if I try uploading the documents using the web UI, they
> appear to upload successfully,
> Response:{
>   "responseHeader": {
>     "status": 0,
>     "QTime": 7
>   }
> }
> But aren't in the index.
> What am I missing?

The way that you have used bin/post assumes that the XML is in the Solr
xml update format.  Is your XML file in that format, or is it something
else generated by Tika?  A 'bad' XML file will not necessarily throw an
error; it might simply be ignored because it does not contain any
actions for Solr to process.
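
To illustrate what the Solr XML update format looks like, here is a rough sketch that wraps a couple of values from a TEI file in the `<add><doc><field name="...">` structure that /update acts on. The function name `tei_to_solr_add`, the field names `id`, `title` and `text`, and the choice of TEI elements to pull from are all assumptions made for this example; the field names would have to match whatever the schema of the target collection actually defines.

```python
import xml.etree.ElementTree as ET

# Assumption: the input uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_solr_add(tei_xml: str, doc_id: str) -> str:
    """Return a Solr <add><doc>...</doc></add> message built from a TEI string.

    Hypothetical helper: the field names (id, title, text) are placeholders
    and must match the schema of the collection being indexed.
    """
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    body_el = root.find(".//tei:body", TEI_NS)

    # Flatten element content to plain text; collapse whitespace in the body.
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    body = " ".join("".join(body_el.itertext()).split()) if body_el is not None else ""

    # Build the update message: one <doc> with one <field> per value.
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("text", body)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

The returned string could be written to a file ending in .xml and posted with the same bin/post command shown in the quoted message. A file whose root element is something other than an update action such as `<add>` gives Solr nothing to index, which would be consistent with a status 0 response and no new documents.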

If it's some other kind of XML data generated by Tika, then I am not
sure what you need to do in order to get the information into Solr. 
Perhaps it needs to be sent through the /update/extract handler (instead
of /update), or maybe you will need to use DIH to run it through the

