lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Oddity with importing documents...
Date Fri, 06 May 2016 14:20:46 GMT
On 5/6/2016 6:38 AM, Betsey Benagh wrote:
> Since it appears that using a recent version of Tika with Solr is not
> really feasible, I'm trying to run Grobid on my files, and then import
> the corresponding XML into Solr.
>
> I don't see any errors on the post:
>
> bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
> /Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java
> -classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar
> -Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool
> /Users/bba0124/software/grobid/out/021002_1.tei.xml
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file 021002_1.tei.xml (application/xml) to [base]
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/lrdtest/update...
> Time spent: 0:00:00.027
>
> But the documents don't seem to show up in the index, either.
>
>
> Additionally, if I try uploading the documents using the web UI, they
> appear to upload successfully,
>
> Response:{
>   "responseHeader": {
>     "status": 0,
>     "QTime": 7
>   }
> }
>
> But aren't in the index.
>
> What am I missing?

The way that you have used bin/post assumes that the XML is in the Solr
XML update format.  Is your XML file in that format, or is it something
else generated by Tika?  A 'bad' XML file will not necessarily throw an
error; it might simply be ignored because it does not contain any
actions (such as <add> or <delete>) for Solr to process.

https://wiki.apache.org/solr/UpdateXmlMessages
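
For illustration, here is a minimal sketch of one way to wrap a few values
from a TEI file in the <add><doc> structure that /update expects.  It
assumes the input uses the standard TEI namespace and keeps its title and
abstract under titleStmt and profileDesc; the function name
tei_to_solr_add and the Solr field names "id", "title" and "abstract" are
hypothetical and would need to match whatever the collection's schema
actually defines.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be the one used by the input file.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_solr_add(tei_xml, doc_id):
    """Build a Solr XML update message (<add><doc>...) from a TEI document.

    The field names used here are placeholders, not anything Solr requires
    beyond the collection's own uniqueKey.
    """
    root = ET.fromstring(tei_xml)

    def text_of(path):
        node = root.find(path, TEI_NS)
        if node is None:
            return ""
        # itertext() flattens nested markup; split/join collapses whitespace.
        return " ".join("".join(node.itertext()).split())

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": doc_id,
        "title": text_of(".//tei:titleStmt/tei:title"),
        "abstract": text_of(".//tei:profileDesc/tei:abstract"),
    }
    for name, value in fields.items():
        if value:  # skip fields that were not found in the TEI input
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")
```

If the output of something like this is saved to a file, posting that file
with bin/post in the same way as above should give Solr an actual <add>
action to process, provided the fields exist in the schema.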

If it's some other kind of XML data generated by Tika, then I am not
sure what you need to do in order to get the information into Solr. 
Perhaps it needs to be sent through the /update/extract handler (instead
of /update), or maybe you will need to use DIH to run it through the
XPathEntityProcessor.

Thanks,
Shawn

