lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Oddity with importing documents...
Date Fri, 06 May 2016 15:31:03 GMT
Shawn's spot on in identifying your problem, I think.

Actually, I'm not sure what happens if you just replace the Tika jars
in Solr. I doubt it'd work, but it _might_.

Personally, I'm not a great fan of using SolrCell in production:
you're putting all the extraction work on the Solr server that's also
indexing and serving queries. With not very much effort, you can use
Java/SolrJ to parse the docs on as many clients as you want and send
them to Solr; see:
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick
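
A rough sketch of the "do the work on the client, then send to Solr" idea
above. This is not SolrJ; it is plain Python using only the standard
library, and it assumes the text has already been extracted on the client
by whatever parser you prefer. The function name build_update_request and
the field names (id, text) are hypothetical; the fields would have to
exist in your schema. It relies on Solr's /update handler accepting a JSON
array of documents.

```python
import json
import urllib.request


def build_update_request(solr_url, collection, docs, commit=True):
    """Build (but do not send) a POST to Solr's /update handler.

    docs is a list of dicts, one per document, already parsed on the client.
    """
    url = f"{solr_url.rstrip('/')}/{collection}/update"
    if commit:
        # Ask Solr to commit right away so the documents become searchable.
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running Solr; the field names are assumptions):
# req = build_update_request("http://localhost:8983/solr", "lrdtest",
#                            [{"id": "021002_1", "text": "extracted text"}])
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Because the parsing happens before the request is built, it can run on as
many client machines as you like, and Solr only sees finished documents.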


On Fri, May 6, 2016 at 7:20 AM, Shawn Heisey <apache@elyograg.org> wrote:
> On 5/6/2016 6:38 AM, Betsey Benagh wrote:
>> Since it appears that using a recent version of Tika with Solr is not
>> really feasible, I'm trying to run Grobid on my files, and then import
>> the corresponding XML into Solr.
>>
>> I don't see any errors on the post:
>>
>> bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
>> /Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java
>> -classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar
>> -Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool
>> /Users/bba0124/software/grobid/out/021002_1.tei.xml
>> SimplePostTool version 5.0.0
>> Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
>> Entering auto mode. File endings considered are
>> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>> POSTing file 021002_1.tei.xml (application/xml) to [base]
>> 1 files indexed.
>> COMMITting Solr index changes to
>> http://localhost:8983/solr/lrdtest/update...
>> Time spent: 0:00:00.027
>>
>> But the documents don't seem to show up in the index, either.
>>
>>
>> Additionally, if I try uploading the documents using the web UI, they
>> appear to upload successfully,
>>
>> Response:{
>>   "responseHeader": {
>>     "status": 0,
>>     "QTime": 7
>>   }
>> }
>>
>> But aren't in the index.
>>
>> What am I missing?
>
> The way that you have used bin/post assumes that the XML is in the Solr
> xml update format.  Is your XML file in that format, or is it something
> else generated by Tika?  A 'bad' XML file will not necessarily throw an
> error; it might simply be ignored because it does not contain any
> actions for Solr to process.
>
> https://wiki.apache.org/solr/UpdateXmlMessages
>
> If it's some other kind of XML data generated by Tika, then I am not
> sure what you need to do in order to get the information into Solr.
> Perhaps it needs to be sent through the /update/extract handler (instead
> of /update), or maybe you will need to use DIH to run it through the
> XPathEntityProcessor.
>
> Thanks,
> Shawn
>
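
To illustrate Shawn's point about the Solr XML update format: /update
expects an <add><doc><field name="...">...</field></doc></add> message,
not arbitrary XML. Below is a minimal sketch, in Python with only the
standard library, of turning a TEI file into such a message. The element
paths (titleStmt/title for the title, text for the body) are assumptions
about what the Grobid output looks like, and the function name and the
field names id, title and text are hypothetical; the fields must exist in
your schema.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be the one used in the input file.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_solr_add(tei_xml, doc_id):
    """Convert one TEI document into a Solr <add> update message (a str)."""
    root = ET.fromstring(tei_xml)

    # Title: assumed to live under titleStmt/title in the TEI header.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    # Body: all text under the TEI <text> element, whitespace collapsed.
    body_el = root.find(".//tei:text", TEI_NS)
    body = " ".join("".join(body_el.itertext()).split()) if body_el is not None else ""

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("text", body)):
        if value:  # skip empty fields
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")
```

The resulting string can be saved to a file and posted with bin/post as in
the original message, since it now contains an action for Solr to process.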
