lucene-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-7139) ExtractingRequestHandler default solrconfig.xml ignores div tags which breaks TikaOCR
Date Sun, 22 Feb 2015 03:57:12 GMT
Chris A. Mattmann created SOLR-7139:
---------------------------------------

             Summary: ExtractingRequestHandler default solrconfig.xml ignores div tags which breaks TikaOCR
                 Key: SOLR-7139
                 URL: https://issues.apache.org/jira/browse/SOLR-7139
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
            Reporter: Chris A. Mattmann
            Priority: Critical
             Fix For: 4.10.4


While testing my large-scale Tika/SolrCell indexing (great work on /extraction guys, really
really appreciate it) on my 40M image dataset, I was pulling my frickin' hair out trying to
figure out why the TesseractOCR extracted content wasn't actually making it into the index.
Well, I figured it out lol (many, many System.out.printlns later): it's the disabling of div
tags (=>ignored) in the default solrconfig.xml. This basically renders TesseractOCR output
in SolrCell useless, since it is surrounded by a div tag.
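
For anyone wanting to check their own config for this, here is a minimal Python sketch. It
assumes the "disabling of div tags" corresponds to an fmap.div parameter mapped to an
ignored_ field under the ExtractingRequestHandler; that parameter name and value, the sample
XML, and the helper name find_ignored_div_mappings are illustrative assumptions, not taken
from this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on a typical example solrconfig.xml.
# The exact contents of any real config are an assumption here.
SAMPLE_SOLRCONFIG = """
<config>
  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
</config>
"""


def find_ignored_div_mappings(solrconfig_xml):
    """Return names of extract handlers that map div content to an ignored_ field."""
    root = ET.fromstring(solrconfig_xml)
    flagged = []
    for handler in root.iter("requestHandler"):
        # Only look at handlers backed by the Solr Cell extraction class.
        if "ExtractingRequestHandler" not in handler.get("class", ""):
            continue
        # Scan every <str> parameter nested under this handler.
        for param in handler.iter("str"):
            value = (param.text or "").strip()
            if param.get("name") == "fmap.div" and value.startswith("ignored_"):
                flagged.append(handler.get("name"))
    return flagged


if __name__ == "__main__":
    print(find_ignored_div_mappings(SAMPLE_SOLRCONFIG))
```

If a handler is flagged, removing or remapping that parameter is one possible workaround to
try; whether that matches the eventual fix for this issue is not stated here.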



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

