lucene-solr-user mailing list archives

From "Teague James" <teag...@insystechinc.com>
Subject DIH and Tika
Date Mon, 17 Feb 2014 21:23:05 GMT
Is there a way to restrict which document types Tika parses? My DIH
configuration indexes the content of a SQL database, which has a field that
points to each record's binary file (which could be Word, PDF, JPG, MOV,
etc.). Tika then uses that document URL to index the document's content.
However, there are many document types that Tika cannot parse. I'd like to
limit Tika to parsing only Word and PDF documents so that I don't have to
wait for it to determine each document's type and whether it can parse it. I
suspect that the exceptions thrown for documents Tika cannot read are
increasing my indexing time significantly. Any guidance is appreciated.
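
The kind of pre-filter described above could be sketched roughly as follows.
This is only an illustration, not a DIH or Tika API: the function name
should_send_to_tika, the ALLOWED_EXTENSIONS set, and the example URLs are all
hypothetical, and it assumes the file type can be judged from the extension
in the document URL, which may not hold for every record.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical allow-list: extensions assumed to identify Word and PDF files.
ALLOWED_EXTENSIONS = {".doc", ".docx", ".pdf"}


def should_send_to_tika(document_url: str) -> bool:
    """Return True if the URL path ends in one of the allowed extensions.

    Only the path component is inspected, so query strings and fragments
    do not affect the result. The comparison is case-insensitive.
    """
    path = urlparse(document_url).path
    _, extension = splitext(path)
    return extension.lower() in ALLOWED_EXTENSIONS


if __name__ == "__main__":
    # Made-up URLs standing in for the values of the database field.
    example_urls = [
        "http://example.com/files/report.pdf",
        "http://example.com/files/notes.DOCX",
        "http://example.com/files/clip.mov",
    ]
    for url in example_urls:
        print(url, should_send_to_tika(url))
```

A check like this would have to run before the URL reaches Tika, so that
unsupported files are never fetched or type-detected at all.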

-Teague

