lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Using Tika that comes with Solr 5.2
Date Wed, 03 Feb 2016 19:34:09 GMT
> Be aware, though, that some parser dependencies are not included with the Solr distribution,
> and, because of the way that Tika currently works, you'll silently get no text/metadata
> from those file types (e.g. sqlite files and others).  See [1] for some discussion of this.
> If you want the full Tika (with all of its messiness) and you are already using SolrJ,
> use the tika-app.jar.

Correction: I just realized that the above is only mostly true.  We aren't packaging the sqlite
jar in tika-app any more (for the same reason that Solr doesn't: native libs), so you'll have
to grab that jar and add it to your classpath. :)

See also, very recently: https://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3C027601d15ea8%2443ffcf90%24cbff6eb0%24%40thetaphi.de%3E


-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Wednesday, February 03, 2016 7:35 AM
To: solr-user@lucene.apache.org
Subject: RE: Using Tika that comes with Solr 5.2

Right.  Thank you for reporting the solution.  

Be aware, though, that some parser dependencies are not included with the Solr distribution,
and, because of the way that Tika currently works, you'll silently get no text/metadata from
those file types (e.g. sqlite files and others).  See [1] for some discussion of this.  If
you want the full Tika (with all of its messiness) and you are already using SolrJ, use the
tika-app.jar.
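
Because a missing parser dependency fails silently, it can help to check up front whether the relevant classes are loadable at all. The following is only a minimal sketch: ParserClasspathCheck is a hypothetical helper (not part of Tika or Solr), and the class names passed in main are assumed to be representative classes from tika-core, tika-parsers, PDFBox, and sqlite-jdbc.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reports whether classes that Tika parsers rely on
// are visible on the current classpath.
class ParserClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Checks each name in order and records the result.
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isLoadable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed representative classes; adjust to the file types you care about.
        Map<String, Boolean> report = check(
                "org.apache.tika.parser.AutoDetectParser", // tika-core
                "org.apache.tika.parser.pdf.PDFParser",    // tika-parsers
                "org.apache.pdfbox.pdmodel.PDDocument",    // PDFBox, used for PDF
                "org.sqlite.JDBC");                        // sqlite-jdbc
        report.forEach((name, ok) ->
                System.out.println((ok ? "found    " : "MISSING  ") + name));
    }
}
```

Run it with the same classpath as the crawler; any MISSING line points at a jar that still needs to be added.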

Your code will correctly extract content from embedded documents, but it will not extract
metadata from embedded documents/attachments (SOLR-7229).  If you want to be able to process
metadata from embedded docs, you might consider the RecursiveParserWrapper.

Note, too, that if you send in a ParseContext (SOLR-7189) in your call to parse, make sure
to add the AutoDetectParser or else you will get no content from embedded docs.

Both of these will get embedded content:

parser.parse(in, contentHandler, metadata);

Or

ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(in, contentHandler, metadata, context);

This will not:
ParseContext context = new ParseContext();
parser.parse(in, contentHandler, metadata, context);


As you've already done, feel free to ask more Tika-specific questions over on tika-user.

Cheers,

           Tim

[1] https://issues.apache.org/jira/browse/TIKA-1511?focusedCommentId=14385803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14385803

-----Original Message-----
From: Steven White [mailto:swhite4141@gmail.com]
Sent: Tuesday, February 02, 2016 7:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Tika that comes with Solr 5.2

I found my issue.  I need to include JARs off: \solr\contrib\extraction\lib\
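
For anyone scripting this, the jars in that directory can be joined into a classpath string. This is only a sketch under assumptions: JarClasspath is a hypothetical helper, and the directory is passed as an argument rather than hard-coded. On a plain command line, a wildcard entry such as -cp "lib/*" achieves the same thing.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a classpath string from every .jar in one directory.
class JarClasspath {

    static String fromDirectory(Path libDir) throws IOException {
        List<String> jars = new ArrayList<>();
        // The glob keeps only entries whose file name ends in .jar (non-recursive).
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toAbsolutePath().toString());
            }
        }
        Collections.sort(jars); // stable order, independent of the file system
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JarClasspath <lib-directory>");
            return;
        }
        System.out.println(fromDirectory(Paths.get(args[0])));
    }
}
```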

Steve

On Tue, Feb 2, 2016 at 4:24 PM, Steven White <swhite4141@gmail.com> wrote:

> I'm not using tika-app.jar.  I need to stick with Tika JARs that come 
> with Solr 5.2 and yet get the full text extraction feature of Tika 
> (all file types it supports).
>
> At first, I started to include Tika JARs as needed; I now have all 
> Tika related JARs that come with Solr and yet it is not working.  Here 
> is the
> list: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, 
> tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, 
> kite-morphlines-tika-core-0.12.1.jar
> and kite-morphlines-tika-decompress-0.12.1.jar.  As part of my 
> program, I also have SolrJ JARs and their dependency:
> solr-solrj-5.2.1.jar, solr-core-5.2.1.jar, etc.
>
> You said "Might not have the parsers on your path within your Solr 
> framework?".  I'm using Tika outside Solr framework.  I'm trying to 
> use Tika from my own crawler application that uses SolrJ to send the 
> raw text to Solr for indexing.
>
> What is it that I am missing?!
>
> Steve
>
> On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B. 
> <tallison@mitre.org>
> wrote:
>
>> Might not have the parsers on your path within your Solr framework?
>>
>> Which tika jars are on your path?
>>
>> If you want the functionality of all of Tika, use the standalone 
>> tika-app.jar, but do not use the app in the same JVM as 
>> Solr...without a custom class loader.  The Solr team carefully prunes 
>> the dependencies when integrating Tika and makes sure that the main
>> parsers _just work_.
>>
>>
>> -----Original Message-----
>> From: Steven White [mailto:swhite4141@gmail.com]
>> Sent: Tuesday, February 02, 2016 2:53 PM
>> To: solr-user@lucene.apache.org
>> Subject: Using Tika that comes with Solr 5.2
>>
>> Hi,
>>
>> I'm trying to use Tika that comes with Solr 5.2.  The following code 
>> is not
>> working:
>>
>> public static void parseWithTika() throws Exception {
>>     File file = new File("C:\\temp\\test.pdf");
>>
>>     FileInputStream in = new FileInputStream(file);
>>     AutoDetectParser parser = new AutoDetectParser();
>>     Metadata metadata = new Metadata();
>>     metadata.add(Metadata.RESOURCE_NAME_KEY, file.getName());
>>     BodyContentHandler contentHandler = new BodyContentHandler();
>>
>>     parser.parse(in, contentHandler, metadata);
>>
>>     String content = contentHandler.toString();   // <=== 'content' is always empty
>>
>>     in.close();
>> }
>>
>> 'content' is always an empty string unless the file I pass to Tika 
>> is a text file.  Any idea what the issue is?
>>
>> I have also tried the sample code from
>> https://tika.apache.org/1.8/examples.html
>> with the same result.
>>
>>
>> Thanks !!
>>
>> Steve
>>
>
>