lucene-dev mailing list archives

From Tomás Fernández Löbbe (JIRA) <>
Subject [jira] Commented: (SOLR-1526) Client Side Tika integration
Date Sun, 19 Dec 2010 23:19:01 GMT


Tomás Fernández Löbbe commented on SOLR-1526:

I have a possible implementation for this issue. I created a class, SolrFileInputDocument, that
extends SolrInputDocument. The main difference is that it adds these two methods:

public void addFile(InputStream file)


public void addFile(InputStream file, Metadata metadata)

These two methods use Tika to extract the content and then create fields (this.addField(...))
on the parent class SolrInputDocument. SolrFileInputDocument accepts a Map instance that
maps each extracted metadata name to a Solr field, something like this:

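		import java.util.HashMap;
		import java.util.Map;

		// Key: extracted metadata name; value: target Solr field
		Map<String, String> map = new HashMap<String, String>();
		map.put("content", "text");
		map.put("keywords", "cat");
		map.put("creator", "manu");
		SolrFileInputDocument document = new SolrFileInputDocument(map);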
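
To make the mapping step concrete, here is a minimal sketch that runs without Tika or SolrJ on the classpath. It is not the actual patch: FieldMappingSketch, addExtracted and getFieldValues are hypothetical names, the extracted values are hard-coded where Tika output would be, and a plain Map stands in for the fields of SolrInputDocument. Skipping metadata that has no mapping is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed SolrFileInputDocument; not the actual patch.
class FieldMappingSketch {
    // Extracted metadata name -> Solr field name (same shape as the constructor argument).
    private final Map<String, String> fieldMap;
    // Stand-in for the fields held by SolrInputDocument: field name -> values.
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    FieldMappingSketch(Map<String, String> fieldMap) {
        this.fieldMap = new HashMap<>(fieldMap);
    }

    // Plays the role of this.addField(...): values accumulate per field name.
    void addField(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Assumed behavior: copy every extracted entry that has a mapping; skip the rest.
    void addExtracted(Map<String, String> extracted) {
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String solrField = fieldMap.get(entry.getKey());
            if (solrField != null) {
                addField(solrField, entry.getValue());
            }
        }
    }

    // Returns the values stored for a field, or null if nothing was added to it.
    List<Object> getFieldValues(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("content", "text");
        map.put("creator", "manu");

        // Pretend these values came out of Tika for some file.
        Map<String, String> extracted = new HashMap<>();
        extracted.put("content", "Hello from a PDF");
        extracted.put("creator", "Some Author");
        extracted.put("producer", "skipped: no mapping");

        FieldMappingSketch document = new FieldMappingSketch(map);
        document.addExtracted(extracted);
        System.out.println(document.getFieldValues("text"));
        System.out.println(document.getFieldValues("manu"));
    }
}
```

In the real class, addExtracted would presumably be driven by the Tika parse result inside addFile(...), and addField would be the inherited SolrInputDocument method.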

I added the classes to a separate "contrib" directory. I don't know whether this is the right
way to do it; I just didn't want to add a dependency on Tika that might not always be needed. Using
this code in a client application would require adding the SolrJ jar plus the "clientextraction"

I still haven't done anything to keep the "prefix" feature of the ExtractingRequestHandler
(which I don't think will be difficult), and I don't yet handle non-text fields
like dates, but I could add both if you think this is a good approach.
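
As a rough idea of what those two missing pieces might look like, here is a small sketch. It assumes the "prefix" feature behaves like the handler's uprefix parameter, applied on the client to any metadata name without an explicit mapping (the server applies it to fields not defined in the schema, which the client cannot check). It also assumes date metadata arrives as ISO-8601 text with an offset, which Tika does not guarantee. PrefixAndDateSketch, resolveFieldName and toSolrDate are hypothetical names.

```java
import java.time.OffsetDateTime;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helpers; neither method is part of SolrJ, Tika, or the patch.
class PrefixAndDateSketch {

    // Mapped names win; unmapped names get the prefix, or stay unchanged when prefix is null.
    static String resolveFieldName(String metadataName, Map<String, String> fieldMap, String prefix) {
        String mapped = fieldMap.get(metadataName);
        if (mapped != null) {
            return mapped;
        }
        return prefix == null ? metadataName : prefix + metadataName;
    }

    // Solr date fields expect UTC in the form 1995-12-31T23:59:59Z.
    // Assumption: the input is ISO-8601 with an offset, e.g. 2010-12-19T20:19:01-03:00.
    static String toSolrDate(String isoDateWithOffset) {
        return OffsetDateTime.parse(isoDateWithOffset).toInstant().toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("creator", "manu");

        System.out.println(resolveFieldName("creator", map, "attr_"));
        System.out.println(resolveFieldName("producer", map, "attr_"));
        System.out.println(toSolrDate("2010-12-19T20:19:01-03:00"));
    }
}
```

Real documents would need a more tolerant date parser, since metadata dates come in many formats.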

Do you think this could work? I can upload the code tomorrow.

> Client Side Tika integration
> ----------------------------
>                 Key: SOLR-1526
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Grant Ingersoll
>            Priority: Minor
>             Fix For: Next
> Often times it is cost prohibitive to send full, rich documents over the wire.  The contrib/extraction
> library has server side integration with Tika, but it would be nice to have a client side
> implementation as well.  It should support both metadata and content or just metadata.

