lucene-solr-dev mailing list archives

From "Noble Paul (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1358) Integration of Solr Cell and DataImportHandler
Date Thu, 03 Sep 2009 07:07:32 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750855#action_12750855 ]

Noble Paul commented on SOLR-1358:
----------------------------------

Let us provide a new TikaEntityProcessor 

{code:xml}
<entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}">
</entity>
{code}

This would most likely need a BinUrlDataSource/BinContentStreamDataSource, because Tika
reads binary input.
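
As a rough sketch of how this could fit into a full DIH configuration that combines
database columns with PDF content: the data source declarations, the table and column
names, the "text" output column, and the field names below are all assumptions made for
illustration, not a settled design.

{code:xml}
<!-- Hypothetical data-config.xml. Everything except the TikaEntityProcessor
     entity attributes shown above is an illustrative assumption. -->
<dataConfig>
  <!-- JDBC source for the database columns (driver and URL are placeholders) -->
  <dataSource name="db" driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example" user="sa"/>
  <!-- Proposed binary data source; the type name is assumed -->
  <dataSource name="bin" type="BinUrlDataSource"/>
  <document>
    <!-- Outer entity: one row per document, including the location of its PDF -->
    <entity name="item" dataSource="db" query="select id, title, pdf_url from item">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- Nested entity: Tika extracts text from the PDF referenced by the current row -->
      <entity name="pdf" processor="TikaEntityProcessor" dataSource="bin"
              tikaConfig="tikaconfig.xml" url="${item.pdf_url}">
        <!-- Assumes the processor exposes the extracted body as a "text" column -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
{code}

Nesting the Tika entity inside the database entity would let each row's PDF be resolved
through the usual DIH variable substitution, so both sources end up in the same document.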

My suggestion is that TikaEntityProcessor live in the extraction contrib so that managing
dependencies is easier. But we will have to make extraction have a compile-time dependency
on DIH. 

Grant, what do you think?

> Integration of Solr Cell and DataImportHandler
> ----------------------------------------------
>
>                 Key: SOLR-1358
>                 URL: https://issues.apache.org/jira/browse/SOLR-1358
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Sascha Szott
>
> At the moment, it's impossible to configure Solr so that it builds documents from data
that comes from both PDF documents and database table columns. Currently, to accomplish
this task, it's up to the user to add a preprocessing step that converts PDF files into plain
text files. Therefore, I would like to see an integration of Solr Cell into DIH that makes
that preprocessing obsolete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

