lucene-solr-dev mailing list archives

From "Noble Paul (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-1358) Integration of Tika and DataImportHandler
Date Wed, 09 Dec 2009 04:49:18 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750855#action_12750855 ]

Noble Paul edited comment on SOLR-1358 at 12/9/09 4:48 AM:
-----------------------------------------------------------

Let us provide a new TikaEntityProcessor:

{code:xml}
<dataConfig>
  <!-- Use any data source of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <!-- format can be text|xml|html|none. It is the format in which the body
         (the implicit 'text' field) is emitted. The default is 'text' if not
         specified; format="none" means the body is not emitted. -->
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
            url="${some.var.goes.here}" format="text">
      <!-- Do the appropriate mapping here; meta="true" marks a metadata field -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor; map it appropriately -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code}

With format=xml|html, an XPathEntityProcessor can be nested inside this entity. This may help
users extract more deeply nested data from a file. It is even possible to create multiple
documents from a single file.
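
A rough sketch of such nesting, which is not part of the attached patches: it assumes the inner
entity can read the emitted 'text' field through a FieldReaderDataSource, the way DIH already
lets one entity parse a field produced by its parent. The entity names (tika, page), the data
source names (bin, fld), the column name pageText and both XPath expressions are hypothetical.
The XPaths in particular assume that the emitted XHTML wraps each page in a div under body,
which depends on the parser Tika picks for the file.

{code:xml}
<dataConfig>
  <!-- Binary source for the outer Tika entity -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <!-- Lets the inner entity read a field of its parent row as a character stream -->
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <!-- rootEntity="false": documents are created per row of the nested entity,
         so a single file can yield several documents -->
    <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
            tikaConfig="tikaconfig.xml" url="${some.var.goes.here}"
            format="xml" rootEntity="false">
      <!-- Assumed layout: one div per page in the XHTML body emitted as 'text' -->
      <entity name="page" processor="XPathEntityProcessor" dataSource="fld"
              dataField="tika.text" forEach="/html/body/div">
        <field column="pageText" xpath="/html/body/div/p"/>
      </entity>
    </entity>
  </document>
</dataConfig>
{code}

If the layout assumption holds, each div matched by forEach would become one Solr document.
Since a page usually has several paragraphs, pageText would need to map to a multiValued field.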

      was (Author: noble.paul):
    Let us provide a new TikaEntityProcessor:

{code:xml}
<dataConfig>
  <!-- Use any data source of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <!-- format can be text|xml|html. The implicit field 'text' will have that
         format. The default is 'text' if not specified. -->
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
            url="${some.var.goes.here}" format="text">
      <!-- Do the appropriate mapping here; meta="true" marks a metadata field -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor; map it appropriately -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code}

With format=xml|html, an XPathEntityProcessor can be nested inside this entity. This may help
users extract more deeply nested data from a file. It is even possible to create multiple
documents from a single file.
  
> Integration of Tika and DataImportHandler
> -----------------------------------------
>
>                 Key: SOLR-1358
>                 URL: https://issues.apache.org/jira/browse/SOLR-1358
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Sascha Szott
>            Assignee: Noble Paul
>         Attachments: SOLR-1358.patch, SOLR-1358.patch, SOLR-1358.patch
>
>
> At the moment, it's impossible to configure Solr such that it builds up documents by using
> data that comes from both PDF documents and database table columns. Currently, to accomplish
> this task, it's up to the user to add some preprocessing that converts PDF files into plain
> text files. Therefore, I would like to see an integration of Solr Cell into DIH that makes
> that preprocessing obsolete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

