lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Need help importing OOXML custom properties into Solr
Date Tue, 18 Mar 2014 09:13:23 GMT
Have you tried running Tika directly on the file to see what it outputs?
The property may be there, just under a prefixed name. Alternatively,
send one sample file straight to the extract handler and temporarily
store the ignored_* dynamicField to see what actually comes through.
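
For the extract handler route, here is a minimal Python sketch. It is
not taken from the setup in this thread: it assumes the stock
/update/extract handler is enabled in solrconfig.xml and that the core
lives at http://localhost:8983/solr/collection1 (both assumptions,
adjust to your install). Instead of storing the ignored_* fields, it
uses extractOnly=true, which returns what Tika extracted, metadata
names included, without indexing anything. The function names are made
up for illustration.

```python
import json
import urllib.parse
import urllib.request


def extract_only_url(core_url):
    """Build an extract-handler URL that parses a file but does not index it."""
    # extractOnly=true asks Solr to return Tika's output instead of indexing it.
    params = {"extractOnly": "true", "wt": "json", "indent": "true"}
    return core_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def fetch_tika_metadata(core_url, path):
    """POST one file to the extract handler and return the parsed JSON response."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        extract_only_url(core_url),
        data=body,  # supplying data makes this a POST with the raw file as the body
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_tika_metadata("http://localhost:8983/solr/collection1",
"/tmp/docs/ff-1923-12.docx") and printing the result should list every
metadata name Tika reports for that file, which can then be compared
with the column names in data-config.xml.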

Basically, check what is there before trying to figure out what is not
there. In a multi-step chain of actions, that is often the faster route.
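
To pin down the file side of "what is there", a small Python sketch can
list the custom properties stored in the document itself, so the exact
names can be compared with whatever Tika or Solr reports. It assumes
the standard OOXML layout, where custom properties live in
docProps/custom.xml inside the zip container; custom_properties is a
hypothetical helper name, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_PART = "docProps/custom.xml"
NS = {"cp": "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"}


def custom_properties(docx):
    """Return {name: value} for the custom properties of an OOXML file.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        if CUSTOM_PART not in archive.namelist():
            return {}  # the document defines no custom properties
        root = ET.fromstring(archive.read(CUSTOM_PART))
    props = {}
    for prop in root.findall("cp:property", NS):
        # Each <property> wraps a single typed value element such as <vt:lpwstr>.
        value = next(iter(prop), None)
        props[prop.get("name")] = value.text if value is not None else None
    return props
```

Running custom_properties("/tmp/docs/ff-1923-12.docx") on the test
document should include "Testmeta" among the keys if the property was
saved as described.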

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, Mar 18, 2014 at 3:59 PM, Anders Gustafsson
<Anders.Gustafsson@pedago.fi> wrote:
> solr-spec 4.6.1
> lucene-spec 4.6.0
> lux-appserver 1.1.0
> tika 1.4
> poi 3.9
>
> Hi!
>
> I set it up, pretty much following the instructions at
> http://www.codewrecks.com/blog/index.php/2013/05/25/import-folder-of-documents-with-apache-solr-4-0-and-tika/
>
> The problem is that I cannot seem to import custom properties. I.e., I
> created a Word 2013 doc with a custom property called "Testmeta". It is
> visible in custom.xml if I open up the OOXML file in WinZip. I then
> tried to map it for import in data-config.xml:
>
> <dataConfig>
>     <dataSource type="BinFileDataSource" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>                 processor="FileListEntityProcessor"
>                 baseDir="/tmp/docs" fileName=".*\.(doc|docx|pdf)"
>                 onError="skip"
>                 recursive="true">
>             <field column="fileAbsolutePath" name="lux_uri" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>
>             <entity
>                 name="documentImport"
>                 processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}"
>                 format="text">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="text"/>
>                 <field column="Testmeta" name="Testmeta" meta="true"/>
>                 <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> and schema.xml:
>
> <field name="Testmeta" type="text" indexed="true" stored="true" />
>
> Still I see no mention of the field when I do an import (below).
> According to https://issues.apache.org/jira/browse/TIKA-695 it should
> work. But I see no mention of any special config that needs to be done.
>
>
> Any help appreciated!
>
>   "mode": "debug",
>   "documents": [
>     {
>       "size": [
>         14516
>       ],
>       "lastModified": [
>         "2014-03-18T06:53:14Z"
>       ],
>       "lux_uri": [
>         "/tmp/docs/ff-1923-12.docx"
>       ],
>       "text": [
>         "Förordning ........."
>       ],
>       "title": [
>         "Förordning ........"
>       ],
>       "author": [
>         "Lagberedningen"
>       ],
>       "_version_": [
>         1462902187294195700
>       ]
>     }
>   ],
>
> --
> Anders Gustafsson
> Engineer, CNI, CNE6, ASE
> Pedago, The Aaland Islands (N60 E20)
> www.pedago.fi
> phone +358 18 12060
> mobile +358 40506 7099
>
