manifoldcf-user mailing list archives

From Phillip Rhodes <motley.crue....@gmail.com>
Subject How to determine the set of all possible fields in MCF output?
Date Sat, 14 Oct 2017 22:39:16 GMT
Hi all, I've been working with MCF the past few days and am very happy
with what it lets me do, and I have a pipeline going from my
repository to Solr which works fine.  But there is one point I clearly
don't understand, which is:

How do you know exactly what fields are going to be output in a given
configuration?  I found that I had to resort to trial and error to
tweak my Solr schema to avoid "undefined field xxxxx" errors from
Manifold when trying to write to Solr.  Now to be fair, clearly I
could just ignore any fields I don't specifically know I want, but I'd
like to understand how this works.

Is it the case that the initial set of fields depends on the
repository connector?  I found that I seemed to get some Alfresco
specific stuff when reading from Alfresco, as opposed to what I got
from a simple dummy file-system repo I was initially experimenting
with.

It also seems that Tika adds some fields (actually a lot of fields)
even when you don't have a Tika transform wired in explicitly?  Is it
the case that you need to put in an explicit Tika transform if you
want to control which fields are contributed by Tika?

And on that point, is there a master list of possible fields that Tika
will emit, or is Tika just transforming the names of metadata fields
in the documents it encounters, and programmatically generating a
field name?


Any and all help on understanding how this works is greatly appreciated...


Phil
~~~~
This message optimized for indexing by NSA PRISM
