manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: How to determine the set of all possible fields in MCF output?
Date Sun, 15 Oct 2017 14:11:29 GMT
Hi Phil,

In most cases you can't modify the fields being output by the various
connectors, but you don't have to use them.  If you have an output
connector that *insists* on using all of them in a destructive way, we'd
like to know about that.  Usually extra fields are harmless and only the
ones you want in your schema are looked for.
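
To make the "extra fields are harmless" point concrete, here is a minimal Python sketch of the idea. It is not ManifoldCF code and uses no MCF API: documents are modeled as plain dicts mapping a field name to a list of values, and the function names, the allowlist, and the field names in the usage note are all made up for illustration. The first function collects every field name seen in a sample of documents; the second keeps only the fields you decided on in advance and ignores the rest.

```python
from typing import Dict, Iterable, List, Set

# A document is modeled as field name -> list of values (an assumption,
# chosen only for this sketch; it is not an MCF data structure).
Document = Dict[str, List[str]]

# Hypothetical allowlist: replace with the fields your schema defines.
ALLOWED_FIELDS: Set[str] = {"title", "author", "content"}


def observed_fields(documents: Iterable[Document]) -> Set[str]:
    """Return the union of all field names seen across the documents."""
    names: Set[str] = set()
    for doc in documents:
        names.update(doc.keys())
    return names


def keep_allowed(doc: Document, allowed: Set[str] = ALLOWED_FIELDS) -> Document:
    """Return a copy of doc containing only the allowed fields."""
    return {name: values for name, values in doc.items() if name in allowed}
```

For example, given two sample documents with the illustrative fields `title`, `Content-Type`, `author` and `exif:Model`, `observed_fields` would return all four names, while `keep_allowed` with an allowlist of `title` and `author` would drop the other two and leave the rest of the document untouched.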


On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <> wrote:

> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <> wrote:
> > Hi Phil,
> >
> > You are correct in asserting that in MCF it is the sum total of all the
> > connections that the document passes through that determine its attribute
> > set.  That includes transformation connections as well as the repository
> > connection.
> OK, sounds good.
> > Tika is one connection that does add a lot of fields and these depend not
> > only on the configuration of the Tika connection, but also on the kind of
> > document being extracted.  If you want to figure out the sum total of
> > what's possible, you will need to consult the Tika documentation.  And
> > yes, the field names Tika generates are created based on what Tika finds
> > in the document.
> Gotcha.   So if I want to limit the fields output to *only* a specific
> set that is determined in advance, is there a way to accomplish that?
> > Alternatively, you can configure your job to send output to a null output
> > connection.  This connection records all attribute information for each
> > document in the simple history, so you can get an idea what to expect.
> Excellent, I'll investigate that.
> > I'm a little confused about your statement that Tika runs even when it's
> > not in a job's pipeline.  That's not actually true, so I'm wondering what
> > you are seeing.
> It's probable that I'm wrong.  I just thought maybe there was some
> default behavior, because I pointed MCF at a directory full of PDFs
> without explicitly configuring Tika and I saw fields in the output
> that I thought were probably generated by Tika.  Likewise now I am
> running a pipeline with no explicit Tika step and I see output fields
> for EXIF stuff for images and the like, which I assumed came from
> Tika.
> Phil
