manifoldcf-user mailing list archives

From Phillip Rhodes <>
Subject Re: How to determine the set of all possible fields in MCF output?
Date Tue, 24 Oct 2017 01:19:11 GMT
FWIW, I now understand what I was missing that made me think Manifold
was running Tika when it wasn't.  It turns out that Alfresco uses Tika
internally, and when you get a document from Alfresco (using the
WebScripts connector, anyway) the set of fields you get includes all
the image metadata and whatnot (for image files).  I never realized
this because I don't typically use Alfresco for images.  But when I
added extra logging to the Alfresco WebScripts connector code to spit
out the incoming field set, I saw things like:

Found property exif:yResolution = 72.0
Found property cm:owner = admin
Found property exif:isoSpeedRatings = 400
Found property exif:fNumber = 3.5
Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
Found property exif:pixelYDimension = 2048
Found property exif:resolutionUnit = Inch
Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
Found property sys:locale = en_GB

which explains why the Solr connector was trying to save fields like
exif_fNumber and exif_resolutionUnit.  This came up because the
Alfresco instance I'm experimenting with has its default sample
workspace, which includes images and other things I don't normally touch.

As for managing all this so my history doesn't fill up with those
failure messages, I thought about creating a "WhitelistFieldTransform"
as a transformation connection that drops any fields other than the
ones that are whitelisted.  Two questions:

1. Does this seem like a reasonable approach, or is there a better way?

2. If this is reasonable and I create such a filter, would there be
any interest in having it contributed back to MCF?
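
To make the idea concrete, here is a minimal sketch of just the
filtering step, assuming the document's fields can be viewed as a map
from field name to values.  The class and method names are hypothetical,
and this deliberately does not use any ManifoldCF connector API; wiring
it into a real transformation connection is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not part of ManifoldCF): keeps only whitelisted fields.
 * Assumes fields are represented as a map of field name to list of values.
 */
final class WhitelistFieldFilter {
    private WhitelistFieldFilter() {}

    /**
     * Returns a new map containing only the entries whose key is in the
     * whitelist. The input map is not modified; iteration order is preserved.
     */
    static Map<String, List<String>> filter(Map<String, List<String>> fields,
                                            Set<String> whitelist) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            // Drop anything not explicitly allowed, e.g. the exif:* properties.
            if (whitelist.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

With a whitelist of cm:owner and sys:locale, the exif:* properties shown
above would be dropped before they ever reach the output connection.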



This message optimized for indexing by NSA PRISM

On Sun, Oct 15, 2017 at 10:11 AM, Karl Wright <> wrote:
> Hi Phil,
> In most cases you can't modify the fields being output by the various
> connectors, but you don't have to use them.  If you have an output connector
> that *insists* on using all of them in a destructive way, we'd like to know
> about that.  Usually extra fields are harmless and only the ones you want in
> your schema are looked for.
> Karl
> On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <> wrote:
>> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <> wrote:
>> > Hi Phil,
>> >
>> > You are correct in asserting that in MCF it is the sum total of all the
>> > connections that the document passes through that determine its attribute
>> > set.  That includes transformation connections as well as the repository
>> > connection.
>> OK, sounds good.
>> > Tika is one connection that does add a lot of fields and these depend not
>> > only on the configuration of the Tika connection, but also on the kind of
>> > document being extracted.  If you want to figure out the sum total of
>> > what's possible, you will need to consult the Tika documentation.  And yes,
>> > the field names Tika generates are created based on what Tika finds in the
>> > document.
>> Gotcha.   So if I want to limit the fields output to *only* a specific
>> set that is determined in advance, is there a way to accomplish that?
>> > Alternatively, you can configure your job to send output to a null output
>> > connection.  This connection records all attribute information for each
>> > document in the simple history, so you can get an idea what to expect.
>> Excellent, I'll investigate that.
>> > I'm a little confused about your statement that Tika runs even when it's
>> > not in a job's pipeline.  That's not actually true, so I'm wondering what
>> > you are seeing.
>> It's probable that I'm wrong.  I just thought maybe there was some
>> default behavior, because I pointed MCF at a directory full of PDFs
>> without explicitly configuring Tika and I saw fields in the output
>> that I thought were probably generated by Tika.  Likewise, now I am
>> running a pipeline with no explicit Tika step and I see output fields
>> for EXIF stuff for images and the like, which I assumed came from
>> Tika.
>> Phil
