manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: How to determine the set of all possible fields in MCF output?
Date Tue, 24 Oct 2017 07:31:02 GMT
Hi Phil,

Solr will certainly skip any fields that it doesn't know about and simply
not save them.  There's little cost to having them pass through MCF; the
big cost is extraction, which you're stuck with because Alfresco does it no
matter what.  So I'm not sure what a white-list transformer does for you.

But in any case, there's already a transformer that allows you to map
metadata around -- the Metadata Adjuster.  See:

http://manifoldcf.apache.org/release/release-2.8.1/en_US/end-user-documentation.html#metadataadjuster

This transformer maps metadata values, allows you to insert new ones, and
also allows you to ONLY pass through the ones that are explicitly specified
if you wish.

Thanks,
Karl
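
As a rough illustration of the whitelist idea discussed in this thread, the sketch below parses "Found property" log lines like the ones Phil quotes, lists every field name seen, and keeps only an explicitly allowed set. It is standalone Python, not MCF code and not how the Metadata Adjuster is implemented. The function names are hypothetical, and the ':' to '_' renaming is an assumption inferred from exif:fNumber appearing as exif_fNumber on the Solr side.

```python
import re

# Matches the log lines produced by the extra logging quoted in the thread,
# e.g. "Found property exif:fNumber = 3.5".
PROPERTY_LINE = re.compile(r"^Found property (\S+) = (.*)$")


def parse_properties(log_text):
    """Return {property_name: value} for every 'Found property' line."""
    props = {}
    for line in log_text.splitlines():
        match = PROPERTY_LINE.match(line.strip())
        if match:
            props[match.group(1)] = match.group(2)
    return props


def to_output_field(name):
    # Assumption: the prefix separator ':' becomes '_' in the output field
    # name, inferred from exif:fNumber showing up as exif_fNumber.
    return name.replace(":", "_")


def keep_whitelisted(props, allowed):
    """Keep only properties whose output field name is in `allowed`."""
    return {
        to_output_field(name): value
        for name, value in props.items()
        if to_output_field(name) in allowed
    }


sample = """\
Found property exif:yResolution = 72.0
Found property cm:owner = admin
Found property exif:fNumber = 3.5
Found property sys:locale = en_GB
"""

props = parse_properties(sample)
# Every field name observed in the log, as it would appear in the output.
all_fields = sorted(to_output_field(name) for name in props)
# Only the fields that were explicitly allowed.
kept = keep_whitelisted(props, allowed={"cm_owner", "sys_locale"})
print(all_fields)
print(kept)
```

Running the parser over a full crawl log gives an empirical list of the fields a job actually produces, which can then be used to decide which names to pass through.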


On Mon, Oct 23, 2017 at 9:19 PM, Phillip Rhodes <motley.crue.fan@gmail.com>
wrote:

> FWIW, I now understand what I was missing that made me think Manifold
> was running Tika when it wasn't.  It turns out that Alfresco uses Tika
> internally and when you get a document from Alfresco (using the
> Webscripts connector anyway) the set of fields you get includes all
> the image metadata and what-not (for image files).  I never realized
> this because I don't typically use Alfresco for images.  But when I
> added extra logging to the Alfresco WebScripts connector code, to spit
> out the incoming field set, I see things like:
>
> Found property exif:yResolution = 72.0
> Found property cm:owner = admin
> Found property exif:isoSpeedRatings = 400
> Found property exif:fNumber = 3.5
> Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
> Found property exif:pixelYDimension = 2048
> Found property exif:resolutionUnit = Inch
> Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
> Found property sys:locale = en_GB
>
> which explains why the Solr connector was trying to save fields like
> exif_fNumber and exif_resolutionUnit.   This came up because the
> Alfresco instance I'm experimenting with has their default sample
> workspace which includes images and things I don't normally touch.
> :-)
>
> As for managing all this so my history doesn't contain all those
> failure messages, I thought about creating a "WhitelistFieldTransform"
> as a transform connection to drop any fields other than the ones that
> are whitelisted.    Two questions:
>
> 1. Does this seem like a reasonable approach, or is there a better way?
>
> 2. If this is reasonable and I create such a filter, would there be
> any interest in having it contributed back to MCF?
>
>
> Cheers,
>
>
> Phil
>
> This message optimized for indexing by NSA PRISM
>
>
> On Sun, Oct 15, 2017 at 10:11 AM, Karl Wright <daddywri@gmail.com> wrote:
> > Hi Phil,
> >
> > In most cases you can't modify the fields being output by the various
> > connectors, but you don't have to use them.  If you have an output
> > connector that *insists* on using all of them in a destructive way, we'd
> > like to know about that.  Usually extra fields are harmless and only the
> > ones you want in your schema are looked for.
> >
> > Karl
> >
> >
> > On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <motley.crue.fan@gmail.com>
> > wrote:
> >>
> >> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> > Hi Phil,
> >> >
> >> > You are correct in asserting that in MCF it is the sum total of all the
> >> > connections that the document passes through that determine its
> >> > attribute set.  That includes transformation connections as well as the
> >> > repository connection.
> >>
> >> OK, sounds good.
> >>
> >> > Tika is one connection that does add a lot of fields and these depend
> >> > not only on the configuration of the Tika connection, but also on the
> >> > kind of document being extracted.  If you want to figure out the sum
> >> > total of what's possible, you will need to consult the Tika
> >> > documentation.  And yes, the field names Tika generates are created
> >> > based on what Tika finds in the document.
> >>
> >> Gotcha.   So if I want to limit the fields output to *only* a specific
> >> set that is determined in advance, is there a way to accomplish that?
> >>
> >> > Alternatively, you can configure your job to send output to a null
> >> > output connection.  This connection records all attribute information
> >> > for each document in the simple history, so you can get an idea what to
> >> > expect.
> >>
> >> Excellent, I'll investigate that.
> >>
> >> > I'm a little confused about your statement that Tika runs even when
> >> > it's not in a job's pipeline.  That's not actually true, so I'm
> >> > wondering what you are seeing.
> >>
> >> It's probable that I'm wrong.  I just thought maybe there was some
> >> default behavior, because I pointed MCF at a directory full of PDFs
> >> without explicitly configuring Tika and I saw fields in the output
> >> that I thought were probably generated by Tika.  Likewise now I am
> >> running a pipeline with no explicit Tika step and I see output fields
> >> for EXIF stuff for images and the like, which I assumed came from
> >> Tika.
> >>
> >>
> >>
> >> Phil
> >
> >
>
