I attached a patch to CONNECTORS-1209.  I have not tested it yet.  Hopefully there will be time to do that later this weekend.


On Fri, Jun 5, 2015 at 10:03 AM, Karl Wright <daddywri@gmail.com> wrote:
Created CONNECTORS-1209 for this functionality.

It's not hard to do, technically, but I need to define a language to describe the regex and what you would want to extract.  For instance, right now you specify a field value in terms of another field value like this:


I'd be putting additional specification into ${otherfieldname}, something like this:


... which would extract the first number from the metadata value.  But since ":" may well be part of a field name right now, I'd need to do something other than that, and I'd want to be able to support more complex regexps as well.
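As a rough illustration of the extraction itself (not of any Metadata Transformer syntax, which is still undefined at this point in the thread), here is a minimal Python sketch that pulls the first number out of a metadata value.  The function name `first_number` and the sample values are hypothetical, and Python's `re` module stands in for whatever regex engine the connector would actually use.

```python
import re


def first_number(value):
    """Return the first run of digits in value, or None if there is none.

    Hypothetical helper for illustration only; it is not part of ManifoldCF.
    """
    match = re.search(r"\d+", value)
    return match.group(0) if match else None


# Example with made-up metadata values.
print(first_number("report-2015-06.pdf"))
print(first_number("no digits here"))
```

The result is returned as a string rather than an integer, on the assumption that metadata values are passed along as text.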


On Fri, Jun 5, 2015 at 9:33 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Vigi,

I do understand your issue, but I'd propose a general solution of adding new functionality to the Metadata Transformer to achieve your goal.  So the setup would be this:

- Use the JCIFS connector Metadata tab to just include the entire path in the metadata
- Use the Metadata Transformer to generate two different pieces of metadata, using a new regular expression modification feature that I would write for you, if we can come up with a design for it

You can write your own completely new transformation connector, but that's no different than what I propose, and not as useful.
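The two-step setup above amounts to applying two independent regular expressions to the same path value.  A minimal Python sketch of that idea follows; the patterns, the field names `folder` and `extension`, and the sample path are all assumptions made for illustration, and none of this reflects the actual Metadata Transformer feature.

```python
import re

# Hypothetical patterns: the second path segment, and the text after the last dot.
FOLDER_RE = re.compile(r"^[^/]+/([^/]+)/")
EXT_RE = re.compile(r"\.([^./]+)$")


def derive_fields(path):
    """Derive two separate metadata fields from one full path string."""
    fields = {}
    folder = FOLDER_RE.search(path)
    if folder:
        fields["folder"] = folder.group(1)
    ext = EXT_RE.search(path)
    if ext:
        # Lower-case the extension so "PDF" and "pdf" index identically.
        fields["extension"] = ext.group(1).lower()
    return fields


# Example with a made-up path.
print(derive_fields("share/projects/2015/report.PDF"))
```

Each field is simply omitted when its pattern does not match, so a path with no extension or no subfolder produces fewer fields instead of empty ones.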


On Fri, Jun 5, 2015 at 9:17 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
Dear Karl,

Maybe I misunderstood the applications for the metadata tab but in my scenario I need to extract two types of information from a document's path. Right now I am only able to extract one piece of information and put it in Solr; it would have been very useful to be able to perform other transformations to the paths but it's OK, I can probably write a transformation connector of my own.


Date: Fri, 5 Jun 2015 09:02:59 -0400
Subject: Re: Job definition metadata with multiple path attribute names
From: daddywri@gmail.com
To: user@manifoldcf.apache.org

Hi Vigi,

You get, for free, the file name of the document as metadata, from all repository connectors, including the jcifs connector:


The problem is that this is not something you can manipulate in MCF via regular expression with the current bevy of supplied transformation connectors, because (a) it isn't generic metadata but a fixed property of the document, and (b) the Metadata Transformer connector doesn't allow you to slice and dice metadata in any case, just compose it into bigger strings.

So you're stuck with either writing a document transformation connector of your own, which does what you want, or proposing additional functionality for the Metadata Transformer.  If it can be done in a backwards compatible way, this is something I would support.

I'm not thrilled with the idea of extending the JCIFS connector to build multiple independent attributes all from the path; the UI for this connector is already quite complex, and the functionality for generically manipulating metadata would be useful in general anyway.


On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
Hello guys,

I have another ManifoldCF 2.0.2 question. Our process consists of indexing some documents from a Windows Share and sending them to Solr. I would like to extract some information from the documents and put it into specific Solr fields. For example, based on the id of the document I am currently extracting a specific folder name (using regular expressions on the Metadata tab of the job definition) and storing it in Solr; this works fine.

However, I also want to extract the file extension (using regex) and send it to Solr but I am not able to add more than one path attribute name on the Metadata tab of the job definition. I already have one that extracts a particular folder name from the file path and I would need a second one for the file extension.

How would I be able to achieve this?