manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Job definition metadata with multiple path attribute names
Date: Mon, 08 Jun 2015 11:15:20 GMT

Hi Vigi,

bq. I think the easiest would be to be able to define multiple mappings
from source metadata fields to destination metadata fields, using regular
expressions. Maybe there could be some other use cases besides regexes.
What you have right now on version 2.0.2 is very good, except that it only
allows one mapping.

The Metadata Transformation Connector patch allows for multiple mappings,
all different, to multiple destination fields.

bq. In fact, the most generic use case would be to be able to apply custom
transformations from the metadata fields provided by an input connector
into other output connector metadata fields.

That is exactly what the Metadata Transformation Connector does.

Karl

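To illustrate the idea of several regex mappings feeding several destination fields, here is a minimal Java sketch. It is not the Metadata Transformation Connector's code or API; the class name, the method names, the "folder" and "extension" field names, and the forward-slash sample path are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF.
class PathMappings {
    // Applies every mapping (destination field -> regex) to one source value.
    // The first capturing group of each regex becomes the destination value;
    // mappings that do not match are simply skipped.
    public static Map<String, String> apply(String source, Map<String, Pattern> mappings) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : mappings.entrySet()) {
            Matcher m = e.getValue().matcher(source);
            if (m.find()) {
                out.put(e.getKey(), m.group(1));
            }
        }
        return out;
    }

    // Two example mappings over the same path (field names are made up).
    public static Map<String, Pattern> exampleMappings() {
        Map<String, Pattern> mappings = new LinkedHashMap<>();
        // Second path segment, e.g. "projects" in "share/projects/report.pdf"
        mappings.put("folder", Pattern.compile("^[^/]+/([^/]+)/"));
        // Text after the last dot of the final segment
        mappings.put("extension", Pattern.compile("\\.([^./]+)$"));
        return mappings;
    }

    public static void main(String[] args) {
        // Prints {folder=projects, extension=pdf}
        System.out.println(apply("share/projects/report.pdf", exampleMappings()));
    }
}
```

Each mapping is independent of the others, so adding a third destination field only means adding one more entry to the map.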


On Mon, Jun 8, 2015 at 7:02 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:

> Hello Karl,
>
> I think the easiest would be to be able to define multiple mappings from
> source metadata fields to destination metadata fields, using regular
> expressions. Maybe there could be some other use cases besides regexes.
> What you have right now on version 2.0.2 is very good, except that it only
> allows one mapping. This sort of transformation could probably be useful
> for other types of repository connections as well.
>
> In fact, the most generic use case would be to be able to apply custom
> transformations from the metadata fields provided by an input connector
> into other output connector metadata fields.
>
> It would also be very useful to know which metadata fields each connector
> makes available. I think I have already asked you about some details on
> the Tika transformation connector.
>
> Keep in touch,
> vigi
>
> ------------------------------
> Date: Sat, 6 Jun 2015 06:07:02 -0400
> Subject: Re: Job definition metadata with multiple path attribute names
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
>
> I attached a patch to CONNECTORS-1209.  I have not tested it yet.
> Hopefully there will be time to do that later in the weekend.
>
> Karl
>
>
> On Fri, Jun 5, 2015 at 10:03 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Created CONNECTORS-1209 for this functionality.
>
> It's not hard to do, technically, but I need to define a language to
> describe the regex and what you would want to extract.  For instance, right
> now you specify a field value in terms of another field value like this:
>
> stringstringstring${otherfieldname}stringstring
>
> I'd be putting additional specification into ${otherfieldname}, something
> like this:
>
> stringstringstring${otherfieldname:([1234567890]*)}stringstring
>
> ... which would extract the first number from the metadata value.  But
> since ":" may well be part of a field name right now, I'd need to do
> something other than that, and I'd want to be able to support more complex
> regexps as well.
>
> Karl
>
>
> On Fri, Jun 5, 2015 at 9:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Vigi,
>
> I do understand your issue, but I'd propose a general solution of adding
> new functionality to the Metadata Transformer to achieve your goal.  So the
> setup would be this:
>
> - Use the JCIFS connector Metadata tab to just include the entire path in
> the metadata
> - Use the Metadata Transformer to generate two different pieces of
> metadata, using a new regular expression modification feature that I would
> write for you, if we can come up with a design for it
>
> You can write your own completely new transformation connector, but that's
> no different than what I propose, and not as useful.
>
> Thanks,
> Karl
>
>
>
> On Fri, Jun 5, 2015 at 9:17 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
>
> Dear Karl,
>
> Maybe I misunderstood the applications for the metadata tab but in my
> scenario I need to extract two types of information from a document's path.
> Right now I am only able to extract one piece of information and put it in
> Solr; it would have been very useful to be able to perform other
> transformations to the paths but it's OK, I can probably write a
> transformation connector of my own.
>
> Thanks,
> vigi
> ------------------------------
> Date: Fri, 5 Jun 2015 09:02:59 -0400
> Subject: Re: Job definition metadata with multiple path attribute names
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
>
>
> Hi Vigi,
>
> You get, for free, the file name of the document as metadata, from all
> repository connectors, including the jcifs connector:
>
> >>>>>>
>                   rd.setFileName(fileNameString);
> <<<<<<
>
> The problem is that this is not something you can manipulate in MCF via
> regular expression with the current bevy of supplied transformation
> connectors, because (a) it isn't generic metadata but a fixed property of
> the document, and (b) the Metadata Transformer connector doesn't allow you
> to slice and dice metadata in any case, just compose it into bigger strings.
>
> So you're stuck with either writing a document transformation connector of
> your own, which does what you want, or proposing additional functionality
> for the Metadata Transformer.  If it can be done in a backwards compatible
> way, this is something I would support.
>
> I'm not thrilled with the idea of extending the JCIFS connector to build
> multiple independent attributes all from the path; the UI for this
> connector is already quite complex, and the functionality for generically
> manipulating metadata would be useful in general anyway.
>
> Karl
>
>
> On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
>
> Hello guys,
>
> I have another ManifoldCF 2.0.2 question. Our process consists of indexing
> some documents from a Windows Share and sending them to Solr. I would like
> to extract some information from the documents and put it into specific
> Solr fields. For example, based on the id of the document I am currently
> extracting a specific folder name (using regular expressions on the
> metadata tab of the job definition) and storing it into Solr; this
> works fine.
>
> However, I also want to extract the file extension (using regex) and send
> it to Solr but I am not able to add more than one path attribute name on
> the Metadata tab of the job definition. I already have one that extracts a
> particular folder name from the file path and I would need a second one for
> the file extension.
>
> How would I be able to achieve this?
>
> Regards,
> vigi
>

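The `${otherfieldname}` and `${otherfieldname:regex}` forms discussed in the thread can be sketched as a small template expander. This is only an illustration in Java under stated assumptions, not the syntax or code that ended up in CONNECTORS-1209: the class and method names are hypothetical, the ":" separator is the provisional one from the message (which, as noted there, clashes with field names containing ":"), and the embedded regex may not contain "}".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not ManifoldCF code.
class TemplateSketch {
    // Matches ${name} or ${name:regex}. Group 1 is the field name,
    // group 2 (optional) is the regex applied to that field's value.
    private static final Pattern REF =
        Pattern.compile("\\$\\{([^}:]+)(?::([^}]*))?\\}");

    public static String expand(String template, Map<String, String> fields) {
        Matcher ref = REF.matcher(template);
        StringBuffer out = new StringBuffer();
        while (ref.find()) {
            // Unknown fields expand to the empty string (an assumption).
            String value = fields.getOrDefault(ref.group(1), "");
            String regex = ref.group(2);
            if (regex != null) {
                Matcher m = Pattern.compile(regex).matcher(value);
                // Keep capturing group 1 of the first match, else empty string.
                boolean ok = m.find() && m.groupCount() >= 1 && m.group(1) != null;
                value = ok ? m.group(1) : "";
            }
            // quoteReplacement stops "$" and "\" in the value being treated as syntax.
            ref.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        ref.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("path", "docs/invoice-2015-final.pdf");
        // Plain substitution: prints file:docs/invoice-2015-final.pdf
        System.out.println(expand("file:${path}", fields));
        // Regex extraction: prints year-2015-x
        System.out.println(expand("year-${path:([0-9]+)}-x", fields));
    }
}
```

The example uses `([0-9]+)` rather than `([1234567890]*)`: with a `*` quantifier the first match found can be an empty one at the start of the value, so `+` is needed to skip ahead to the first run of digits.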