manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Job definition metadata with multiple path attribute names
Date Mon, 08 Jun 2015 11:25:29 GMT
Hi Vigi,

You can read up on the Metadata Adjuster Transformation Connector here:

https://manifoldcf.apache.org/release/release-2.1/en_US/end-user-documentation.html#metadataadjuster

I've also just added the following to the documentation for it:
>>>>>>
                <p>You can also use regular expressions in the substitution
string, for example: "${there|[0-9]*}", which will extract the first
sequence of digits it finds in the value of the field "there", or
"${there|string(.*)|1}", which will include everything following "string"
in the field value.  (The third argument specifies the regular
expression group number, with an optional suffix of "l" or "u" meaning
lower-case or upper-case, respectively.)</p>
<<<<<<
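
To make the regular-expression semantics concrete, here is a minimal sketch in plain java.util.regex. It is not the connector's code: the class name, the extract() helper, and the sample values are all hypothetical, and it only assumes that the three parts of the substitution correspond to a field value, a regular expression, and a group number with an optional "l"/"u" suffix. The capture group is written as (.*), the form java.util.regex accepts. With java.util.regex, a pattern like [0-9]* can match the empty string at the very start of a value that does not begin with a digit, so the sketch uses [0-9]+ to pull out the first run of digits.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSubstitutionSketch {

    /**
     * Hypothetical helper (not ManifoldCF code): finds the first match of
     * regex in value and returns the requested group.
     * caseFlag 'l' lower-cases the result, 'u' upper-cases it, and any
     * other character leaves it unchanged.
     * Returns "" when nothing matches or the group did not participate.
     */
    static String extract(String value, String regex, int group, char caseFlag) {
        Matcher m = Pattern.compile(regex).matcher(value);
        if (!m.find()) {
            return "";
        }
        String result = m.group(group);
        if (result == null) {
            return "";
        }
        switch (caseFlag) {
            case 'l': return result.toLowerCase(Locale.ROOT);
            case 'u': return result.toUpperCase(Locale.ROOT);
            default:  return result;
        }
    }

    public static void main(String[] args) {
        // Group 0 is the whole match: the first run of digits.
        System.out.println(extract("invoice 4711 rev 2", "[0-9]+", 0, ' '));

        // Group 1 is everything after the literal "string", upper-cased.
        System.out.println(extract("prefix-string-Tail", "string(.*)", 1, 'u'));

        // A file extension taken from a full path, lower-cased.
        System.out.println(extract("/share/Projects/Report.PDF", "\\.([^./]+)$", 1, 'l'));
    }
}
```

The last call corresponds to the original question in this thread: with the entire path available as a metadata value, one expression can pick out a folder name and a second one the file extension.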

Karl




On Mon, Jun 8, 2015 at 7:15 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Vigi,
>
> bq. I think the easiest would be to be able to define multiple mappings
> from source metadata fields to destination metadata fields, using regular
> expressions. Maybe there could be some other use cases besides regexes.
> What you have right now on version 2.0.2 is very good, except that it only
> allows one mapping.
>
> The Metadata Transformation Connector patch allows for multiple mappings,
> all different, to multiple destination fields.
>
> bq. In fact, the most generic use case would be to be able to apply custom
> transformations from the metadata fields provided by an input connector
> into other output connector metadata fields.
>
> That is exactly what the Metadata Transformation Connector does.
>
> Karl
>
>
>
> On Mon, Jun 8, 2015 at 7:02 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
>
>> Hello Karl,
>>
>> I think the easiest would be to be able to define multiple mappings from
>> source metadata fields to destination metadata fields, using regular
>> expressions. Maybe there could be some other use cases besides regexes.
>> What you have right now on version 2.0.2 is very good, except that it only
>> allows one mapping. Probably this sort of transformation could be useful
>> for other types of repository connections as well.
>>
>> In fact, the most generic use case would be to be able to apply custom
>> transformations from the metadata fields provided by an input connector
>> into other output connector metadata fields.
>>
>> It would also be very useful to know which metadata fields are available
>> on the connectors. I think I have already asked you about some details on
>> the Tika transformation connector.
>>
>> Keep in touch,
>> vigi
>>
>> ------------------------------
>> Date: Sat, 6 Jun 2015 06:07:02 -0400
>>
>> Subject: Re: Job definition metadata with multiple path attribute names
>> From: daddywri@gmail.com
>> To: user@manifoldcf.apache.org
>>
>> I attached a patch to CONNECTORS-1209.  I have not tested it yet.
>> Hopefully there will be time to do that later in the weekend.
>>
>> Karl
>>
>>
>> On Fri, Jun 5, 2015 at 10:03 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Created CONNECTORS-1209 for this functionality.
>>
>> It's not hard to do, technically, but I need to define a language to
>> describe the regex and what you would want to extract.  For instance, right
>> now you specify a field value in terms of another field value like this:
>>
>> stringstringstring${otherfieldname}stringstring
>>
>> I'd be putting additional specification into ${otherfieldname}, something
>> like this:
>>
>> stringstringstring${otherfieldname:([1234567890]*)}stringstring
>>
>> ... which would extract the first number from the metadata value.  But
>> since ":" may well be part of a field name right now, I'd need to do
>> something other than that, and I'd want to be able to support more complex
>> regexps as well.
>>
>> Karl
>>
>>
>> On Fri, Jun 5, 2015 at 9:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Hi Vigi,
>>
>> I do understand your issue, but I'd propose a general solution of adding
>> new functionality to the Metadata Transformer to achieve your goal.  So the
>> setup would be this:
>>
>> - Use the JCIFS connector Metadata tab to just include the entire path in
>> the metadata
>> - Use the Metadata Transformer to generate two different pieces of
>> metadata, using a new regular expression modification feature that I would
>> write for you, if we can come up with a design for it
>>
>> You can write your own completely new transformation connector, but
>> that's no different than what I propose, and not as useful.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Fri, Jun 5, 2015 at 9:17 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
>>
>> Dear Karl,
>>
>> Maybe I misunderstood the applications for the metadata tab but in my
>> scenario I need to extract two types of information from a document's path.
>> Right now I am only able to extract one piece of information and put it in
>> Solr; it would have been very useful to be able to perform other
>> transformations to the paths but it's OK, I can probably write a
>> transformation connector of my own.
>>
>> Thanks,
>> vigi
>> ------------------------------
>> Date: Fri, 5 Jun 2015 09:02:59 -0400
>> Subject: Re: Job definition metadata with multiple path attribute names
>> From: daddywri@gmail.com
>> To: user@manifoldcf.apache.org
>>
>>
>> Hi Vigi,
>>
>> You get, for free, the file name of the document as metadata, from all
>> repository connectors, including the jcifs connector:
>>
>> >>>>>>
>>                   rd.setFileName(fileNameString);
>> <<<<<<
>>
>> The problem is that this is not something you can manipulate in MCF via
>> regular expression with the current bevy of supplied transformation
>> connectors, because (a) it isn't generic metadata but a fixed property of
>> the document, and (b) the Metadata Transformer connector doesn't allow you
>> to slice and dice metadata in any case, just compose it into bigger strings.
>>
>> So you're stuck with either writing a document transformation connector
>> of your own, which does what you want, or proposing additional
>> functionality for the Metadata Transformer.  If it can be done in a
>> backwards compatible way, this is something I would support.
>>
>> I'm not thrilled with the idea of extending the JCIFS connector to build
>> multiple independent attributes all from the path; the UI for this
>> connector is already quite complex, and the functionality for generically
>> manipulating metadata would be useful in general anyway.
>>
>> Karl
>>
>>
>> On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:
>>
>> Hello guys,
>>
>> I have another ManifoldCF 2.0.2 question. Our process consists of
>> indexing some documents from a Windows Share and sending them to Solr. I
>> would like to extract some information from the documents and put it into
>> specific Solr fields. For example, based on the id of the document I am
>> currently extracting a specific folder name (using regular expressions on
>> the metadata tab of the job definition) and storing it into Solr; this
>> works fine.
>>
>> However, I also want to extract the file extension (using regex) and send
>> it to Solr but I am not able to add more than one path attribute name on
>> the Metadata tab of the job definition. I already have one that extracts a
>> particular folder name from the file path and I would need a second one for
>> the file extension.
>>
>> How would I be able to achieve this?
>>
>> Regards,
>> vigi
>>
>>
>>
>>
>>
>>
>
