Date: Fri, 5 Jun 2015 09:33:37 -0400
Subject: Re: Job definition metadata with multiple path attribute names
From: Karl Wright
To: "user@manifoldcf.apache.org"
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11415d58e3aa960517c5565c

--001a11415d58e3aa960517c5565c
Content-Type: text/plain; charset=UTF-8

Hi Vigi,

I do understand your issue, but I'd propose a general solution of adding new functionality to the Metadata Transformer to achieve your goal. So the setup would be this:

- Use the JCIFS connector Metadata tab to just include the entire path in the metadata
- Use the Metadata Transformer to generate two different pieces of metadata, using a new regular expression modification feature that I would write for you, if we can come up with a design for it

You can write your own completely new transformation connector, but that's no different than what I propose, and not as useful.
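
To make the idea of deriving two attributes from one full path concrete, here is a minimal standalone Java sketch. It is not ManifoldCF code and not the Metadata Transformer feature proposed above, which would still have to be designed. The class name `PathMetadataSketch`, the attribute names `folder` and `extension`, the forward-slash separator, the sample path, and both regular expressions are assumptions chosen purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of ManifoldCF): derives several attributes from one path. */
final class PathMetadataSketch {
    // Assumption: the folder of interest is the first segment after the share name.
    private static final Pattern FOLDER = Pattern.compile("^[^/]+/([^/]+)/");
    // Extension: the characters after the last dot in the final path segment.
    private static final Pattern EXTENSION = Pattern.compile("\\.([^./]+)$");

    /** Applies each pattern to the same path and collects the first capture group. */
    static Map<String, String> extract(String path) {
        Map<String, String> metadata = new LinkedHashMap<>();
        putIfFound(metadata, "folder", FOLDER, path);
        putIfFound(metadata, "extension", EXTENSION, path);
        return metadata;
    }

    // Adds the attribute only when the pattern matches somewhere in the input.
    private static void putIfFound(Map<String, String> out, String name,
                                   Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        if (matcher.find()) {
            out.put(name, matcher.group(1));
        }
    }

    public static void main(String[] args) {
        // Illustrative path only; real paths depend on the share being crawled.
        System.out.println(extract("share/projects/2015/report.final.pdf"));
    }
}
```

The point of the sketch is only that one input value (the full path) can feed any number of independent regular expressions, each producing its own named attribute, which is what a single path attribute name on the Metadata tab cannot do.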

Thanks,
Karl

On Fri, Jun 5, 2015 at 9:17 AM, Virgiliu R wrote:

> Dear Karl,
>
> Maybe I misunderstood the applications for the Metadata tab, but in my scenario I need to extract two types of information from a document's path. Right now I am only able to extract one piece of information and put it in Solr; it would have been very useful to be able to perform other transformations on the paths, but it's OK, I can probably write a transformation connector of my own.
>
> Thanks,
> vigi
> ------------------------------
> Date: Fri, 5 Jun 2015 09:02:59 -0400
> Subject: Re: Job definition metadata with multiple path attribute names
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
>
> Hi Vigi,
>
> You get, for free, the file name of the document as metadata, from all repository connectors, including the JCIFS connector:
>
> >>>>>>
> rd.setFileName(fileNameString);
> <<<<<<
>
> The problem is that this is not something you can manipulate in MCF via regular expression with the current bevy of supplied transformation connectors, because (a) it isn't generic metadata but a fixed property of the document, and (b) the Metadata Transformer connector doesn't allow you to slice and dice metadata in any case, just compose it into bigger strings.
>
> So you're stuck with either writing a document transformation connector of your own, which does what you want, or proposing additional functionality for the Metadata Transformer. If it can be done in a backwards-compatible way, this is something I would support.
>
> I'm not thrilled with the idea of extending the JCIFS connector to build multiple independent attributes all from the path; the UI for this connector is already quite complex, and the functionality for generically manipulating metadata would be useful in general anyway.
>
> Karl
>
> On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R wrote:
>
> Hello guys,
>
> I have another ManifoldCF 2.0.2 question.
> Our process consists of indexing some documents from a Windows share and sending them to Solr. I would like to extract some information from the documents and put it into specific Solr fields. For example, based on the id of the document I am currently extracting a specific folder name (using regular expressions on the Metadata tab of the job definition) and storing it in Solr; this works fine.
>
> However, I also want to extract the file extension (using regex) and send it to Solr, but I am not able to add more than one path attribute name on the Metadata tab of the job definition. I already have one that extracts a particular folder name from the file path, and I would need a second one for the file extension.
>
> How would I be able to achieve this?
>
> Regards,
> vigi

--001a11415d58e3aa960517c5565c
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11415d58e3aa960517c5565c--