manifoldcf-user mailing list archives

From Virgiliu R <gosuv...@hotmail.com>
Subject RE: Job definition metadata with multiple path attribute names
Date Fri, 05 Jun 2015 13:17:47 GMT
Dear Karl,

Maybe I misunderstood the purpose of the Metadata tab, but in my scenario I need to extract
two types of information from a document's path. Right now I am only able to extract one piece
of information and put it in Solr. It would have been very useful to be able to perform other
transformations on the paths, but it's OK; I can probably write a transformation connector
of my own.

Thanks,
vigi
Date: Fri, 5 Jun 2015 09:02:59 -0400
Subject: Re: Job definition metadata with multiple path attribute names
From: daddywri@gmail.com
To: user@manifoldcf.apache.org

Hi Vigi,

You get, for free, the file name of the document as metadata, from all repository connectors,
including the jcifs connector:

>>>>>>
                  rd.setFileName(fileNameString);
<<<<<<

The problem is that this is not something you can manipulate in MCF via regular expression
with the current bevy of supplied transformation connectors, because (a) it isn't generic
metadata but a fixed property of the document, and (b) the Metadata Transformer connector
doesn't allow you to slice and dice metadata in any case, just compose it into bigger strings.

So you're stuck with either writing a document transformation connector of your own, which
does what you want, or proposing additional functionality for the Metadata Transformer.  If
it can be done in a backwards compatible way, this is something I would support.

I'm not thrilled with the idea of extending the JCIFS connector to build multiple independent
attributes all from the path; the UI for this connector is already quite complex, and the
functionality for generically manipulating metadata would be useful in general anyway.
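
For illustration only, the slicing that a custom transformation connector would have to do
can be sketched in plain Java. This is a minimal sketch that deliberately uses no ManifoldCF
API; the class name PathAttributes, both regular expressions, and the sample path are
assumptions made up for the example, not anything taken from the JCIFS connector or the
Metadata Transformer.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives two independent attributes from one document path.
final class PathAttributes {
    // Assumed layout: the folder of interest sits directly under "/projects/".
    private static final Pattern FOLDER = Pattern.compile("/projects/([^/]+)/");
    // Text after the last dot of the final path segment.
    private static final Pattern EXTENSION = Pattern.compile("\\.([^./\\\\]+)$");

    // Returns capture group 1 of the first match, or empty if the pattern does not match.
    static Optional<String> firstGroup(Pattern pattern, String path) {
        Matcher m = pattern.matcher(path);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    static Optional<String> folder(String path) {
        return firstGroup(FOLDER, path);
    }

    // Lower-cased so that "PDF" and "pdf" end up as the same field value.
    static Optional<String> extension(String path) {
        return firstGroup(EXTENSION, path).map(s -> s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String path = "smb://fileserver/share/projects/alpha/report.PDF"; // made-up path
        System.out.println("folder=" + folder(path).orElse("(none)"));
        System.out.println("extension=" + extension(path).orElse("(none)"));
    }
}
```

Each attribute has its own pattern, so adding a third one is a matter of adding another
pattern rather than making a single expression do double duty. How the resulting values
would be attached to the document inside a connector is left out here on purpose.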

Karl


On Fri, Jun 5, 2015 at 8:37 AM, Virgiliu R <gosuvigi@hotmail.com> wrote:



Hello guys,

I have another ManifoldCF 2.0.2 question. Our process consists of indexing some documents
from a Windows share and sending them to Solr. I would like to extract some information from
the documents and put it into specific Solr fields. For example, based on the id of the document
I am currently extracting a specific folder name (using regular expressions on the Metadata
tab of the job definition) and storing it in Solr; this works fine.

However, I also want to extract the file extension (using regex) and send it to Solr but I
am not able to add more than one path attribute name on the Metadata tab of the job definition.
I already have one that extracts a particular folder name from the file path and I would need
a second one for the file extension.

How would I be able to achieve this?

Regards,
vigi
 		 	   		  

 		 	   		  