manifoldcf-user mailing list archives

From Mike Caceres <>
Subject RE: Metadata expressions
Date Wed, 19 Aug 2015 14:26:07 GMT
Sure: apache-manifoldcf-2.1/multiprocess-file-example, using PostgreSQL as the database.
Mike

Date: Wed, 19 Aug 2015 10:02:02 -0400
Subject: Re: Metadata expressions

Hi Mike,
Can you tell me what version of MCF you are using?

On Wed, Aug 19, 2015 at 9:56 AM, Karl Wright <> wrote:
I've created a ticket, CONNECTORS-1229.  Will be looking at this shortly.

On Wed, Aug 19, 2015 at 8:21 AM, Mike Caceres <> wrote:

Thank you for the examples, Karl.
However, when I include this definition in the job definition and then run the job, ManifoldCF
seems to enter some kind of loop in the running state. In the manifoldcf.log file I see many
entries like this one:
FATAL 2015-08-19 07:51:48,231 (Worker thread '70') - org.apache.manifoldcf.crawlerthreads - Error tossed: null
java.lang.NullPointerException
        at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.append(
        at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.processExpression(
        at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.addOrReplaceDocumentWithException(
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(
        at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(
This may or may not be related to these earlier messages in the same log file:
>>>>>>
INFO 2015-08-19 07:47:47,307 (main) - org.apache.manifoldcf.root - Synchronization storage cleaned up
INFO 2015-08-19 07:48:07,830 (main) - org.apache.manifoldcf.root - Running...
INFO 2015-08-19 07:48:07,846 (main) - org.apache.manifoldcf.root - Running...
INFO 2015-08-19 07:48:07,994 (Agents thread) - - Cleaning up all process data
INFO 2015-08-19 07:48:08,036 (Agents thread) - - Cleanup complete
INFO 2015-08-19 07:48:08,064 (Agents thread) - - Starting cluster
INFO 2015-08-19 07:48:08,072 (Agents thread) - - Cluster start complete
INFO 2015-08-19 07:48:08,075 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
INFO 2015-08-19 07:48:08,088 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
INFO 2015-08-19 07:48:08,133 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
INFO 2015-08-19 07:48:08,182 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
ERROR 2015-08-19 07:48:44,184 (qtp858007949-11) - org.apache.manifoldcf.misc - Missing resource 'ForcedMetadata.ForcedMetadataNameMustNotBeNull' in bundle 'org.apache.manifoldcf.agents.transformation.forcedmetadata.common' for locale 'en_US'
java.util.MissingResourceException: Can't find resource for bundle java.util.PropertyResourceBundle, key ForcedMetadata.ForcedMetadataNameMustNotBeNull
        at java.util.ResourceBundle.getObject(
        at java.util.ResourceBundle.getString(
        at org.apache.manifoldcf.core.i18n.Messages.getMessage(
        at org.apache.manifoldcf.core.i18n.Messages.getString(
        at org.apache.manifoldcf.core.i18n.Messages.getString(
        at org.apache.manifoldcf.ui.i18n.ResourceBundleWrapper.getString(
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at .......
<<<<<<
If I edit the job definition, remove the regular expression, and save the job, then almost
immediately I see these entries in the log:
>>>>>>
INFO 2015-08-19 07:52:28,300 (Finisher thread) - - Marked job 1439951495926 for shutdown
INFO 2015-08-19 07:52:28,434 (Job reset thread) - - Job 1439951495926 now completed
INFO 2015-08-19 07:52:38,332 (Job notification thread) - - Found job 1439951495926 in need of
Thank you,
Date: Wed, 19 Aug 2015 03:45:30 -0400
Subject: Re: Metadata expressions

Hi Mike,
The documentation (which does not seem to have been updated on the site yet) says the following:
>>>>>>
<p>You can also use regular expressions in the substitution string, for example: "${there|[0-9]*}",
which will extract the first sequence of sequential numbers it finds in the value of the field
"there", or "${there|string(.*)|1}", which will include everything following "string" in the
field value. (The third argument specifies the regular expression group number, with an
optional suffix of "l" or "u" meaning upper-case or lower-case.)</p>
<p>Enter a parameter name, and either select to remove the value or provide an expression.
If you chose to supply an expression, enter the expression in the box.
<<<<<<
To evaluate your regular expression with the specific input you gave, I typically use a regex
applet, if you can find a browser that still allows it:

Dropping your stuff in and clicking the "find()" button yields this:
"Pattern did not match"
So your regex is not correct.  But, "Protocol (\d+)" does match, with the following groups:
start() = 0, end() = 16
group(0) = "Protocol 1234500"
group(1) = "1234500"
So you want group 1.  Therefore, the MCF expression would be:
expression = Protocol-${protocol_name|Protocol (\d+)|1}
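The applet check above can also be reproduced without a browser. The following is a minimal sketch using plain java.util.regex; the class ProtocolRegexDemo and the helper firstGroup are hypothetical names for illustration and are not part of ManifoldCF, and the sketch only assumes that the expression is evaluated with find() semantics, as the applet's "find()" button suggests. It shows that "string(\d+)" does not match, because "string" is taken literally, while "Protocol (\d+)" does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtocolRegexDemo {
    // Hypothetical helper, not ManifoldCF code: returns the requested group of the
    // first match located by Matcher.find(), or null when the pattern does not match.
    public static String firstGroup(String regex, String value, int group) {
        Matcher m = Pattern.compile(regex).matcher(value);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String value = "Protocol 1234500 (USPA00012345) second version";

        // The literal text "string" followed by digits never occurs in the value.
        System.out.println(firstGroup("string(\\d+)", value, 0));        // null

        // Group 0 is the whole match, group 1 is only the captured digits.
        System.out.println(firstGroup("Protocol (\\d+)", value, 0));     // Protocol 1234500
        System.out.println("Protocol-" + firstGroup("Protocol (\\d+)", value, 1)); // Protocol-1234500
    }
}
```

The last line mirrors the expression above: the fixed prefix "Protocol-" followed by group 1 of the match.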


On Tue, Aug 18, 2015 at 11:19 PM, Mike Caceres <> wrote:

If I have a document with the following metadata values:
"protocol_name" : "Protocol 1234500 (USPA00012345) second version"
and I want to produce a new metadata field that looks like this:
"protocol_id" : "Protocol-1234500"
should the metadata expression look like this?
parameter name = protocol_id
remove this parameter = false
expression = Protocol-${protocol_name|string(\d+)|0}
Thank you!

