manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Metadata expressions
Date Wed, 19 Aug 2015 13:56:52 GMT
I've created a ticket, CONNECTORS-1229.  Will be looking at this shortly.

Karl


On Wed, Aug 19, 2015 at 8:21 AM, Mike Caceres <miguel151@hotmail.com> wrote:

> Thank you for the examples Karl.
>
> However, when I include this definition in the job definition and then run
> the job, ManifoldCF seems to enter some kind of loop in the running state.
> The manifoldcf.log file shows many entries like this one:
>
> >>>>>>
>
> FATAL 2015-08-19 07:51:48,231 (Worker thread '70') - org.apache.manifoldcf.crawlerthreads - Error tossed: null
> java.lang.NullPointerException
>         at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.append(ForcedMetadataConnector.java:646)
>         at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.processExpression(ForcedMetadataConnector.java:678)
>         at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.addOrReplaceDocumentWithException(ForcedMetadataConnector.java:134)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3072)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2706)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1503)
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1468)
>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1813)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:379)
>
> <<<<<<
>
> This may or may not be related to these earlier messages in the same log
> file:
>
> >>>>>>
>  INFO 2015-08-19 07:47:47,307 (main) - org.apache.manifoldcf.root - Synchronization storage cleaned up
>  INFO 2015-08-19 07:48:07,830 (main) - org.apache.manifoldcf.root - Running...
>  INFO 2015-08-19 07:48:07,846 (main) - org.apache.manifoldcf.root - Running...
>  INFO 2015-08-19 07:48:07,994 (Agents thread) - org.apache.manifoldcf.jobs - Cleaning up all process data
>  INFO 2015-08-19 07:48:08,036 (Agents thread) - org.apache.manifoldcf.jobs - Cleanup complete
>  INFO 2015-08-19 07:48:08,064 (Agents thread) - org.apache.manifoldcf.jobs - Starting cluster
>  INFO 2015-08-19 07:48:08,072 (Agents thread) - org.apache.manifoldcf.jobs - Cluster start complete
>  INFO 2015-08-19 07:48:08,075 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
>  INFO 2015-08-19 07:48:08,088 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
>  INFO 2015-08-19 07:48:08,133 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
>  INFO 2015-08-19 07:48:08,182 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
> ERROR 2015-08-19 07:48:44,184 (qtp858007949-11) - org.apache.manifoldcf.misc - Missing resource 'ForcedMetadata.ForcedMetadataNameMustNotBeNull' in bundle 'org.apache.manifoldcf.agents.transformation.forcedmetadata.common' for locale 'en_US'
> java.util.MissingResourceException: Can't find resource for bundle java.util.PropertyResourceBundle, key ForcedMetadata.ForcedMetadataNameMustNotBeNull
>         at java.util.ResourceBundle.getObject(ResourceBundle.java:395)
>         at java.util.ResourceBundle.getString(ResourceBundle.java:355)
>         at org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:193)
>         at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:240)
>         at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:208)
>         at org.apache.manifoldcf.ui.i18n.ResourceBundleWrapper.getString(ResourceBundleWrapper.java:44)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at .......
> <<<<<<
>
> If I edit the job definition, remove the regular expression, and save
> the job, these entries appear in the log almost immediately:
>
> >>>>>>
>  INFO 2015-08-19 07:52:28,300 (Finisher thread) - org.apache.manifoldcf.jobs - Marked job 1439951495926 for shutdown
>  INFO 2015-08-19 07:52:28,434 (Job reset thread) - org.apache.manifoldcf.jobs - Job 1439951495926 now completed
>  INFO 2015-08-19 07:52:38,332 (Job notification thread) - org.apache.manifoldcf.jobs - Found job 1439951495926 in need of notification
> <<<<<<
>
> Thank you,
>
> Mike
> ------------------------------
> Date: Wed, 19 Aug 2015 03:45:30 -0400
> Subject: Re: Metadata expressions
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
>
>
> Hi Mike,
>
> The documentation (which does not seem to have been updated on the site
> yet) says the following:
>
> >>>>>>
>                 <p>You can also use regular expressions in the substitution string, for example:
> "${there|[0-9]*}", which will extract the first sequence of consecutive digits it finds in the
> value of the field "there", or "${there|string(.*)|1}", which will include everything following
> "string" in the field value.  (The third argument specifies the regular expression group number,
> with an optional suffix of "l" or "u" meaning lower-case or upper-case.)</p>
>                 <p>Enter a parameter name, and either select to remove the value or provide an
> expression.  If you choose to supply an expression, enter the expression in the box.
> <<<<<<
>
> To check a regular expression against a specific input like the one you
> gave, I typically use a regex applet (if you can find a browser that still
> allows applets):
>
> http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
>
> Dropping your stuff in and clicking the "find()" button yields this:
> "Pattern did not match"
>
> So your regex is not correct.  But, "Protocol (\d+)" does match, with the
> following group outputs:
>
> start() = 0, end() = 16
> group(0) = "Protocol 1234500"
> group(1) = "1234500"
>
> So you want group 1.  Therefore, the MCF expression would be:
>
> expression = Protocol-${protocol_name|Protocol (\d+)|1}
>
> Thanks,
> Karl
>
>
>
> On Tue, Aug 18, 2015 at 11:19 PM, Mike Caceres <miguel151@hotmail.com>
> wrote:
>
> If I have a document with the following metadata values:
> "protocol_name" : "Protocol 1234500 (USPA00012345) second version"
>
> and I want to produce a new metadata field that looks like this:
>
> "protocol_id" : "Protocol-1234500"
>
> should the metadata expression look like this?
>
> parameter name = protocol_id
> remove this parameter = false
> expression = Protocol-${protocol_name|string(\d+)|0}
>
> Thank you!
>
>
>
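
The regex check from Karl's reply can also be reproduced without the applet, using plain java.util.regex. The following is only a minimal sketch: the class name RegexCheck and the helper extract() are hypothetical names made up for illustration, and the code does not call or imitate ManifoldCF's own expression handling. It shows that "string(\d+)" fails because "string" is matched as literal text (which the sample value does not contain), while "Protocol (\d+)" matches and returns the digits in group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of ManifoldCF.
class RegexCheck {
    // Returns the requested group of the first match found in the input,
    // or null when the pattern does not match anywhere.
    static String extract(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Sample field value taken from the original question.
        String value = "Protocol 1234500 (USPA00012345) second version";

        // "string" is literal text here, so this pattern finds nothing.
        System.out.println(extract("string(\\d+)", value, 0));

        // Group 0 is the whole match, group 1 is only the captured digits.
        System.out.println(extract("Protocol (\\d+)", value, 0));
        System.out.println(extract("Protocol (\\d+)", value, 1));

        // The literal prefix in the expression is then prepended to group 1.
        System.out.println("Protocol-" + extract("Protocol (\\d+)", value, 1));
    }
}
```

Note that backslashes are doubled only because the patterns are Java string literals; in the ManifoldCF expression box the regex is written as shown in the thread, with a single backslash.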
