manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Metadata expressions
Date Wed, 19 Aug 2015 14:02:02 GMT
Hi Mike,
Can you tell me what version of MCF you are using?

Thanks,
Karl


On Wed, Aug 19, 2015 at 9:56 AM, Karl Wright <daddywri@gmail.com> wrote:

> I've created a ticket, CONNECTORS-1229.  Will be looking at this shortly.
>
> Karl
>
>
> On Wed, Aug 19, 2015 at 8:21 AM, Mike Caceres <miguel151@hotmail.com>
> wrote:
>
>> Thank you for the examples Karl.
>>
>> However, when I include this expression in the job definition and then
>> run the job, ManifoldCF seems to enter some kind of loop in the running
>> state. Looking at the manifoldcf.log file, I see many entries like this
>> one:
>>
>> >>>>>>
>>
>> FATAL 2015-08-19 07:51:48,231 (Worker thread '70') - org.apache.manifoldcf.crawlerthreads - Error tossed: null
>> java.lang.NullPointerException
>>         at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.append(ForcedMetadataConnector.java:646)
>>         at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.processExpression(ForcedMetadataConnector.java:678)
>>         at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.addOrReplaceDocumentWithException(ForcedMetadataConnector.java:134)
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221)
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3072)
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2706)
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1503)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1468)
>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1813)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:379)
>>
>> <<<<<<
>>
>> This may or may not be related to these earlier messages in the same log
>> file:
>>
>> >>>>>>
>>  INFO 2015-08-19 07:47:47,307 (main) - org.apache.manifoldcf.root - Synchronization storage cleaned up
>>  INFO 2015-08-19 07:48:07,830 (main) - org.apache.manifoldcf.root - Running...
>>  INFO 2015-08-19 07:48:07,846 (main) - org.apache.manifoldcf.root - Running...
>>  INFO 2015-08-19 07:48:07,994 (Agents thread) - org.apache.manifoldcf.jobs - Cleaning up all process data
>>  INFO 2015-08-19 07:48:08,036 (Agents thread) - org.apache.manifoldcf.jobs - Cleanup complete
>>  INFO 2015-08-19 07:48:08,064 (Agents thread) - org.apache.manifoldcf.jobs - Starting cluster
>>  INFO 2015-08-19 07:48:08,072 (Agents thread) - org.apache.manifoldcf.jobs - Cluster start complete
>>  INFO 2015-08-19 07:48:08,075 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
>>  INFO 2015-08-19 07:48:08,088 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
>>  INFO 2015-08-19 07:48:08,133 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
>>  INFO 2015-08-19 07:48:08,182 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
>> ERROR 2015-08-19 07:48:44,184 (qtp858007949-11) - org.apache.manifoldcf.misc - Missing resource 'ForcedMetadata.ForcedMetadataNameMustNotBeNull' in bundle 'org.apache.manifoldcf.agents.transformation.forcedmetadata.common' for locale 'en_US'
>> java.util.MissingResourceException: Can't find resource for bundle java.util.PropertyResourceBundle, key ForcedMetadata.ForcedMetadataNameMustNotBeNull
>>         at java.util.ResourceBundle.getObject(ResourceBundle.java:395)
>>         at java.util.ResourceBundle.getString(ResourceBundle.java:355)
>>         at org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:193)
>>         at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:240)
>>         at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:208)
>>         at org.apache.manifoldcf.ui.i18n.ResourceBundleWrapper.getString(ResourceBundleWrapper.java:44)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at .......
>> <<<<<<
>>
>> If I edit the job definition, remove the regular expression, and save
>> the job, then almost immediately I see these entries in the log:
>>
>> >>>>>>
>>  INFO 2015-08-19 07:52:28,300 (Finisher thread) - org.apache.manifoldcf.jobs - Marked job 1439951495926 for shutdown
>>  INFO 2015-08-19 07:52:28,434 (Job reset thread) - org.apache.manifoldcf.jobs - Job 1439951495926 now completed
>>  INFO 2015-08-19 07:52:38,332 (Job notification thread) - org.apache.manifoldcf.jobs - Found job 1439951495926 in need of notification
>> <<<<<<
>>
>> Thank you,
>>
>> Mike
>> ------------------------------
>> Date: Wed, 19 Aug 2015 03:45:30 -0400
>> Subject: Re: Metadata expressions
>> From: daddywri@gmail.com
>> To: user@manifoldcf.apache.org
>>
>>
>> Hi Mike,
>>
>> The documentation (which seems not to have updated on the site yet) says
>> the following:
>>
>> >>>>>>
>>                 <p>You can also use regular expressions in the
>> substitution string, for example: "${there|[0-9]*}", which will extract the
>> first sequence of sequential numbers it finds in the
>>                       value of the field "there", or
>> "${there|string(.*)|1}", which will include everything following "string"
>> in the field value.  (The third argument specifies the regular
>>                       expression group number, with an optional suffix of
>> "l" or "u" meaning lower-case or upper-case, respectively.)</p>
>>                 <p>Enter a parameter name, and either select to remove
>> the value or provide an expression.  If you choose to supply an expression,
>> enter the expression in the box.
>> <<<<<<
>>
>> To test your regular expression against the specific input you gave, I
>> typically use a regex applet, if you can find a browser that still allows
>> applets:
>>
>> http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
>>
>> Dropping your stuff in and clicking the "find()" button yields this:
>> "Pattern did not match"
>>
>> So your regex is not correct.  But, "Protocol (\d+)" does match, with the
>> following group outputs:
>>
>> start() = 0, end() = 16
>> group(0) = "Protocol 1234500"
>> group(1) = "1234500"
>>
>> So you want group 1.  Therefore, the MCF expression would be:
>>
>> expression = Protocol-${protocol_name|Protocol (\d+)|1}
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Tue, Aug 18, 2015 at 11:19 PM, Mike Caceres <miguel151@hotmail.com>
>> wrote:
>>
>> If I have a document with the following metadata values:
>> "protocol_name" : "Protocol 1234500 (USPA00012345) second version"
>>
>> and I want to produce a new metadata field that looks like this:
>>
>> "protocol_id" : "Protocol-1234500"
>>
>> should the metadata expression look like this?
>>
>> parameter name = protocol_id
>> remove this parameter = false
>> expression = Protocol-${protocol_name|string(\d+)|0}
>>
>> Thank you!
>>
>>
>>
>
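For anyone who cannot run the applet mentioned in the thread, the same check can be reproduced with plain java.util.regex. The sketch below is only an illustration, not ManifoldCF code: the class name MetadataExpressionSketch and the helper extract() are hypothetical, and returning an empty string when nothing matches is an assumption rather than documented connector behaviour. It runs find() with both patterns from the thread against the sample "protocol_name" value. The original pattern presumably fails because "string" is matched as literal text, which the field value does not contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetadataExpressionSketch {

    // Hypothetical helper, NOT part of ManifoldCF: returns the requested group
    // of the first match of regex in value, or "" if the pattern does not match
    // (the empty-string fallback is an assumption made for this sketch).
    static String extract(String value, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(value);
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        String protocolName = "Protocol 1234500 (USPA00012345) second version";

        // Pattern from the original question: "string" is literal text here,
        // so find() reports no match for this field value.
        boolean originalMatches =
            Pattern.compile("string(\\d+)").matcher(protocolName).find();
        System.out.println("string(\\d+) matches: " + originalMatches);

        // Pattern suggested in the reply: group 1 captures the digits.
        Matcher m = Pattern.compile("Protocol (\\d+)").matcher(protocolName);
        if (m.find()) {
            System.out.println("start() = " + m.start() + ", end() = " + m.end());
            System.out.println("group(0) = \"" + m.group(0) + "\"");
            System.out.println("group(1) = \"" + m.group(1) + "\"");
        }

        // Mimics the effect of Protocol-${protocol_name|Protocol (\d+)|1}.
        System.out.println("Protocol-" + extract(protocolName, "Protocol (\\d+)", 1));
    }
}
```

Note that backslashes are doubled only because the patterns are written as Java string literals; in the MCF expression box the pattern is entered as shown in the thread, with a single backslash.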
