manifoldcf-user mailing list archives

From Marisol Redondo <marisol.redondo.gar...@gmail.com>
Subject Re: Metadata adjuster
Date Wed, 22 Feb 2017 15:37:46 GMT
I tried with "Keep all incoming metadata" set to both false and true,
but I'll take your advice and set it to true.

I don't know why you can't see it, but the Solr output is the 4th stage.
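
For readers following the thread, here is a minimal sketch of what the "Keep all incoming metadata" flag changes. It is a simplified, hypothetical model, not the Metadata Adjuster's actual code: the function name adjust_metadata, the single-valued fields, and the sample Tika fields are all assumptions made for illustration. Only the parameter name facetContentType and the expression site.ie come from the job shown in this thread.

```python
import re


def adjust_metadata(incoming, expressions, removed=(), keep_all=True):
    """Simplified, hypothetical model of a metadata-adjusting stage."""
    # Start from the incoming fields only when "Keep all incoming metadata" is on.
    result = dict(incoming) if keep_all else {}
    # Parameters flagged "Remove this parameter?" are dropped.
    for name in removed:
        result.pop(name, None)
    # Each expression sets a field; "${field}" is replaced by the incoming value.
    for name, expr in expressions.items():
        result[name] = re.sub(
            r"\$\{([^}]+)\}",
            lambda m: str(incoming.get(m.group(1), "")),
            expr,
        )
    return result


# Made-up fields standing in for whatever the Tika stage extracted.
tika_fields = {"Content-Type": "application/pdf", "title": "Introduction"}
rules = {"facetContentType": "site.ie"}

kept = adjust_metadata(tika_fields, rules, keep_all=True)
dropped = adjust_metadata(tika_fields, rules, keep_all=False)
```

In this model the facetContentType value is produced under either setting; the flag only decides whether the fields coming from the earlier stage survive.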

On 22 February 2017 at 15:26, Karl Wright <daddywri@gmail.com> wrote:

> Hi Marisol,
>
> Some observations.
> (1) It makes no sense to have "Keep all incoming metadata" set to false,
> since that will filter out everything that your tika extractor extracts.  I
> doubt that is what you have intended.
> (2) I can't see the Solr output configuration -- looks like it got
> truncated?
>
> Thanks,
> Karl
>
>
> On Wed, Feb 22, 2017 at 10:12 AM, Marisol Redondo <
> marisol.redondo.garcia@gmail.com> wrote:
>
>> Here you are:
>>
>> View a Job
>>
>> ------------------------------
>>
>> Name:
>>
>> revenueToSites
>> ------------------------------
>>
>> Pipeline:
>>
>> Stage | Type           | Precedent | Description                       | Connection name
>> 1.    | Repository     |           |                                   | Revenue Website
>> 2.    | Transformation | 1.        |                                   | Tikka Metadata Extractor
>> 3.    | Transformation | 2.        | Set mimeType and facetContentType | customField
>> 4.    | Output         | 3.        |                                   | sites solr dev
>>
>> Notifications:
>>
>> Stage | Description | Connection name
>> No notification connections
>> ------------------------------
>>
>> Priority: 5
>> Start method: Don't automatically start
>> ------------------------------
>> Schedule type: Scan every document once
>> Minimum recrawl interval: Not applicable
>> Maximum recrawl interval: Not applicable
>> Expiration interval: Not applicable
>> Reseed interval: Not applicable
>> ------------------------------
>> No scheduled run times
>> ------------------------------
>> Maximum hop count for link type 'link': Unlimited
>> Maximum hop count for link type 'redirect': Unlimited
>> ------------------------------
>> Hop count mode: Delete unreachable documents
>> ------------------------------
>>
>> 1.
>>
>> Seeds:
>> https://xxxxxx/index.aspx <https://preview.revenuedomain.ie/en/press-office/index.aspx>
>> ------------------------------
>> No canonicalization specified - all URLs will be reordered and have all sessions removed
>> ------------------------------
>> No mappings specified; will accept all URLs
>> ------------------------------
>> Include only hosts matching seeds? yes
>> ------------------------------
>> Include in crawl: .*
>> ------------------------------
>> Include in index: .*
>> ------------------------------
>> Exclude from crawl:
>> \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script type="text/javascript">)
>> [?*!@=].*
>> ------------------------------
>> Exclude from index:
>> ------------------------------
>> Exclude content from index:
>> ------------------------------
>> No access tokens specified
>> ------------------------------
>> Excluded headers: last-modified
>> ------------------------------
>>
>> 2.
>>
>> Field mappings:
>> Metadata field name | Final field name
>> No field mapping specified
>> ------------------------------
>> Keep all metadata: true
>> ------------------------------
>> Lower names: false
>> ------------------------------
>> Write limit:
>> ------------------------------
>> Ignore Tika exceptions: true
>> ------------------------------
>> Boilerplate extractor: -- No extraction selected --
>> ------------------------------
>>
>> 3.
>>
>> Metadata expressions:
>> Parameter name   | Remove this parameter? | Expression ("${fieldname}" references a field)
>> facetContentType | false                  | site.ie
>> ------------------------------
>> Keep all incoming metadata: false
>> Remove empty metadata values: false
>> ------------------------------
>>
>> 4.
>>
>>
>>     Marisol Redondo
>>
>>     Email: mredondo@revenue.ie
>>
>>     Phone: 35428
>>
>>
>> On 22 February 2017 at 14:53, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Marisol,
>>>
>>> The [INFO] log entries indicate that your document has almost no
>>> metadata at all.  But the Metadata Adjuster transformation connector is
>>> designed to do exactly what you want.
>>>
>>> Can you view your job, and cut and paste the View Job page into an
>>> email, so I can see how your metadata adjuster transformation connection
>>> and your solr output connections are configured?  Thanks!
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <
>>> marisol.redondo.garcia@gmail.com> wrote:
>>>
>>>> Hi Karl, and thank you for the quick answer.
>>>>
>>>> I was reading the documentation for MCF 1.10, but I'm using MCF 2.5,
>>>> sorry for the confusion, and I think this version is compatible with Solr 6.
>>>> The PDF doesn't have any metadata or field called facetContentType,
>>>> which is why I have been trying to use the Metadata Adjuster to add a new
>>>> metadata property to the doc, so that Solr can index by this field when I'm
>>>> injecting the doc.
>>>> Should I use another transformation, or is there some other way of doing it?
>>>> I am migrating from Nutch to ManifoldCF. In Nutch we can do this with
>>>> plugins, and I was assuming that Nutch plugins are the equivalent of the
>>>> transformation connectors in MCF.
>>>>
>>>> The complete error in Solr is:
>>>>
>>>> 2017-02-21 13:19:32.108 INFO  (qtp1854778591-18) [   x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>>> 2017-02-21 13:19:32.454 INFO  (qtp1854778591-18) [   x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites]  webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [   x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>>>         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>>>         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>>>         at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>>>         at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On 21 February 2017 at 14:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Marisol,
>>>>>
>>>>> Can you find the [INFO] entry in the Solr log for this document?  That
>>>>> should help clear up any confusion.
>>>>>
>>>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up
>>>>> to date with Solr 6.x.  That could be the source of the problem.  Is there
>>>>> any reason you are using a 1.x version of MCF?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
>>>>> marisol.redondo.garcia@gmail.com> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>>>>>> index, but it doesn't inject the field into the Solr field.
>>>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>>>> read in the documentation
>>>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>>>> that I can add metadata to the document that is going to be indexed into
>>>>>> Solr, but the Solr instance gave me the error "missing required field:
>>>>>> facetContentType".
>>>>>>
>>>>>> ManifoldCF Job pipeline:
>>>>>> 1. Repository (type web repository)
>>>>>> 2. Transformation (Tikka Metadata Extractor)
>>>>>> 3. Transformation (type Metadata Adjuster)
>>>>>> 4. Output (Solr 6)
>>>>>>
>>>>>> ManifoldCF Job Metadata Expressions tab:
>>>>>>   Parameter name: "facetContentType"
>>>>>>   Remove this parameter: false
>>>>>>   Expression: xxxx  (the literal text value I want in facetContentType)
>>>>>>
>>>>>> Solr schema:
>>>>>>   .....
>>>>>>   <field name="facetContentType" type="string" indexed="true" stored="true" required="true"/>
>>>>>>  ....
>>>>>>
>>>>>> The error logged in ManifoldCF is:
>>>>>>       Error from server at http://solrServer:port/solr/core
>>>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>>>> missing required field: facetContentType.
>>>>>>
>>>>>> Thanks for your help
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
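
The Solr log quoted in this thread shows the /update/extract request arriving with literal.id as its only literal.* parameter. Solr's extracting request handler takes extra field values as literal.<fieldname> request parameters, so one quick check is to list which literal fields a logged request actually carried. The following is a minimal sketch, assuming the params={...} string format seen in that log; the helper name literal_fields is hypothetical, and the elided URL is left exactly as it was posted.

```python
from urllib.parse import parse_qs


def literal_fields(params):
    """Return the literal.* fields found in a logged Solr params string."""
    # Drop the surrounding braces from the log, then parse as a query string.
    parsed = parse_qs(params.strip("{}"), keep_blank_values=True)
    prefix = "literal."
    return {
        key[len(prefix):]: values
        for key, values in parsed.items()
        if key.startswith(prefix)
    }


# The params string as it appears in the quoted Solr log.
logged = (
    "{resource.name=introduction.pdf"
    "&literal.id=https://...../introduction.pdf"
    "&wt=xml&version=2.2}"
)

fields = literal_fields(logged)
# True when the request carried no value for the required field.
missing = "facetContentType" not in fields
```

If the same check against a later log entry lists facetContentType among the literal fields, the value is reaching Solr and the remaining problem would lie on the Solr side rather than in the job's pipeline.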
