manifoldcf-user mailing list archives

From Marisol Redondo <marisol.redondo.gar...@gmail.com>
Subject Re: Metadata adjuster
Date Wed, 22 Feb 2017 15:12:38 GMT
Here you are:

View a Job



------------------------------

Name:

revenueToSites
------------------------------

Pipeline:

Stage  Type            Precedent  Description                        Connection name
1.     Repository                                                    Revenue Website
2.     Transformation  1.                                            Tikka Metadata Extractor
3.     Transformation  2.         Set mimeType and facetContentType  customField
4.     Output          3.                                            sites solr dev

Notifications:

Stage  Description  Connection name
No notification connections
------------------------------

Priority:

5

Start method:

Don't automatically start
------------------------------

Schedule type:

Scan every document once

Minimum recrawl interval:

Not applicable

Maximum recrawl interval:

Not applicable

Expiration interval:

Not applicable

Reseed interval:

Not applicable
------------------------------

No scheduled run times
------------------------------

Maximum hop count for link type 'link':

Unlimited

Maximum hop count for link type 'redirect':

Unlimited
------------------------------

Hop count mode:

Delete unreachable documents
------------------------------

1.

Seeds:

https://xxxxxx/index.aspx
<https://preview.revenuedomain.ie/en/press-office/index.aspx>
------------------------------

No canonicalization specified - all URLs will be reordered and have all
sessions removed
------------------------------

No mappings specified; will accept all URLs
------------------------------

Include only hosts matching seeds?

yes
------------------------------

Include in crawl:

.*
------------------------------

Include in index:

.*
------------------------------

Exclude from crawl:

\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script type="text/javascript">)
[?*!@=].*
------------------------------

Exclude from index:
------------------------------

Exclude content from index:
------------------------------

No access tokens specified
------------------------------

Excluded headers:

last-modified
------------------------------

2.

Field mappings:

Metadata field name

Final field name

No field mapping specified
------------------------------

Keep all metadata:

true
------------------------------

Lower names:

false
------------------------------

Write limit:
------------------------------

Ignore Tika exceptions:

true
------------------------------

Boilerplate extractor:

-- No extraction selected --
------------------------------

3.

Metadata expressions:

Parameter name

Remove this parameter?

Expression ("${fieldname}" references a field)

facetContentType

false

site.ie
------------------------------

Keep all incoming metadata

false

Remove empty metadata values

false
------------------------------

4.

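For reference, the Metadata expressions table in stage 3 maps a parameter name to an expression in which "${fieldname}" is replaced by the value of an incoming field; text without such a reference (here site.ie) is taken as a literal. A minimal Python sketch of that substitution follows. It is an illustration only, not ManifoldCF code: the expand function, the sample field name resourceName, the single-valued fields, and the choice to turn unknown references into an empty string are all assumptions.

```python
import re

# Matches "${fieldname}" references inside an expression.
FIELD_REF = re.compile(r"\$\{([^}]+)\}")

def expand(expression, fields):
    """Replace each ${name} with fields[name]; unknown names become '' (assumption)."""
    return FIELD_REF.sub(lambda m: fields.get(m.group(1), ""), expression)

# Hypothetical incoming metadata for one document.
incoming = {"resourceName": "introduction.pdf"}

# A literal expression has no references, so it passes through unchanged.
print(expand("site.ie", incoming))
# An expression with a reference picks up the incoming value.
print(expand("doc:${resourceName}", incoming))
```

Under these assumptions, the job above would be expected to emit facetContentType with the constant value site.ie for every document.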



    Marisol Redondo

    Email: mredondo@revenue.ie

    Phone: 35428



Please note that Revenue cannot guarantee that any personal and
sensitive data, sent in plain text via standard email, is fully
secure. Customers who choose to use this channel are deemed to have
accepted any risk involved. The alternative communication methods
offered by Revenue include standard post and the option to use our
(encrypted) MyEnquiries service which is available within myAccount
and ROS. You can register for either myAccount or ROS on the Revenue
website.




On 22 February 2017 at 14:53, Karl Wright <daddywri@gmail.com> wrote:

> Hi Marisol,
>
> The [INFO] log entries indicate that your document has almost no metadata
> at all.  But the Metadata Adjuster transformation connector is designed to
> do exactly what you want.
>
> Can you view your job, and cut and paste the View Job page into an email,
> so I can see how your metadata adjuster transformation connection and your
> solr output connections are configured?  Thanks!
>
> Karl
>
>
>
>
> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <
> marisol.redondo.garcia@gmail.com> wrote:
>
>> Hi Karl, and thank you for the quick answer.
>>
>> I was reading the documentation for MCF 1.10, but I'm using MCF 2.5 (sorry
>> for the confusion), and I think this version is compatible with Solr 6.
>> The PDF doesn't have any metadata or field called facetContentType; that
>> is why I've been trying to use the Metadata Adjuster to add a new
>> metadata/property to the document, so that Solr can index by this field
>> when I'm injecting the document.
>> Should I use another transformation, or is there another way of doing it?
>> I am migrating from Nutch to ManifoldCF. In Nutch we can do this with
>> plugins, and I was thinking that plugins in Nutch are the same as
>> transformation connectors in MCF.
>>
>> The complete error in Solr is:
>>
>> 2017-02-21 13:19:32.108 INFO  (qtp1854778591-18) [   x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>> 2017-02-21 13:19:32.454 INFO  (qtp1854778591-18) [   x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites]  webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [   x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>         at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>         at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>
>>
>>
>> Thanks
>>
>>
>> On 21 February 2017 at 14:52, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Marisol,
>>>
>>> Can you find the [INFO] entry in the Solr log for this document?  That
>>> should help clear up any confusion.
>>>
>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up to
>>> date with Solr 6.x.  That could be the source of the problem.  Is there any
>>> reason you are using a 1.x version of MCF?
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
>>> marisol.redondo.garcia@gmail.com> wrote:
>>>
>>>> Hi.
>>>>
>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr index,
>>>> but it doesn't inject the field into the Solr field.
>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>> read in the documentation
>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>> that I can add metadata to the document that is going to be indexed into
>>>> Solr, but the Solr instance gave me the error "missing required field:
>>>> facetContentType".
>>>>
>>>> ManifoldCF Job pipeline:
>>>> 1. Repository (type web repository)
>>>> 2. Transformation (Tikka Metadata Extractor)
>>>> 3. Transformation (type Metadata Adjuster)
>>>> 4. Output (Solr 6)
>>>>
>>>> ManifoldCF Job Metadata Expressions tab:
>>>>   Parameter name: "facetContentType"
>>>>   Remove this parameter: false
>>>>   Expression: xxxx  (the literal text value I want in facetContentType)
>>>>
>>>> Solr schema:
>>>>   .....
>>>>   <field name="facetContentType" type="string" indexed="true" stored="true" required="true"/>
>>>>  ....
>>>>
>>>> The error logged in ManifoldCF is:
>>>>       Error from server at http://solrServer:port/solr/core
>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>> missing required field: facetContentType.
>>>>
>>>> Thanks for your help
>>>>
>>>
>>>
>>
>
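
The /update/extract request in the quoted Solr log carries field values as literal.<fieldname> request parameters, and only literal.id appears there. A minimal Python sketch that lists the literal fields in such a parameter string and reports which required ones are absent is given below. The helper names and the placeholder document URL are assumptions for illustration; the parameter string is modelled on the logged one, not copied from a live request.

```python
from urllib.parse import parse_qs

def literal_fields(params):
    """Return the field names sent as literal.<name> request parameters."""
    prefix = "literal."
    return {key[len(prefix):] for key in parse_qs(params) if key.startswith(prefix)}

def missing_required(params, required):
    """Return the required field names that the request does not carry, sorted."""
    return sorted(set(required) - literal_fields(params))

# Modelled on the logged request; the document URL is a placeholder.
logged = (
    "resource.name=introduction.pdf"
    "&literal.id=https://example.invalid/introduction.pdf"
    "&wt=xml&version=2.2"
)

# Fields the schema marks as required in this sketch.
print(missing_required(logged, ["id", "facetContentType"]))
```

If the adjusted metadata reached the output stage, a literal.facetContentType parameter would be expected in the logged params, so a check like this can help tell whether the field is being dropped before the request is sent or rejected by Solr afterwards.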
