manifoldcf-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Windows-Share to Solr is not working properly
Date Fri, 28 Mar 2014 22:44:40 GMT
Hi Alexander,

Which version of Solr are you using?

Please try these steps:

1) Set literalsOverride=true in solrconfig.xml (in the "defaults" section of the extraction request handler)

2) Set fmap.date=ignored_date in solrconfig.xml (in the "defaults" section of the extraction request handler)
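
A minimal sketch of steps 1 and 2 together. It assumes the handler is registered at /update/extract and that your schema defines an ignored_* dynamic field (as the stock example schema does) to swallow the remapped value; adjust both names to your setup:

```xml
<!-- solrconfig.xml: sketch only; the handler name and the ignored_* field are assumptions -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Step 1: literal.* values sent by the client replace Tika-extracted values of the same field -->
    <str name="literalsOverride">true</str>
    <!-- Step 2: rename Tika's "date" field to ignored_date so it never reaches the real date field -->
    <str name="fmap.date">ignored_date</str>
  </lst>
</requestHandler>
```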

If neither of the above works, don't worry: the following will work for sure. FirstFieldValueUpdateProcessorFactory
keeps only the first value of a multivalued field, turning it into a single-valued one.

<!-- Chain that keeps only the first value of "date" before the document is indexed -->
<updateRequestProcessorChain name="remove">
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">date</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Route extracted documents through the "remove" chain -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">remove</str>
  </lst>
</requestHandler>
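
To illustrate the effect, here is a hypothetical document written in Solr's XML update syntax purely for readability (the id and timestamps are made up, and the extraction handler does not literally produce this message). If extraction yields two values for date, the chain keeps just the first one before indexing:

```xml
<!-- Hypothetical input to the "remove" chain: "date" carries two values -->
<add>
  <doc>
    <field name="id">file://server/share/report.pdf</field>
    <field name="date">2014-03-28T10:13:00Z</field>
    <field name="date">2014-03-28T10:13:00Z</field>
  </doc>
</add>

<!-- What reaches the index after FirstFieldValueUpdateProcessorFactory: one value only -->
<add>
  <doc>
    <field name="id">file://server/share/report.pdf</field>
    <field name="date">2014-03-28T10:13:00Z</field>
  </doc>
</add>
```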

Ahmet

On Friday, March 28, 2014 6:53 PM, Karl Wright <daddywri@gmail.com> wrote:

Hi Alexander,

I do understand your problem.  But I assure you that ManifoldCF does not (and never did)
extract metadata fields from binary documents.  Are you sure this is happening in ManifoldCF? 
Perhaps you have a Tika pipeline configured in Solr?

Karl





On Fri, Mar 28, 2014 at 11:47 AM, Alexander Stoffers <stoffers@modell-aachen.de> wrote:

Hi Karl,
>
>Thank you for your quick response!
>
>I'm sorry for my poor English, but I will try to make it clearer:
>
>I actually don't understand where ManifoldCF processes or maps a metadata field "date"
>after crawling a PDF document. We explored the issue and figured out that somewhere
>in the process the metadata field "ModDate" of the document itself is mapped to the metadata
>field "date". Furthermore, this magic "date" field ends up as an array.
>
>If we delete the metadata field "ModDate" from the document, the metadata field "date" used
>in the ManifoldCF process disappears.
>
>If we don't delete the field "ModDate" from the document and try to map the field "date"
>to something else or to blank, the date field is still passed to the Solr output connector, so
>Solr fails, because the date field is an array and the Solr schema expects a single value
>for its date field.
>
>I hope that I could explain our problem a little bit better :-)
>
>Best Regards
>Alex
>
>----- Original Message -----
>From: "Karl Wright" <daddywri@gmail.com>
>To: user@manifoldcf.apache.org
>Sent: Friday, March 28, 2014 15:29:11
>Subject: Re: Windows-Share to Solr is not working properly
>
>
>Hi Alexander,
>
>It's hard to figure out exactly what you have configured from your email,
>but here are a couple of points:
>
>(1) ManifoldCF does not extract dates from binary files; it will only
>supply dates from file metadata.  So MCF is supplying the date from the
>modification date of the Windows file.
>(2) The JCIFS connector provides the same metadata date value in two ways:
>
>    rd.addField("lastModified", lastModifiedDate.toString());
>    rd.setModifiedDate(lastModifiedDate);
>
>This was done for backwards compatibility reasons.  You can control which
>metadata value name is used for the ModifiedDate field on the Solr
>connection's Schema tab.
>
>As for the "lastModified" data, you can either map that to a field you
>don't have in your solr schema, or you can suppress it entirely by creating
>an entry for Field Mapping that has "lastModified" on the left and a blank
>field on the right, and then clicking the "Add" button.  Bear in mind that
>1.5 had a bug in this functionality which was fixed in 1.5.1.
>
>Karl
>
>
>
>
>On Fri, Mar 28, 2014 at 10:13 AM, Alexander Stoffers <
>stoffers@modell-aachen.de> wrote:
>
>> Hi Karl,
>>
>> We have a problem with crawling documents from a Windows share to Solr.
>>
>> Our Solr schema has a date field that is not multivalued, but the output
>> of the crawled (e.g. pdf) document has a date array instead of a single
>> date.
>>
>> I tried to remove the whole field on the "Solr Field Mapping" tab,
>> using date=>'', but it is not working at all. Can't I remove the date
>> metadata at all?
>>
>> We figured out that the crawler gets the date metadata field out of the
>> binaries, where we found a field called ModDate. If we remove the ModDate
>> field from the binaries, the date metadata field disappears.
>>
>> Can you explain, why the crawler puts the ModDate twice in the date field
>> array?
>>
>>
>> Thank you in advance
>> Alex
>>
>>
>>
>> --
>>
>> Dipl.-Wirt.-Ing. Alexander Stoffers
>> Head of IT & Product Development
>> Modell Aachen GmbH - Interaktive Managementsysteme
>> Dennewartstr. 25-27, 52068 Aachen
>> fon ++49 176 1011 9752, fax ++49 241 9148 8653
>> http://www.modell-aachen.de
>>
>> Managing Director: Dr.-Ing. Carsten Behrens
>> Amtsgericht Aachen, HRB 15622
>>
>> --
>>
>> Our IT support can be reached at
>> support@modell-aachen.de
>> +49 (0)241 53808720
>>
>
