manifoldcf-user mailing list archives

From Gustavo Beneitez <gustavo.benei...@gmail.com>
Subject Re: Documents that didn't change are reindexed
Date Thu, 23 Aug 2018 09:56:03 GMT
Thanks Karl,

I've launched the job a couple of times with a small set of documents,
and what I see is that Elasticsearch reindexes every document on every run,
even though the size of each document is always the same and I don't notice
any dynamic HTML content, such as the current time, that could cause the
checksum to differ.

The "Simple history" menu option shows that the Elasticsearch output
connector is called:
"08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
So I guess there is a misconfiguration somewhere...



On Thu, Aug 23, 2018 at 1:45, Karl Wright (<daddywri@gmail.com>)
wrote:

> Hi Gustavo,
>
> I take it from your question that you are using the Web Connector?
>
> All connectors create a version string that is used to determine whether
> content needs to be reindexed or not.  The Web Connector's version string
> uses a checksum of the page contents; we found the "last modified" header
> to be unreliable, if I recall correctly.
>
> Thanks,
> Karl
>
>
> On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <
> gustavo.beneitez@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am currently creating a job that indexes part of the content of a
>> Liferay intranet.
>> Every time the job is executed, the documents are fully reindexed in
>> Elasticsearch, even though they didn't change.
>> I thought I had read somewhere that the crawler uses the "Last-Modified"
>> HTTP header, but also that it saves a hash in the database.
>> I looked for the answer in the user's manual but had no luck, so
>> could you please tell me which one is correct?
>>
>> Thanks in advance!
>>
>
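
To illustrate the checksum-based version string Karl describes, here is a minimal sketch of the general idea. It is not ManifoldCF's actual implementation: the function names, the use of SHA-256, and the in-memory dictionary standing in for the crawler's database are all assumptions made for illustration.

```python
import hashlib


def version_string(content: bytes) -> str:
    """Hypothetical version string: a hex SHA-256 checksum of the page body."""
    return hashlib.sha256(content).hexdigest()


def needs_reindex(url: str, content: bytes, stored_versions: dict) -> bool:
    """Return True if the page should be sent to the output connector.

    stored_versions is a stand-in for the crawler's database: it maps
    each URL to the version string recorded on the previous crawl.
    """
    new_version = version_string(content)
    if stored_versions.get(url) == new_version:
        # Same checksum as last time: the content is unchanged, skip it.
        return False
    # First crawl or changed content: remember the new version and reindex.
    stored_versions[url] = new_version
    return True


if __name__ == "__main__":
    store = {}
    page = b"<html><body>Intranet news</body></html>"
    print(needs_reindex("http://example.invalid/news", page, store))
    print(needs_reindex("http://example.invalid/news", page, store))
    # Any byte that differs between fetches (a timestamp, a session token)
    # changes the checksum and forces a reindex.
    dynamic = b"<html><body>Intranet news <!-- 09:56:03 --></body></html>"
    print(needs_reindex("http://example.invalid/news", dynamic, store))
```

Under a scheme like this, a page that looks identical and has the same size can still be reindexed on every run if even one byte of the fetched response differs between crawls.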
