Did you check the "Last-Modified" header returned with the documents from Liferay? Some systems always set it to the current time, which could explain the behavior (this might even be a configuration option). You can see this in a browser's debug window when you reload the page a couple of times (Ctrl+F5 forces a reload).
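
If you would rather check this in code than in the browser, here is a minimal sketch that compares a response's Last-Modified value with its Date value. The class name, method name, tolerance, and sample header values are all made up for illustration; nothing here comes from Liferay or ManifoldCF.

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class LastModifiedCheck {
    // Returns true when Last-Modified lies within `tolerance` of the Date header,
    // which suggests the server stamps every response with the current time.
    public static boolean looksLikeNow(String lastModified, String date, Duration tolerance) {
        ZonedDateTime modified = ZonedDateTime.parse(lastModified, DateTimeFormatter.RFC_1123_DATE_TIME);
        ZonedDateTime served = ZonedDateTime.parse(date, DateTimeFormatter.RFC_1123_DATE_TIME);
        return Duration.between(modified, served).abs().compareTo(tolerance) <= 0;
    }

    public static void main(String[] args) {
        // Hypothetical header values, as they might be copied from the browser's network tab.
        String date = "Thu, 23 Aug 2018 12:18:06 GMT";
        String suspicious = "Thu, 23 Aug 2018 12:18:05 GMT";
        String plausible = "Wed, 22 Aug 2018 09:00:00 GMT";
        Duration tolerance = Duration.ofSeconds(5);

        System.out.println("suspicious looks like now: " + looksLikeNow(suspicious, date, tolerance));
        System.out.println("plausible looks like now:  " + looksLikeNow(plausible, date, tolerance));
    }
}
```

If the value tracks the Date header on every reload, the server is most likely generating it per request.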


From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, August 23, 2018 14:18
To: user@manifoldcf.apache.org
Subject: [External] Re: Documents that didn't change are reindexed
I would suggest downloading the pages using curl a couple of times and comparing content.
Headers also matter.  Here's the code:

            // Calculate version from document data, which is presumed to be present.
            StringBuilder sb = new StringBuilder();

            // Acls
            if (acls.length > 0)

            // Now, do the metadata. 
            Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();
            String[] fixedListStrings = new String[2];
            // They're all folded into the same part of the version string.
            int headerCount = 0;
            Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
                headerCount += fetchStatus.headerData.get(headerName).size();
            }
            String[] fullMetadata = new String[headerCount];
            headerCount = 0;
            headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
              {
                Set<String> valueSet = metaHash.get(headerName);
                if (valueSet == null)
                  valueSet = new HashSet<String>();
                List<String> headerValues = fetchStatus.headerData.get(headerName);
                for (String headerValue : headerValues)
                {
                  fixedListStrings[0] = "header-"+headerName;
                  fixedListStrings[1] = headerValue;
                  StringBuilder newsb = new StringBuilder();
                  fullMetadata[headerCount++] = newsb.toString();
                }
              }
            }
            // Done with the parseable part!  Add the checksum.
            // Add the filter version
            String versionString = sb.toString();

The "filter version" comes from your job specification and will change only if you change the job specification, but everything else should be self-explanatory.  Looks like all headers matter, so that could explain it.
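
To see which headers actually differ between two fetches of the same page, something like the following could help. This is only a sketch, not connector code: the class and method names are made up, and it uses the JDK's java.net.http client (Java 11 or later) rather than anything from ManifoldCF. Pass the page URL as the first argument.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class HeaderDiff {
    // Copies the header map with lower-cased names so the comparison ignores case.
    static Map<String, List<String>> lowerCaseKeys(Map<String, List<String>> headers) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            result.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
        return result;
    }

    // Returns the lower-cased names of headers that were added, removed, or changed.
    public static Set<String> changedHeaders(Map<String, List<String>> first,
                                             Map<String, List<String>> second) {
        Map<String, List<String>> a = lowerCaseKeys(first);
        Map<String, List<String>> b = lowerCaseKeys(second);
        Set<String> names = new TreeSet<>(a.keySet());
        names.addAll(b.keySet());
        Set<String> changed = new TreeSet<>();
        for (String name : names) {
            if (!Objects.equals(a.get(name), b.get(name))) {
                changed.add(name);
            }
        }
        return changed;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java HeaderDiff <url>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0])).GET().build();
        // Fetch the same page twice and keep only the response headers.
        Map<String, List<String>> first =
            client.send(request, HttpResponse.BodyHandlers.discarding()).headers().map();
        Map<String, List<String>> second =
            client.send(request, HttpResponse.BodyHandlers.discarding()).headers().map();
        System.out.println("Headers that differ: " + changedHeaders(first, second));
    }
}
```

Going by the excerpt above, any header this reports that is in neither the reserved nor the excluded list would change the version string on every crawl.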


On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <gustavo.beneitez@gmail.com> wrote:
Thanks Karl,

I've launched the job a couple of times with a small set of documents, and what I see is that Elasticsearch reindexes every document on every run, even though the size of each document is always the same and I don't notice any dynamic HTML content, such as the current time, that could cause the checksum to differ.

The "Simple history" menu option shows that the Elasticsearch output connector is called:
"08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"

So I guess there is a misconfiguration somewhere...

On Thu, Aug 23, 2018 at 1:45 AM Karl Wright <daddywri@gmail.com> wrote:
Hi Gustavo,

I take it from your question that you are using the Web Connector?

All connectors create a version string that is used to determine whether content needs to be reindexed or not.  The Web Connector's version string uses a checksum of the page contents; we found the "last modified" header to be unreliable, if I recall correctly.
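
As a rough illustration of the idea (not the connector's actual code), a checksum-based version check could look like the sketch below. The class and method names are made up, and SHA-256 is only an assumption chosen for the example; this thread does not say which checksum the Web Connector uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class VersionCheck {
    // Builds a version string as the hex-encoded SHA-256 digest of the page content.
    public static String versionOf(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : digest.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // A document is reindexed when no version was stored yet or the version changed.
    public static boolean needsReindex(String storedVersion, String currentVersion) {
        return storedVersion == null || !storedVersion.equals(currentVersion);
    }

    public static void main(String[] args) {
        // Hypothetical page bodies from two crawls of the same URL.
        byte[] firstCrawl = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] secondCrawl = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);

        String stored = versionOf(firstCrawl);
        String current = versionOf(secondCrawl);
        System.out.println("needs reindex: " + needsReindex(stored, current));
    }
}
```

With a scheme like this, a single byte that differs between crawls (a timestamp, a session token) is enough to produce a new version string and trigger reindexing.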


On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <gustavo.beneitez@gmail.com> wrote:
Hi everyone,

I am currently creating a job that indexes part of a Liferay intranet's content.
Every time the job is executed, the documents are fully reindexed in Elasticsearch, even though they didn't change.
I thought I had read somewhere that the crawler uses the "Last-Modified" HTTP header, but also that it saves a hash in the database.
I looked in the user's manual to find out which one applies, but had no luck, so could you please tell me which is the correct one?

Thanks in advance!
