manifoldcf-user mailing list archives

From "Holl, Konrad" <konrad.h...@accenture.com>
Subject RE: [External] Re: Documents that didn't change are reindexed
Date Thu, 23 Aug 2018 12:24:20 GMT
Did you check the "modified" header returned with the documents from Liferay? Some systems
tend to always use "now", which could explain the behavior (this might even be a configuration
option). You can see this in a browser's debug window when you reload the page a couple of
times (Ctrl+F5 to force reloading).
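If a browser debug window isn't handy, the same check can be scripted. Below is a minimal, self-contained Java sketch (the class name `LastModifiedCheck` and the embedded toy server are hypothetical, purely for illustration): the toy server stamps every response with "now", the way some systems do, so two fetches a second apart report different Last-Modified values even though the body never changes.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class LastModifiedCheck {

    // Returns true if two fetches a second apart report the same Last-Modified.
    static boolean lastModifiedStable(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
        String first = client.send(req, HttpResponse.BodyHandlers.ofString())
                .headers().firstValue("Last-Modified").orElse("");
        Thread.sleep(1100); // RFC 1123 timestamps have one-second resolution
        String second = client.send(req, HttpResponse.BodyHandlers.ofString())
                .headers().firstValue("Last-Modified").orElse("");
        return first.equals(second);
    }

    public static void main(String[] args) throws Exception {
        // Toy server that stamps every response with "now",
        // even though the body never changes.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/page", ex -> {
            String now = DateTimeFormatter.RFC_1123_DATE_TIME
                    .format(ZonedDateTime.now(ZoneOffset.UTC));
            ex.getResponseHeaders().set("Last-Modified", now);
            byte[] body = "unchanged content".getBytes();
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        server.start();
        String url = "http://localhost:" + server.getAddress().getPort() + "/page";
        System.out.println(lastModifiedStable(url)
                ? "Last-Modified is stable"
                : "Last-Modified changes on every fetch");
        server.stop(0);
    }
}
```

Point `lastModifiedStable` at the real Liferay URL instead of the toy server to test the actual site.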


-Konrad

________________________________
From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, 23 August 2018 14:18
To: user@manifoldcf.apache.org
Subject: [External] Re: Documents that didn't change are reindexed

I would suggest downloading the pages with curl a couple of times and comparing the content.
Headers also matter.  Here's the connector code that builds the version string:

>>>>>>
            // Calculate version from document data, which is presumed to be present.
            StringBuilder sb = new StringBuilder();

            // Acls
            packList(sb,acls,'+');
            if (acls.length > 0)
            {
              sb.append('+');
              pack(sb,defaultAuthorityDenyToken,'+');
            }
            else
              sb.append('-');

            // Now, do the metadata.
            Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();

            String[] fixedListStrings = new String[2];
            // They're all folded into the same part of the version string.
            int headerCount = 0;
            Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
                headerCount += fetchStatus.headerData.get(headerName).size();
            }
            String[] fullMetadata = new String[headerCount];
            headerCount = 0;
            headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
              {
                Set<String> valueSet = metaHash.get(headerName);
                if (valueSet == null)
                {
                  valueSet = new HashSet<String>();
                  metaHash.put(headerName,valueSet);
                }
                List<String> headerValues = fetchStatus.headerData.get(headerName);
                for (String headerValue : headerValues)
                {
                  valueSet.add(headerValue);
                  fixedListStrings[0] = "header-"+headerName;
                  fixedListStrings[1] = headerValue;
                  StringBuilder newsb = new StringBuilder();
                  packFixedList(newsb,fixedListStrings,'=');
                  fullMetadata[headerCount++] = newsb.toString();
                }
              }
            }
            java.util.Arrays.sort(fullMetadata);

            packList(sb,fullMetadata,'+');
            // Done with the parseable part!  Add the checksum.
            sb.append(fetchStatus.checkSum);
            // Add the filter version
            sb.append("+");
            sb.append(filterVersion);

            String versionString = sb.toString();
<<<<<<

The "filter version" comes from your job specification and will change only if you change
the job specification; everything else should be self-explanatory.  Note that every header
not in the reserved or excluded sets goes into the version string, so a header that varies
between fetches could explain what you're seeing.
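For illustration, the effect can be reproduced offline. This is a hypothetical, heavily simplified stand-in for the logic above (the class `VersionSketch` and method `version` are invented names; the real code uses packList/packFixedList and also folds in ACLs and the filter version): sorted "header-name=value" entries plus a body checksum. The same body with one differing header yields a different version string, so the document gets reindexed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class VersionSketch {

    // Simplified version string: sorted "header-<name>=<value>" entries
    // joined with '+', followed by an MD5 checksum of the body.
    static String version(Map<String, String> headers, String body) throws Exception {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : headers.entrySet())
            parts.add("header-" + e.getKey().toLowerCase(Locale.ROOT) + "=" + e.getValue());
        Collections.sort(parts);
        byte[] sum = MessageDigest.getInstance("MD5")
                .digest(body.getBytes(StandardCharsets.UTF_8));
        return String.join("+", parts) + "+" + Base64.getEncoder().encodeToString(sum);
    }

    public static void main(String[] args) throws Exception {
        String body = "<html>identical content</html>";
        Map<String, String> first = Map.of(
                "Content-Language", "en",
                "Last-Modified", "Thu, 23 Aug 2018 06:00:00 GMT");
        Map<String, String> second = Map.of(
                "Content-Language", "en",
                "Last-Modified", "Thu, 23 Aug 2018 07:00:00 GMT");
        // Same body, but one header differs -> version strings differ,
        // so the document would be queued for reindexing.
        System.out.println(version(first, body).equals(version(second, body))); // false
    }
}
```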

Karl


On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <gustavo.beneitez@gmail.com> wrote:
Thanks Karl,

I've been running the job a couple of times with a small set of documents, and what I see
is that Elasticsearch reindexes every document each time, even though the size of each document
is always the same and I don't notice any dynamic HTML content, like the current time, that
could cause the checksum to differ.

Consulting the "Simple History" menu option shows that the Elasticsearch output connector is called on every run:
"08-23-2018 06:27:19.274        Indexation (Elasticsearch 2.4.6)"

So I guess there is a misconfiguration somewhere...



On Thu, Aug 23, 2018 at 1:45 AM, Karl Wright (<daddywri@gmail.com>) wrote:
Hi Gustavo,

I take it from your question that you are using the Web Connector?

All connectors create a version string that is used to determine whether content needs to
be reindexed or not.  The Web Connector's version string uses a checksum of the page contents;
we found the "last modified" header to be unreliable, if I recall correctly.

Thanks,
Karl


On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <gustavo.beneitez@gmail.com> wrote:
Hi everyone,

I am currently creating a job that indexes part of Liferay intranet content.
Every time the job is executed, the documents are fully reindexed in Elasticsearch, even though
they didn't change.
I thought I had read somewhere that the crawler uses the "Last-Modified" HTTP header, but also
that it saves a hash into the database.
I looked for the answer in the user's manual with no luck, so could you please tell me which
one it actually uses?

Thanks in advance!

