manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Documents that didn't change are reindexed
Date Thu, 23 Aug 2018 12:18:15 GMT
I would suggest downloading the pages using curl a couple of times and
comparing content.
Headers also matter.  Here's the code:

>>>>>>
            // Calculate version from document data, which is presumed to be present.
            StringBuilder sb = new StringBuilder();

            // Acls
            packList(sb,acls,'+');
            if (acls.length > 0)
            {
              sb.append('+');
              pack(sb,defaultAuthorityDenyToken,'+');
            }
            else
              sb.append('-');

            // Now, do the metadata.
            Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();

            String[] fixedListStrings = new String[2];
            // They're all folded into the same part of the version string.
            int headerCount = 0;
            Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) &&
                  !excludedHeaders.contains(lowerHeaderName))
                headerCount += fetchStatus.headerData.get(headerName).size();
            }
            String[] fullMetadata = new String[headerCount];
            headerCount = 0;
            headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) &&
                  !excludedHeaders.contains(lowerHeaderName))
              {
                Set<String> valueSet = metaHash.get(headerName);
                if (valueSet == null)
                {
                  valueSet = new HashSet<String>();
                  metaHash.put(headerName,valueSet);
                }
                List<String> headerValues = fetchStatus.headerData.get(headerName);
                for (String headerValue : headerValues)
                {
                  valueSet.add(headerValue);
                  fixedListStrings[0] = "header-"+headerName;
                  fixedListStrings[1] = headerValue;
                  StringBuilder newsb = new StringBuilder();
                  packFixedList(newsb,fixedListStrings,'=');
                  fullMetadata[headerCount++] = newsb.toString();
                }
              }
            }
            java.util.Arrays.sort(fullMetadata);

            packList(sb,fullMetadata,'+');
            // Done with the parseable part!  Add the checksum.
            sb.append(fetchStatus.checkSum);
            // Add the filter version
            sb.append("+");
            sb.append(filterVersion);

            String versionString = sb.toString();
<<<<<<

The "filter version" comes from your job specification and will change only
if you change the job specification, but everything else should be
self-explanatory.  Looks like every header that is not in the reserved or
excluded lists matters, so a header whose value changes between fetches
could explain it.
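
To narrow down which header is responsible, a comparison along the following
lines may help. This is only a rough sketch, not ManifoldCF code: the class
HeaderDiff and its method changedHeaders are hypothetical names, the header
maps are assumed to come from two separate fetches of the same URL (for
example, transcribed from curl output), and, unlike the connector code above,
it lower-cases header names before comparing them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of ManifoldCF.
class HeaderDiff {

  // Returns the lower-cased names of headers whose values differ between two
  // fetches, skipping any name listed in 'excluded' (also lower-cased).
  static List<String> changedHeaders(Map<String, List<String>> first,
                                     Map<String, List<String>> second,
                                     Set<String> excluded) {
    Map<String, List<String>> a = normalize(first, excluded);
    Map<String, List<String>> b = normalize(second, excluded);
    // TreeSet gives a stable, alphabetical report order.
    Set<String> names = new TreeSet<>(a.keySet());
    names.addAll(b.keySet());
    List<String> changed = new ArrayList<>();
    for (String name : names) {
      List<String> av = a.getOrDefault(name, Collections.emptyList());
      List<String> bv = b.getOrDefault(name, Collections.emptyList());
      if (!av.equals(bv)) {
        changed.add(name);
      }
    }
    return changed;
  }

  // Lower-cases header names, drops excluded ones, and sorts the values so
  // that value order alone does not count as a change.
  private static Map<String, List<String>> normalize(Map<String, List<String>> headers,
                                                     Set<String> excluded) {
    Map<String, List<String>> out = new HashMap<>();
    for (Map.Entry<String, List<String>> e : headers.entrySet()) {
      if (e.getKey() == null) {
        continue; // some HTTP clients report the status line under a null key
      }
      String lower = e.getKey().toLowerCase(Locale.ROOT);
      if (excluded.contains(lower)) {
        continue;
      }
      out.computeIfAbsent(lower, k -> new ArrayList<>()).addAll(e.getValue());
    }
    for (List<String> values : out.values()) {
      Collections.sort(values);
    }
    return out;
  }

  public static void main(String[] args) {
    // Made-up header values for illustration only.
    Map<String, List<String>> first = new HashMap<>();
    first.put("Content-Type", Arrays.asList("text/html"));
    first.put("Date", Arrays.asList("Thu, 23 Aug 2018 06:27:19 GMT"));
    first.put("Set-Cookie", Arrays.asList("JSESSIONID=abc"));

    Map<String, List<String>> second = new HashMap<>();
    second.put("Content-Type", Arrays.asList("text/html"));
    second.put("Date", Arrays.asList("Thu, 23 Aug 2018 06:30:02 GMT"));
    second.put("Set-Cookie", Arrays.asList("JSESSIONID=def"));

    // prints [date, set-cookie]
    System.out.println(changedHeaders(first, second, new HashSet<>()));
  }
}
```

Any header name this reports is a candidate cause of the version string
changing between crawls, provided it is not already in the connector's
reserved or excluded lists.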

Karl


On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <gustavo.beneitez@gmail.com>
wrote:

> Thanks Karl,
>
> I've launched the job a couple of times with a small set of documents, and
> what I see is that Elasticsearch indexes every document each time, even
> though the size of each document is always the same and I don't notice any
> dynamic HTML content, such as the current time, that could cause the
> checksum to differ.
>
> The "Simple history" menu option shows that the Elasticsearch output
> connector is called:
> "08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
> So I guess there is a misconfiguration somewhere...
>
>
>
> On Thu, Aug 23, 2018 at 1:45, Karl Wright (<daddywri@gmail.com>)
> wrote:
>
>> Hi Gustavo,
>>
>> I take it from your question that you are using the Web Connector?
>>
>> All connectors create a version string that is used to determine whether
>> content needs to be reindexed or not.  The Web Connector's version string
>> uses a checksum of the page contents; we found the "last modified" header
>> to be unreliable, if I recall correctly.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <
>> gustavo.beneitez@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I am currently creating a job that indexes part of a Liferay intranet's
>>> content.
>>> Every time the job is executed, the documents are fully reindexed in
>>> Elasticsearch, even when they have not changed.
>>> I thought I had read somewhere that the crawler uses the "last-modified"
>>> HTTP header, but also that it saves a hash in the database.
>>> I looked for the answer in the user's manual but had no luck, so could
>>> you please tell me which one is actually used?
>>>
>>> Thanks in advance!
>>>
>>
