manifoldcf-user mailing list archives

From Gustavo Beneitez <gustavo.benei...@gmail.com>
Subject Re: [External] Re: Documents that didn't change are reindexed
Date Thu, 23 Aug 2018 12:33:41 GMT
Hi,

thanks everyone.

@Karl, many thanks I am going to write a little test and see what happens.

@Konrad, yes, you are right, I think Liferay is generating something that
might confuse the crawler. Let me write the test and see what it is.

Thanks!

On Thu, Aug 23, 2018 at 2:24 PM, Holl, Konrad (<konrad.holl@accenture.com>)
wrote:

> Did you check the "modified" header returned with the documents from
> Liferay? Some systems tend to always use "now", which could explain the
> behavior (this might even be a configuration option). You can see this in a
> browser's debug window when you reload the page a couple of times (Ctrl+F5
> to force reloading).
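Konrad's check can also be scripted. The sketch below is illustrative only: it fetches a page twice with HEAD requests and compares the `Last-Modified` values; the class and method names, and the idea of passing the URL on the command line, are my own, not part of ManifoldCF or Liferay. A server that always stamps the current time will show two different values.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class LastModifiedCheck {
    // True when two successive Last-Modified values suggest the server
    // stamps every response with "now" (missing or differing values).
    static boolean looksDynamic(String first, String second) {
        return first == null || second == null || !first.equals(second);
    }

    // Issue a HEAD request and return the Last-Modified header, if any.
    static String fetchLastModified(String page) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return conn.getHeaderField("Last-Modified");
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java LastModifiedCheck <url>");
            return;
        }
        String a = fetchLastModified(args[0]);
        Thread.sleep(2000); // space the requests so a "now" timestamp would differ
        String b = fetchLastModified(args[0]);
        System.out.println(a + " / " + b + " -> always-now? " + looksDynamic(a, b));
    }
}
```

Running it twice against a suspect Liferay page should make a "modified = now" configuration obvious immediately.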
>
>
> -Konrad
>
> ------------------------------
> *From:* Karl Wright <daddywri@gmail.com>
> *Sent:* Thursday, August 23, 2018 14:18
> *To:* user@manifoldcf.apache.org
> *Subject:* [External] Re: Documents that didn't change are reindexed
>
> I would suggest downloading the pages using curl a couple of times and
> comparing content.
> Headers also matter.  Here's the code:
>
> >>>>>>
>             // Calculate version from document data, which is presumed to be present.
>             StringBuilder sb = new StringBuilder();
>
>             // Acls
>             packList(sb,acls,'+');
>             if (acls.length > 0)
>             {
>               sb.append('+');
>               pack(sb,defaultAuthorityDenyToken,'+');
>             }
>             else
>               sb.append('-');
>
>             // Now, do the metadata.
>             Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();
>
>             String[] fixedListStrings = new String[2];
>             // They're all folded into the same part of the version string.
>             int headerCount = 0;
>             Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
>             while (headerIterator.hasNext())
>             {
>               String headerName = headerIterator.next();
>               String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
>               if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
>                 headerCount += fetchStatus.headerData.get(headerName).size();
>             }
>             String[] fullMetadata = new String[headerCount];
>             headerCount = 0;
>             headerIterator = fetchStatus.headerData.keySet().iterator();
>             while (headerIterator.hasNext())
>             {
>               String headerName = headerIterator.next();
>               String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
>               if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
>               {
>                 Set<String> valueSet = metaHash.get(headerName);
>                 if (valueSet == null)
>                 {
>                   valueSet = new HashSet<String>();
>                   metaHash.put(headerName,valueSet);
>                 }
>                 List<String> headerValues = fetchStatus.headerData.get(headerName);
>                 for (String headerValue : headerValues)
>                 {
>                   valueSet.add(headerValue);
>                   fixedListStrings[0] = "header-"+headerName;
>                   fixedListStrings[1] = headerValue;
>                   StringBuilder newsb = new StringBuilder();
>                   packFixedList(newsb,fixedListStrings,'=');
>                   fullMetadata[headerCount++] = newsb.toString();
>                 }
>               }
>             }
>             java.util.Arrays.sort(fullMetadata);
>
>             packList(sb,fullMetadata,'+');
>             // Done with the parseable part!  Add the checksum.
>             sb.append(fetchStatus.checkSum);
>             // Add the filter version
>             sb.append("+");
>             sb.append(filterVersion);
>
>             String versionString = sb.toString();
> <<<<<<
>
> The "filter version" comes from your job specification and will change
> only if you change the job specification, but everything else should be
> self-explanatory.  Looks like all non-reserved, non-excluded headers go
> into the version string, so a header that varies between fetches could
> explain it.
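To see why a single varying header forces a reindex, here is a much-simplified stand-in for the logic above. It is not the connector's real `pack`/`packList` escaping; the class name, the plain `+`/`=` joining, and the sample header values are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class VersionSketch {
    // Simplified sketch of the connector's version-string idea:
    // sort the surviving header name/value pairs, join them, and
    // append the content checksum. If any part differs from the
    // previous crawl, the document is reindexed.
    static String versionString(Map<String, List<String>> headers,
                                Set<String> excluded, String checksum) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            if (excluded.contains(e.getKey().toLowerCase(Locale.ROOT)))
                continue; // excluded headers never reach the version string
            for (String v : e.getValue())
                parts.add("header-" + e.getKey() + "=" + v);
        }
        Collections.sort(parts); // order-independent, like Arrays.sort above
        return String.join("+", parts) + "+" + checksum;
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("date", "set-cookie"));
        Map<String, List<String>> crawl1 = new HashMap<>();
        crawl1.put("Content-Type", Arrays.asList("text/html"));
        crawl1.put("Set-Cookie", Arrays.asList("JSESSIONID=abc"));
        Map<String, List<String>> crawl2 = new HashMap<>();
        crawl2.put("Content-Type", Arrays.asList("text/html"));
        crawl2.put("Set-Cookie", Arrays.asList("JSESSIONID=xyz")); // changes every fetch
        String checksum = "cafebabe"; // identical page content
        // With Set-Cookie excluded, identical content yields identical versions.
        System.out.println(versionString(crawl1, excluded, checksum)
            .equals(versionString(crawl2, excluded, checksum))); // true
    }
}
```

The same call with an empty excluded set would yield two different version strings, and therefore a reindex, even though the page body never changed.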
>
> Karl
>
>
> On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <
> gustavo.beneitez@gmail.com> wrote:
>
> Thanks Karl,
>
> I've been launching the job a couple of times with a small set of
> documents, and what I see is that Elastic reindexes every document each
> time, even though the size of each document is always the same and I don't
> notice any dynamic HTML content (like the current time) that could cause
> the checksum to be different.
>
> Consulting the "Simple history" menu option shows that the Elastic output
> connector is invoked every time:
> "08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
> So I guess there is a misconfiguration somewhere...
>
>
>
> On Thu, Aug 23, 2018 at 1:45 AM, Karl Wright (<daddywri@gmail.com>)
> wrote:
>
> Hi Gustavo,
>
> I take it from your question that you are using the Web Connector?
>
> All connectors create a version string that is used to determine whether
> content needs to be reindexed or not.  The Web Connector's version string
> uses a checksum of the page contents; we found the "last modified" header
> to be unreliable, if I recall correctly.
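One consequence of the checksum approach: dynamic content a reader never *sees* (an HTML comment, a CSRF token, a render timestamp) still changes the checksum and triggers a reindex. A minimal sketch, with invented page bodies and an invented class name, just to show the mechanism:

```java
import java.security.MessageDigest;

public class ChecksumCheck {
    // Hex MD5 of a page body, standing in for the connector's content checksum.
    static String md5(String body) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(body.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest)
            sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two fetches of the "same" page; an invisible dynamic fragment
        // (here a rendering timestamp in a comment) changes the checksum.
        String fetch1 = "<html><body>doc</body><!-- rendered 12:00:00 --></html>";
        String fetch2 = "<html><body>doc</body><!-- rendered 12:00:05 --></html>";
        System.out.println(md5(fetch1).equals(md5(fetch2))); // false: reindexed
    }
}
```

Saving two curl downloads of the same page and diffing them is the quickest way to spot such a fragment.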
>
> Thanks,
> Karl
>
>
> On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <
> gustavo.beneitez@gmail.com> wrote:
>
> Hi everyone,
>
> I am currently creating a job that indexes part of Liferay intranet
> content.
> Every time the job is executed, the documents are fully reindexed in
> Elastic, even though they didn't change.
> I thought I had read somewhere that the crawler uses the "Last-Modified"
> HTTP header, but also that it saves a hash into the database.
> I looked for the answer in the user's manual with no luck, so could you
> please tell me which mechanism is the correct one?
>
> Thanks in advance!
>
>
