manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: PostgreSQL version to support MCF v2.10
Date Tue, 04 Sep 2018 20:17:17 GMT
Hi Steph,

Right, you wouldn't want to touch the framework.

The effect of lower-casing the documentURI parameter in the
addOrReplaceDocumentWithException method in an output connector would be to
map multiple, independently-fetched, documents that differ only by the case
of the URL together into one document in the index.  The ManifoldCF
assumption is that a document with a certain URI can be tracked in the
index using exactly that URI.  Mapping the URI to lower case would break
that assumption so the framework would make the wrong decision in many
cases.
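
As a rough illustration of that failure mode, here is a toy Java model. It
is not ManifoldCF code: the Framework and LowerCasingOutput classes and the
example.com URLs are made up for this sketch, and the real framework's
bookkeeping is far more involved. It only shows how per-URI tracking and a
lower-casing output connector drift apart.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model only: none of these classes are ManifoldCF APIs. */
final class LowerCaseCollisionDemo {

    /** Stands in for the framework's bookkeeping, keyed by the exact document URI. */
    static final class Framework {
        final Map<String, String> lastIndexedVersion = new HashMap<>();
    }

    /** Stands in for an output connector that lower-cases the URI to build the index ID. */
    static final class LowerCasingOutput {
        final Map<String, String> index = new HashMap<>();

        void addOrReplace(String documentURI, String content) {
            index.put(documentURI.toLowerCase(Locale.ROOT), content);
        }

        void remove(String documentURI) {
            index.remove(documentURI.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns {documents the framework believes are indexed, documents actually in the index}. */
    static int[] simulate() {
        Framework framework = new Framework();
        LowerCasingOutput output = new LowerCasingOutput();

        // Two independently fetched documents whose URLs differ only by case.
        String a = "http://example.com/Docs/Page.aspx";
        String b = "http://example.com/docs/page.aspx";

        output.addOrReplace(a, "content A");
        framework.lastIndexedVersion.put(a, "v1");
        output.addOrReplace(b, "content B"); // silently overwrites the entry written for a
        framework.lastIndexedVersion.put(b, "v1");

        // Document a later disappears from the repository, so it is removed by its URI.
        output.remove(a); // this also deletes the only index entry that b relied on
        framework.lastIndexedVersion.remove(a);

        return new int[] { framework.lastIndexedVersion.size(), output.index.size() };
    }

    public static void main(String[] args) {
        int[] result = simulate();
        System.out.println("tracked by framework: " + result[0]
            + ", present in index: " + result[1]);
    }
}
```

In this model the bookkeeping still lists one document as indexed while the
index itself is empty, which is the kind of wrong decision described above.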

If you are picking up documents using the web connector, and you are
getting duplicate documents because the document URLs are sloppy, it is
essential that INSTEAD of mapping the document URI to lower case in the
output connector, you map to lower case in the repository connector.
Otherwise the framework will not work right.

There is a tab in the web connector that allows you to configure URL
normalization, called "Canonicalization".  This would be a very appropriate
place to add URL mapping to lower case.  It should be as simple as adding
one more checkbox column in the table, and modifying the method that does
the URL processing to include lower-casing.
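
A minimal sketch of what such a lower-casing step could look like, using
only java.net.URI. The UrlCanonicalizerSketch class, the canonicalize method
and the lowerCasePath flag (standing in for the new checkbox) are
hypothetical names chosen for this example. They are not the web connector's
actual code, which would need to fit into its existing canonicalization
logic.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Illustrative only: this class is hypothetical and is not part of the web connector. */
final class UrlCanonicalizerSketch {

    /**
     * Lower-cases the scheme and host (both case-insensitive in URLs) and,
     * when lowerCasePath is true, the path as well. The query string is left
     * untouched and any fragment is dropped. Percent-encoding edge cases are
     * not handled in this sketch.
     */
    static String canonicalize(String rawUrl, boolean lowerCasePath) {
        URI uri = URI.create(rawUrl);
        String scheme = lower(uri.getScheme());
        String host = lower(uri.getHost());
        String path = lowerCasePath ? lower(uri.getPath()) : uri.getPath();
        try {
            return new URI(scheme, uri.getUserInfo(), host, uri.getPort(),
                           path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot canonicalize " + rawUrl, e);
        }
    }

    private static String lower(String s) {
        return s == null ? null : s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // With the flag on, case variants of the same IIS URL collapse to one ID.
        System.out.println(canonicalize("HTTP://Example.COM/Docs/Page.aspx?Id=1", true));
        System.out.println(canonicalize("http://example.com/docs/PAGE.aspx?Id=1", true));
        // With the flag off, only the scheme and host are normalized.
        System.out.println(canonicalize("HTTP://Example.COM/Docs/Page.aspx?Id=1", false));
    }
}
```

Because the mapping happens before the document ID is handed to the
framework, every later step (tracking, incremental indexing, output) sees
one consistent ID per document. Whether the query string should also be
lower-cased depends on the site being crawled, so the sketch leaves it
alone.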

Karl



On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net> wrote:

> Unless I have a massive misunderstanding somewhere...
>
>
>
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896    steph@remcam.net   http://remcam.net
> Skype: svanschalkwyk
> <http://linkedin.com/in/vanschalkwyk>
>
> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
> wrote:
>
>> Hi Karl
>> I'm addressing it in the ES Output Connector.
>> Not touching the framework :)
>> S
>>
>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Let's make sure we're talking about the same thing.
>>>
>>> Here is the output connector method that receives the ID (as the
>>> documentURI parameter):
>>>
>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>> VersionContext pipelineDescription, RepositoryDocument document, String
>>> authorityNameString, IOutputAddActivity activities)
>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>
>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If
>>> you make it case insensitive in an output connector, this will potentially
>>> break a lot of things, for example incremental indexing (which organizes
>>> the last indexed version by document ID).
>>>
>>> I therefore highly recommend that any "sloppiness" in this parameter be
>>> addressed in the Repository Connector that constructs it.  If the connector
>>> is crawling a repository that believes that URLs are case insensitive then
>>> it should map these IDs to lower case.  If not, then it shouldn't.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>>> wrote:
>>>
>>>> Hi Karl.
>>>> The issue is that the ES Output Connector uses the URI to create the
>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>> creates multiple documents. Clients on Windows IIS are rarely cognizant
>>>> of that issue, as IIS is so lax in policing that OTB.
>>>> Currently, every case variation in the URI results in a new doc in the
>>>> index. This is only in the ES output connector.
>>>> I can add an optional checkbox to enable that particular action,
>>>> if that would help?
>>>> Regards,
>>>> Steph
>>>>
>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the update.
>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>> that care about case.  The web connector is one such because it's up
>>>>> to the web service to decide if case matters, so the web connector
>>>>> does not view URLs with case differences as being the same.  Other
>>>>> connectors will likely care as well.  So I don't think lower-casing
>>>>> the document ID is a smart thing to do.
>>>>>
>>>>> You could add this bit of configuration to the web connector, if
>>>>> that's what you are using, or to whatever other connector constructs
>>>>> the ID.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net>
>>>>> wrote:
>>>>>
>>>>>> Thanks Karl.
>>>>>>
>>>>>> I'll look into that.
>>>>>>
>>>>>> Another note:
>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>> should probably diff them for inclusion after approval:
>>>>>> 1. Lowercased _id (the doc URI).
>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>> particularly IIS...)
>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does
>>>>>> not allow access to _id in the schema anymore, so no copy_field etc.
>>>>>> from _id). Hence "url".
>>>>>>
>>>>>> Regards,
>>>>>> Steph
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>>>>> need to upgrade it.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <
>>>>>>> steph@remcam.net> wrote:
>>>>>>>
>>>>>>>> Olivier
>>>>>>>> By all means.
>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>>>>> has to be restarted about once a week. Still trying to find the
>>>>>>>> issue.
>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10
>>>>>>>> may be a bit slower. I have no empirical evidence at the moment as
>>>>>>>> I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>> Regards,
>>>>>>>> Steph
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>>> olivier.tavard@francelabs.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for
>>>>>>>>> the late answer). I will test it soon.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Olivier TAVARD
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These are the rpm installs:
>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>
>>>>>>>>> postgresql_version: 10
>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>> postgresql_packages:
>>>>>>>>>   - postgresql10-libs
>>>>>>>>>   - postgresql10
>>>>>>>>>   - postgresql10-server
>>>>>>>>>   - postgresql10-contrib
>>>>>>>>> #  - postgresql10-devel
>>>>>>>>>
>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>
>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>
>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>     value: 'on'
>>>>>>>>>
>>>>>>>>>   - option: shared_buffers
>>>>>>>>>     value: '1024MB'
>>>>>>>>>
>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>   - option: max_wal_size
>>>>>>>>>     value: '14400MB'
>>>>>>>>>
>>>>>>>>>   - option: min_wal_size
>>>>>>>>>     value: '80MB'
>>>>>>>>>
>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>     value: '2MB'
>>>>>>>>>
>>>>>>>>>   - option: listen_addresses
>>>>>>>>>     value: '*'
>>>>>>>>>
>>>>>>>>>   - option: max_connections
>>>>>>>>>     value: '400'
>>>>>>>>>
>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>     value: '900'
>>>>>>>>>
>>>>>>>>>   - option: datestyle
>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>
>>>>>>>>>   - option: autovacuum
>>>>>>>>>     value: 'off'
>>>>>>>>>
>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>   cron:
>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>     hour: 8
>>>>>>>>>     minute: 0
>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>   cron:
>>>>>>>>>     name: full_vacuum
>>>>>>>>>     weekday: 0
>>>>>>>>>     hour: 10
>>>>>>>>>     minute: 0
>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>> # re-index all databases once a week
>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>   cron:
>>>>>>>>>     name: reindex
>>>>>>>>>     weekday: 0
>>>>>>>>>     hour: 12
>>>>>>>>>     minute: 0
>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is how I run 2.10.
>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>> @Karl: Any comments please?
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>
