manifoldcf-user mailing list archives

From Steph van Schalkwyk <st...@remcam.net>
Subject Re: PostgreSQL version to support MCF v2.10
Date Tue, 04 Sep 2018 18:42:27 GMT
Hi Karl
I'm addressing it in the ES Output Connector.
Not touching the framework :)
S



*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    steph@remcam.net   http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com> wrote:

> Let's make sure we're talking about the same thing.
>
> Here is the output connector method that receives the ID (as the
> documentURI parameter):
>
>   public int addOrReplaceDocumentWithException(String documentURI,
> VersionContext pipelineDescription, RepositoryDocument document, String
> authorityNameString, IOutputAddActivity activities)
>     throws ManifoldCFException, ServiceInterruption, IOException;
>
> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If you
> make it case insensitive in an output connector, this will potentially
> break a lot of things, for example incremental indexing (which organizes
> the last indexed version by document ID).
>
> I therefore highly recommend that any "sloppiness" in this parameter be
> addressed in the Repository Connector that constructs it.  If the connector
> is crawling a repository that believes that URLs are case insensitive then
> it should map these IDs to lower case.  If not, then it shouldn't.
>
> Karl
>
>
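A minimal sketch of the repository-side normalization Karl recommends above. The class and method names are hypothetical and not part of the ManifoldCF API; the boolean flag stands in for whatever configuration a repository connector would use to record that its source treats URLs as case insensitive.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not a ManifoldCF class. */
final class DocumentIdNormalizer {
  private DocumentIdNormalizer() {}

  /**
   * Lower-cases the document identifier only when the crawled repository
   * is known to treat URLs as case insensitive (assumed flag).
   * Otherwise the identifier is returned unchanged.
   */
  static String normalize(String documentId, boolean repositoryIsCaseInsensitive) {
    if (documentId == null || !repositoryIsCaseInsensitive) {
      return documentId;
    }
    // Locale.ROOT avoids locale-specific case mappings (for example Turkish dotless i).
    return documentId.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(normalize("http://host/Docs/Page.ASPX", true));
    System.out.println(normalize("http://host/Docs/Page.ASPX", false));
  }
}
```

Doing this where the ID is constructed means the framework and every output connector see the same identifier, so nothing downstream has to guess about case.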
> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
> wrote:
>
>> Hi Karl.
>> The issue is that the ES Output Connector uses the uri to create the _id.
>> When used with IIS, which allows case variation in the URI, it creates
>> multiple documents. Clients on Windows IIS are rarely cognizant of that
>> issue, as IIS is so lax in policing it out of the box.
>> Currently, every case variation in the URI results in a new doc in the index.
>> This is only in the ES output connector.
>> I can add an optional checkbox to control that behavior, if
>> that would help?
>> Regards,
>> Steph
>>
>>
>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Thanks for the update.
>>> Lower-casing the ID would be fine except there are some connectors that
>>> care about case.  The web connector is one such because it's up to the web
>>> service to decide if case matters, so the web connector does not view URLs
>>> with case differences as being the same.  Other connectors will likely
>>> care as well.  So I don't think lower-casing the document ID is a smart
>>> thing to do.
>>>
>>> You could add this bit of configuration to the web connector, if that's
>>> what you are using, or to whatever other connector constructs the ID.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net>
>>> wrote:
>>>
>>>> Thanks Karl.
>>>>
>>>> I'll look into that.
>>>>
>>>> Another note:
>>>> Regarding the ES connector - I have made three changes to it and should
>>>> probably diff them for inclusion after approval:
>>>> 1. Lowercased the _id (the doc URI).
>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy sources,
>>>> particularly IIS...)
>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does not
>>>> allow access to _id in the schema anymore, so no copy_field etc. from _id).
>>>> Hence "url".
>>>>
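Item 2 in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the actual ES connector change: the class and method names are hypothetical, and the sketch assumes the `://` after the scheme and anything after `?` or `#` should be left untouched.

```java
/** Hypothetical helper for illustration only; not the ES connector's code. */
final class UriSlashCollapser {
  private UriSlashCollapser() {}

  /** Collapses runs of '/' in the host/path part of a URI into a single '/'. */
  static String collapseSlashes(String uri) {
    // Keep the scheme separator ("http://") intact.
    int schemeEnd = uri.indexOf("://");
    int start = schemeEnd < 0 ? 0 : schemeEnd + 3;
    String prefix = uri.substring(0, start);
    String rest = uri.substring(start);

    // Leave the query string and fragment as they are.
    int cut = rest.length();
    int query = rest.indexOf('?');
    if (query >= 0) {
      cut = Math.min(cut, query);
    }
    int fragment = rest.indexOf('#');
    if (fragment >= 0) {
      cut = Math.min(cut, fragment);
    }

    String path = rest.substring(0, cut).replaceAll("/{2,}", "/");
    return prefix + path + rest.substring(cut);
  }

  public static void main(String[] args) {
    System.out.println(collapseSlashes("http://host//docs///page.html"));
  }
}
```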
>>>> Regards,
>>>> Steph
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>>> need to upgrade it.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <steph@remcam.net>
>>>>> wrote:
>>>>>
>>>>>> Olivier
>>>>>> By all means.
>>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>>> has to be restarted about once a week. Still trying to find the issue.
>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres 10 may
>>>>>> be a bit slower. I have no empirical evidence at the moment as I'm still
>>>>>> delivering the project to UAT. Will keep you posted.
>>>>>> Regards,
>>>>>> Steph
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>> olivier.tavard@francelabs.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for
>>>>>>> the late answer). I will test it soon.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>>
>>>>>>> Olivier TAVARD
>>>>>>>
>>>>>>>
>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> These are the rpm installs:
>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>
>>>>>>> postgresql_version: 10
>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>> postgresql_packages:
>>>>>>>   - postgresql10-libs
>>>>>>>   - postgresql10
>>>>>>>   - postgresql10-server
>>>>>>>   - postgresql10-contrib
>>>>>>>   # - postgresql10-devel
>>>>>>>
>>>>>>> postgresql_hba_entries:
>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>
>>>>>>> postgresql_global_config_options:
>>>>>>>   - option: unix_socket_directories
>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>
>>>>>>>   - option: standard_conforming_strings
>>>>>>>     value: 'on'
>>>>>>>
>>>>>>>   - option: shared_buffers
>>>>>>>     value: '1024MB'
>>>>>>>
>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>   # checkpoint_segments=300
>>>>>>>   - option: max_wal_size
>>>>>>>     value: '14400MB'
>>>>>>>
>>>>>>>   - option: min_wal_size
>>>>>>>     value: '80MB'
>>>>>>>
>>>>>>>   - option: maintenance_work_mem
>>>>>>>     value: '2MB'
>>>>>>>
>>>>>>>   - option: listen_addresses
>>>>>>>     value: '*'
>>>>>>>
>>>>>>>   - option: max_connections
>>>>>>>     value: '400'
>>>>>>>
>>>>>>>   - option: checkpoint_timeout
>>>>>>>     value: '900'
>>>>>>>
>>>>>>>   - option: datestyle
>>>>>>>     value: "iso, mdy"
>>>>>>>
>>>>>>>   - option: autovacuum
>>>>>>>     value: 'off'
>>>>>>>
>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>   cron:
>>>>>>>     name: lazy_vacuum
>>>>>>>     hour: 8
>>>>>>>     minute: 0
>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>   cron:
>>>>>>>     name: full_vacuum
>>>>>>>     weekday: 0
>>>>>>>     hour: 10
>>>>>>>     minute: 0
>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>> # re-index all databases once a week
>>>>>>> - name: add postgresql cron reindex
>>>>>>>   cron:
>>>>>>>     name: reindex
>>>>>>>     weekday: 0
>>>>>>>     hour: 12
>>>>>>>     minute: 0
>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"'"
>>>>>>>
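The `max_wal_size` value in the configuration above follows from the conversion formula quoted in its comment. A small worked check, with hypothetical class and method names, assuming the default 16MB WAL segment size:

```java
/** Hypothetical helper for illustration only. */
final class WalSizeEstimate {
  /** Default WAL segment size in MB, as used in the comment's formula. */
  static final int WAL_SEGMENT_MB = 16;

  private WalSizeEstimate() {}

  /** max_wal_size = (3 * checkpoint_segments) * 16MB, per the config comment. */
  static int maxWalSizeMb(int checkpointSegments) {
    return 3 * checkpointSegments * WAL_SEGMENT_MB;
  }

  public static void main(String[] args) {
    // checkpoint_segments=300 is the value given in the config comment.
    System.out.println(maxWalSizeMb(300) + "MB");
  }
}
```

With `checkpoint_segments=300` this gives 3 * 300 * 16 = 14400, which matches the `'14400MB'` setting.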
>>>>>>>
>>>>>>> This is how I run 2.10.
>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>> @Karl: Any comments please?
>>>>>>> Steph
