manifoldcf-user mailing list archives

From Steph van Schalkwyk <st...@remcam.net>
Subject Re: PostgreSQL version to support MCF v2.10
Date Wed, 05 Sep 2018 16:10:06 GMT
Thank you. So I'll stop for now?
Steph




*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    steph@remcam.net   http://remcam.net   Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Wed, Sep 5, 2018 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:

> I'm already working on the Web Connector.  The UI has problems that
> predate this change and I've alerted Kishore about them -- he'll look into
> them later today.
>
> Karl
>
>
> On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <steph@remcam.net>
> wrote:
>
>> Thank you Karl.
>> You are of course correct in that the incremental crawl is now broken in
>> that it does a full crawl every time.
>> I'll jump on the Web Connector and add that functionality.
>> Thanks for this excellent application and all the help over the years.
>> Steph
>>
>> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> The patch I uploaded doesn't work because the entire tab is broken;
>>> looks like the UI refactoring broke it and it was never reported.  Fixing
>>> now.
>>> Karl
>>>
>>>
>>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I coded up the web connector feature I think we need.  See
>>>> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
>>>> see if it solves the case problem for your IIS site.
>>>>
>>>> For the "//" issue, can you be more specific about the mapping you need
>>>> to do?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Steph,
>>>>>
>>>>> Right, you wouldn't want to touch the framework.
>>>>>
>>>>> The effect of lower-casing the documentURI parameter in the
>>>>> addOrReplaceDocumentWithException method in an output connector would
>>>>> be to map multiple independently fetched documents that differ only by
>>>>> the case of the URL together into one document in the index.  The
>>>>> ManifoldCF assumption is that a document with a certain URI can be tracked
>>>>> in the index using exactly that URI.  Mapping the URI to lower case would
>>>>> break that assumption, so the framework would make the wrong decision in
>>>>> many cases.
>>>>>
>>>>> If you are picking up documents using the web connector and you are
>>>>> getting duplicate documents because the document URLs are sloppy, it is
>>>>> therefore essential that INSTEAD of mapping the document URI to lower
>>>>> case in the output connector, you map to lower case in the repository
>>>>> connector.  Otherwise the framework will not work right.
>>>>>
>>>>> There is a tab in the web connector that allows you to configure URL
>>>>> normalization, called "Canonicalization".  This would be a very appropriate
>>>>> place to add URL mapping to lower case.  It should be as simple as adding
>>>>> one more checkbox column in the table, and modifying the method that does
>>>>> the URL processing to include lower-casing.
>>>>>
>>>>> Karl
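
The repository-side lower-casing described above could look roughly like the following. This is only a sketch under stated assumptions: `UrlCaseCanonicalizer` and `canonicalize` are hypothetical names, not the web connector's actual Canonicalization code, and lower-casing the entire URL is assumed to be acceptable only for a site (such as the IIS server discussed in this thread) that treats URLs as case insensitive.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration; not part of ManifoldCF.
final class UrlCaseCanonicalizer {
    private UrlCaseCanonicalizer() {}

    /**
     * Returns the URL unchanged when lowerCase is false. When lowerCase is
     * true, lower-cases the whole URL with a fixed locale so the result does
     * not depend on the JVM's default locale.
     */
    static String canonicalize(String url, boolean lowerCase) {
        // Parse first so a malformed URL fails loudly (IllegalArgumentException)
        // instead of being mapped silently.
        String text = URI.create(url).toString();
        return lowerCase ? text.toLowerCase(Locale.ROOT) : text;
    }

    public static void main(String[] args) {
        // The flag plays the role of the proposed per-rule checkbox.
        System.out.println(canonicalize("http://iis.example.com/Docs/Report.ASPX", true));
        System.out.println(canonicalize("http://iis.example.com/Docs/Report.ASPX", false));
    }
}
```

Because the mapping happens before the document identifier is handed to the framework, both spellings of a URL become the same identifier, and the output connector can keep using the identifier exactly as received.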
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net>
>>>>> wrote:
>>>>>
>>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl
>>>>>>> I'm addressing it in the ES Output Connector.
>>>>>>> Not touching the framework :)
>>>>>>> S
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>>
>>>>>>>> Here is the output connector method that receives the ID (as the
>>>>>>>> documentURI parameter):
>>>>>>>>
>>>>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>>
>>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>>>> If you make it case insensitive in an output connector, this will
>>>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>>>> (which organizes the last indexed version by document ID).
>>>>>>>>
>>>>>>>> I therefore highly recommend that any "sloppiness" in this
>>>>>>>> parameter be addressed in the Repository Connector that constructs it.
>>>>>>>> If the connector is crawling a repository that believes that URLs are
>>>>>>>> case insensitive then it should map these IDs to lower case.  If not,
>>>>>>>> then it shouldn't.
>>>>>>>>
>>>>>>>> Karl
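
To make the risk above concrete, here is a toy model of what can go wrong when only the output side folds case. All classes are hypothetical stand-ins written for illustration; they do not mirror ManifoldCF internals. The one assumption taken from the message above is that the framework tracks documents by their exact ID.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model only; hypothetical classes, not ManifoldCF code.
final class CaseFoldingDemo {
    /** Stand-in for an index fed by an output connector that lower-cases IDs. */
    static final class LowerCasingIndex {
        final Map<String, String> docs = new HashMap<>();

        void addOrReplace(String documentURI, String content) {
            docs.put(documentURI.toLowerCase(Locale.ROOT), content);
        }

        void remove(String documentURI) {
            docs.remove(documentURI.toLowerCase(Locale.ROOT));
        }
    }

    /** Stand-in for the framework, which tracks each URI exactly as crawled. */
    static final class Tracker {
        final Map<String, String> lastIndexedVersion = new HashMap<>();
        final LowerCasingIndex index;

        Tracker(LowerCasingIndex index) {
            this.index = index;
        }

        void ingest(String uri, String version, String content) {
            lastIndexedVersion.put(uri, version);
            index.addOrReplace(uri, content);
        }

        void delete(String uri) {
            lastIndexedVersion.remove(uri);
            index.remove(uri);
        }
    }

    /** True when the tracker still lists a URI that the index has lost. */
    static boolean trackerAndIndexDisagree() {
        LowerCasingIndex index = new LowerCasingIndex();
        Tracker tracker = new Tracker(index);
        String mixed = "http://iis.example.com/Docs/Page.aspx";
        String lower = "http://iis.example.com/docs/page.aspx";
        tracker.ingest(mixed, "v1", "body");
        tracker.ingest(lower, "v1", "body");
        // The mixed-case spelling is no longer linked anywhere, so it is deleted.
        tracker.delete(mixed);
        boolean stillTracked = tracker.lastIndexedVersion.containsKey(lower);
        boolean stillIndexed = index.docs.containsKey(lower);
        return stillTracked && !stillIndexed;
    }

    public static void main(String[] args) {
        System.out.println("tracker and index disagree: " + trackerAndIndexDisagree());
    }
}
```

In this model, deleting one spelling silently removes the shared index entry while the tracker still considers the other spelling indexed. Folding case where the ID is constructed avoids the mismatch because both sides then see a single ID.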
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl.
>>>>>>>>> The issue is that the ES Output Connector uses the URI to create
>>>>>>>>> the _id. When used with IIS, which allows case variation in the URI,
>>>>>>>>> it creates multiple documents. Clients on Windows IIS are rarely
>>>>>>>>> cognizant of that issue, as IIS is so lax in policing that out of the
>>>>>>>>> box.
>>>>>>>>> Currently, every case variation in the URI results in a new doc in the
>>>>>>>>> index. This is only in the ES output connector.
>>>>>>>>> I can add an optional checkbox to control that particular
>>>>>>>>> action if that would help?
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the update.
>>>>>>>>>> Lower-casing the ID would be fine except there are some
>>>>>>>>>> connectors that care about case.  The web connector is one such, because
>>>>>>>>>> it's up to the web service to decide if case matters, so the web
>>>>>>>>>> connector does not view URLs with case differences as being the same.
>>>>>>>>>> Other connectors will likely care as well.  So I don't think
>>>>>>>>>> lower-casing the document ID is a smart thing to do.
>>>>>>>>>>
>>>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>>>> the ID.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Karl.
>>>>>>>>>>>
>>>>>>>>>>> I'll look into that.
>>>>>>>>>>>
>>>>>>>>>>> Another note:
>>>>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy
>>>>>>>>>>> sources, particularly IIS...)
>>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>>>> does not allow access to _id in the schema anymore, so no copy_field
>>>>>>>>>>> etc. from _id). Hence "url".
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Steph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Olivier
>>>>>>>>>>>>> By all means.
>>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>>>> which has to be restarted about once a week. Still trying to find the
>>>>>>>>>>>>> issue.
>>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>>>> Postgres 10 may be a bit slower. I have no empirical evidence at the
>>>>>>>>>>>>> moment as I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard
>>>>>>>>>>>>> <olivier.tavard@francelabs.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>>> - postgresql10-libs
>>>>>>>>>>>>>> - postgresql10
>>>>>>>>>>>>>> - postgresql10-server
>>>>>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>> - option: unix_socket_directories
>>>>>>>>>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: standard_conforming_strings
>>>>>>>>>>>>>>   value: 'on'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: shared_buffers
>>>>>>>>>>>>>>   value: '1024MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>> # checkpoint_segments=300
>>>>>>>>>>>>>> - option: max_wal_size
>>>>>>>>>>>>>>   value: '14400MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: min_wal_size
>>>>>>>>>>>>>>   value: '80MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: maintenance_work_mem
>>>>>>>>>>>>>>   value: '2MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: listen_addresses
>>>>>>>>>>>>>>   value: '*'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: max_connections
>>>>>>>>>>>>>>   value: '400'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: checkpoint_timeout
>>>>>>>>>>>>>>   value: '900'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: datestyle
>>>>>>>>>>>>>>   value: "iso, mdy"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - option: autovacuum
>>>>>>>>>>>>>>   value: 'off'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday
>>>>>>>>>>>>>> # night, lazy vacuum every night)
>>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>>> Steph
