manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Job with Generic Connector stop to work
Date Fri, 06 May 2016 14:02:36 GMT
Hi Luca,

This approach causes each document's binary data to be read more than
once.  I think that is expensive, especially if there are a lot of values
for a row.

Instead I think something more like ACLs will be needed -- that is, a
separate query for each multi-valued field.  This is more work but it would
work much better.
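As a rough illustration of the ACL-style approach (a sketch using sqlite3 with hypothetical table and column names, not the JDBC connector's actual configuration), the main query returns exactly one row per document, while each multi-valued field gets its own (id, value) query whose results are grouped afterwards:

```python
import sqlite3
from collections import defaultdict

# Sketch of the ACL-style approach with made-up tables: the main query
# returns one row per document (so binary data is read once), and a
# separate query per multi-valued field returns (id, value) pairs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doc (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE doc_tag (doc_id INTEGER, tag TEXT);
INSERT INTO doc VALUES (1, 'document body');
INSERT INTO doc_tag VALUES (1, 'alpha'), (1, 'beta');
""")
# Main document query: exactly one row per document.
docs = dict(conn.execute("SELECT id, body FROM doc"))
# Per-field query: many rows per document, grouped into multi-valued metadata.
tags = defaultdict(list)
for doc_id, tag in conn.execute("SELECT doc_id, tag FROM doc_tag ORDER BY tag"):
    tags[doc_id].append(tag)
print(docs[1], tags[1])  # document body ['alpha', 'beta']
```

The extra round trip per field is the "more work" mentioned above, but it keeps the expensive document query at one row per document.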

I will create a ticket to add this to the JDBC connector, but it won't
happen for a while.

Karl


On Fri, May 6, 2016 at 9:40 AM, Luca Alicata <alicataluca@gmail.com> wrote:

> I've decompiled the Java connector and modified the code as follows:
>
> In the document processing code, I see that all rows of the query result
> do currently arrive (including the multi-value rows), but in the loop that
> parses documents, after the first document with a given ID, all the others
> with the same ID are skipped.
> So I removed the check that prevents processing further documents with the
> same ID, and I modified the method that stores metadata so that it stores
> multi-value data as an array in the metadata mapping.
>
> I attached the code to this e-mail. You can find the comments I inserted
> for you; they start with "---".
>
> Thanks,
> L. Alicata
>
> 2016-05-06 15:25 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>
>> Ok, it's now clear what you are looking for, but it is still not clear
>> how we'd integrate that in the JDBC connector.  How did you do this when
>> you modified the connector for 1.8?
>>
>> Karl
>>
>>
>> On Fri, May 6, 2016 at 9:21 AM, Luca Alicata <alicataluca@gmail.com> wrote:
>>
>>> Hi Karl,
>>> sorry for my English :).
>>> I mean that when I extract values with a query that joins two tables in
>>> a one-to-many relationship, the dataset returned by the connector
>>> contains only one pairing from the two tables.
>>>
>>> For example:
>>> Table A with persons
>>> Table B with eyes
>>>
>>> As the result of the join, I expect to get two rows like:
>>> person 1, eye left
>>> person 1, eye right
>>>
>>> but the connector returns only one row:
>>> person 1, eye left
>>>
>>> I hope it's clearer now.
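For what it's worth, the one-to-many case above can be reproduced with a small sqlite3 script (the schema is improvised to match the example): the join itself does return both rows, so the collapsing to one row per ID happens inside the connector, not in SQL.

```python
import sqlite3

# Person/eyes example from the message; tables are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE eye (person_id INTEGER, side TEXT);
INSERT INTO person VALUES (1, 'person 1');
INSERT INTO eye VALUES (1, 'eye left'), (1, 'eye right');
""")
# The one-to-many join really produces two rows for the same person ID.
rows = conn.execute("""
    SELECT p.name, e.side
    FROM person p JOIN eye e ON e.person_id = p.id
    ORDER BY e.side
""").fetchall()
for name, side in rows:
    print(name, side)  # person 1 eye left / person 1 eye right
```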
>>>
>>> PS: I'll quote the sentence from the ManifoldCF documentation that
>>> explains this (
>>> https://manifoldcf.apache.org/release/release-2.3/en_US/end-user-documentation.html#jdbcrepository
>>> ):
>>> ------
>>> There is currently no support in the JDBC connection type for natively
>>> handling multi-valued metadata.
>>> ------
>>>
>>> Thanks,
>>> L. Alicata
>>>
>>>
>>> 2016-05-06 15:10 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>
>>>> Hi Luca,
>>>>
>>>> It is not clear what you mean by "multi value extraction" using the
>>>> JDBC connector.  The JDBC connector allows collection of primary binary
>>>> content as well as metadata from a database row.  So maybe if you can
>>>> explain what you need beyond that it would help.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, May 6, 2016 at 9:04 AM, Luca Alicata <alicataluca@gmail.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>> thanks for the information. Fortunately, on another JBoss instance I
>>>>> have an old ManifoldCF configuration with a single process, which I had
>>>>> dismissed. At the moment I'm starting to test these jobs with it, and
>>>>> if it works fine I can use it just for this job, and use it in
>>>>> production too. Maybe afterwards, if I can, I'll try to track down the
>>>>> problem that stops the agents.
>>>>>
>>>>> I'll take advantage of this discussion to ask whether multi-value
>>>>> extraction from the DB is considered possible future work or not. I
>>>>> used this Generic Connector to work around this limitation of the JDBC
>>>>> connector. In fact, with ManifoldCF 1.8 I had modified the connector to
>>>>> support this behavior (in addition to parsing blob files), but when
>>>>> upgrading ManifoldCF, rather than rewrite the connector I decided to
>>>>> use the Generic Connector with an application that does the work of
>>>>> extracting the data from the DB.
>>>>>
>>>>> Thanks,
>>>>> L. Alicata
>>>>>
>>>>> 2016-05-06 14:42 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>
>>>>>> Hi Luca,
>>>>>>
>>>>>> If you do a lock-clean and the process still stops, then the locks
>>>>>> are not the problem.
>>>>>>
>>>>>> One way we can drill down into the problem is to get a thread dump of
>>>>>> the agents process after it stops.  The thread dump must be of the
>>>>>> agents process, not any of the others.
>>>>>>
>>>>>> FWIW, the generic connector is not well supported; the person who
>>>>>> wrote it is still a committer but is not actively involved in MCF
>>>>>> development at this time.  I suspect that the problem may have to do
>>>>>> with how that connector deals with exceptions or errors, but I am not
>>>>>> sure.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, May 6, 2016 at 8:38 AM, Luca Alicata <alicataluca@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>> I've just tried lock-clean after the agents stopped working, obviously
>>>>>>> after stopping the process. After this, the job starts correctly, but
>>>>>>> the second time I start a job with a lot of data (or sometimes the
>>>>>>> third time), the agent stops again.
>>>>>>>
>>>>>>> Unfortunately, for the moment it's difficult to start using Zookeeper
>>>>>>> in this environment, but would that fix the agents stopping while
>>>>>>> working? Or would it only help with cleaning agent locks when I
>>>>>>> restart the process?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> L. Alicata
>>>>>>>
>>>>>>> 2016-05-06 14:15 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>
>>>>>>>> Hi Luca,
>>>>>>>>
>>>>>>>> With file-based synchronization, if you kill any of the processes
>>>>>>>> involved, you will need to execute the lock-clean procedure to make
>>>>>>>> sure you have no dangling locks in the file system.
>>>>>>>>
>>>>>>>> - shut down all MCF processes (except the database)
>>>>>>>> - run the lock-clean script
>>>>>>>> - start your MCF processes back up
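For file-based synchronization, the lock-clean step amounts to clearing dangling lock files out of the synch directory (the one configured via the org.apache.manifoldcf.synchdirectory property). A minimal stand-in demo of that effect, using a throwaway directory rather than a real MCF install:

```python
import os
import tempfile

# Demo only: simulate a synch directory with dangling lock files and
# clear it, which is roughly what lock-clean accomplishes. A real setup
# would use the synch directory from properties.xml instead.
synch_dir = tempfile.mkdtemp(prefix="mcf-synch-demo-")
for name in ("lock-1", "lock-2"):      # simulate dangling locks
    open(os.path.join(synch_dir, name), "w").close()
print("before:", len(os.listdir(synch_dir)))   # before: 2
for name in os.listdir(synch_dir):     # clear the dangling locks
    os.remove(os.path.join(synch_dir, name))
print("after:", len(os.listdir(synch_dir)))    # after: 0
```

In practice you would run the lock-clean script shipped with the MCF multiprocess example rather than delete files by hand, and only with all MCF processes stopped.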
>>>>>>>>
>>>>>>>> I suspect what you are seeing is related to this.
>>>>>>>>
>>>>>>>> Also, please consider using Zookeeper instead, since it is more
>>>>>>>> robust about cleaning out dangling locks.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 8:06 AM, Luca Alicata <alicataluca@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>> thanks for the help.
>>>>>>>>> In my case I have only one instance of MCF running, with both types
>>>>>>>>> of job (SP and Generic), and so I have only one properties file
>>>>>>>>> (attached).
>>>>>>>>> For information, I use the multiprocess-file configuration with
>>>>>>>>> Postgres.
>>>>>>>>>
>>>>>>>>> Do you have other suggestions? Do you need more information that I
>>>>>>>>> can give you?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> L.Alicata
>>>>>>>>>
>>>>>>>>> 2016-05-06 12:55 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi Luca,
>>>>>>>>>>
>>>>>>>>>> Do you have multiple independent MCF clusters running at the same
>>>>>>>>>> time?  It sounds like you do: you have SP on one, and Generic on
>>>>>>>>>> another.  If so, you will need to be sure that the synchronization
>>>>>>>>>> you are using (either zookeeper or file-based) does not overlap.
>>>>>>>>>> Each cluster needs its own synchronization.  If there is overlap,
>>>>>>>>>> then doing things with one cluster may cause the other cluster to
>>>>>>>>>> hang.  This also means you have to have different properties files
>>>>>>>>>> for the two clusters, of course.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 4:32 AM, Luca Alicata <alicataluca@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> I'm using ManifoldCF 2.2 with a multi-process configuration in a
>>>>>>>>>>> JBoss instance on Windows Server 2012, and I have a set of jobs
>>>>>>>>>>> that work with SharePoint (SP) or the Generic Connector (GC),
>>>>>>>>>>> which gets files from a DB.
>>>>>>>>>>> With SP I have no problem, while with GC, on jobs with a lot of
>>>>>>>>>>> documents (one with 47k and another with 60k), the seeding
>>>>>>>>>>> process sometimes does not finish, because the agents seem to
>>>>>>>>>>> stop (although the Java process is still alive).
>>>>>>>>>>> After this, if I try to start any other job, it does not start,
>>>>>>>>>>> as if the agents were stopped.
>>>>>>>>>>>
>>>>>>>>>>> Other times, these jobs work correctly, and once they even worked
>>>>>>>>>>> correctly running at the same time.
>>>>>>>>>>>
>>>>>>>>>>> For information:
>>>>>>>>>>>
>>>>>>>>>>>    - On JBoss there are only the ManifoldCF and Generic
>>>>>>>>>>>    Repository applications.
>>>>>>>>>>>
>>>>>>>>>>>    - On the same virtual server, there is another JBoss instance,
>>>>>>>>>>>    with a Solr instance and a web application.
>>>>>>>>>>>
>>>>>>>>>>>    - I've checked whether it was some kind of memory problem, but
>>>>>>>>>>>    that's not the case.
>>>>>>>>>>>
>>>>>>>>>>>    - GC with almost 23k seeds always works, at least in the tests
>>>>>>>>>>>    I've done.
>>>>>>>>>>>
>>>>>>>>>>>    - In a local instance of JBoss with ManifoldCF and the Generic
>>>>>>>>>>>    Repository application, I have not had this problem.
>>>>>>>>>>>
>>>>>>>>>>> This is the only recurrent information that I've seen in
>>>>>>>>>>> manifold.log:
>>>>>>>>>>> ---------------
>>>>>>>>>>> Connection 0.0.0.0:62755<-><ip-address>:<port> shut down
>>>>>>>>>>> Releasing connection
>>>>>>>>>>> org.apache.http.impl.conn.ManagedClientConnectionImpl@6c98c1bd
>>>>>>>>>>> ---------------
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> L. Alicata
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
