manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Job with Generic Connector stop to work
Date Mon, 09 May 2016 16:04:09 GMT
Hi Luca,

I've put together code that should allow multivalued attributes to be
crawled.  In order to try it, you will need to check out the
CONNECTORS-1313 branch:

svn checkout https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1313

Then, build:

ant make-core-deps
ant build

Please give this a try and see if it works for you.

Thanks,
Karl


On Fri, May 6, 2016 at 10:15 AM, Luca Alicata <alicataluca@gmail.com> wrote:

> Hi Karl,
> I can confirm that it is a little expensive, but at the time I did not
> have much time, and I stopped working on it once I found a solution.
> Thanks for creating the ticket; for the moment I will try to use the
> Generic Connector.
>
> Another question: is there another connector that can use an application
> to receive data, like the Generic Connector?
>
> Thanks,
> L. Alicata
>
> 2016-05-06 16:02 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>
>> Hi Luca,
>>
>> This approach causes each document's binary data to be read more than
>> once.  I think that is expensive, especially if there are a lot of values
>> for a row.
>>
>> Instead I think something more like ACLs will be needed -- that is, a
>> separate query for each multi-valued field.  This is more work but it would
>> work much better.
>>
>> I will create a ticket to add this to the JDBC connector, but it won't
>> happen for a while.
>>
>> Karl
>>
>>
>> On Fri, May 6, 2016 at 9:40 AM, Luca Alicata <alicataluca@gmail.com>
>> wrote:
>>
>>> I decompiled the Java connector and modified the code in this way:
>>>
>>> In the document-processing code, I saw that all rows of the query result
>>> already arrive (including the multi-value rows), but in the loop that
>>> parses documents, after the first document with a given ID, all the
>>> others with the same ID are skipped.
>>> So I removed the check that prevents other documents with the same ID
>>> from being processed, and I modified the method that stores metadata so
>>> that multi-value data can be stored as an array in the metadata mapping.
>>>
>>> I have attached the code to this e-mail. You can find comments that start
>>> with "---", which I have inserted now for you.
>>>
>>> Thanks,
>>> L. Alicata
>>>
>>> 2016-05-06 15:25 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>
>>>> Ok, it's now clear what you are looking for, but it is still not clear
>>>> how we'd integrate that in the JDBC connector.  How did you do this when
>>>> you modified the connector for 1.8?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, May 6, 2016 at 9:21 AM, Luca Alicata <alicataluca@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>> sorry for my English :).
>>>>> I mean that when I extract values with a query that joins two tables
>>>>> in a one-to-many relationship, the dataset returned by the connector
>>>>> contains only one pair from the two tables.
>>>>>
>>>>> For example:
>>>>> Table A with persons
>>>>> Table B with eyes
>>>>>
>>>>> As the result of the join, I expect to have two rows like:
>>>>> person 1, eye left
>>>>> person 1, eye right
>>>>>
>>>>> but the connector returns only one row:
>>>>> person 1, eye left
>>>>>
>>>>> I hope it is clearer now.
>>>>>
>>>>> P.S. Here is the sentence in the Manifold documentation that states
>>>>> this (
>>>>> https://manifoldcf.apache.org/release/release-2.3/en_US/end-user-documentation.html#jdbcrepository
>>>>> ):
>>>>> ------
>>>>> There is currently no support in the JDBC connection type for natively
>>>>> handling multi-valued metadata.
>>>>> ------
>>>>>
>>>>> Thanks,
>>>>> L. Alicata
>>>>>
>>>>>
>>>>> 2016-05-06 15:10 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>
>>>>>> Hi Luca,
>>>>>>
>>>>>> It is not clear what you mean by "multi value extraction" using the
>>>>>> JDBC connector.  The JDBC connector allows collection of primary
>>>>>> binary content as well as metadata from a database row.  So maybe if
>>>>>> you can explain what you need beyond that it would help.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, May 6, 2016 at 9:04 AM, Luca Alicata <alicataluca@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>> thanks for the information. Fortunately, in another JBoss instance I
>>>>>>> have an old single-process Manifold configuration that I had retired.
>>>>>>> For now I will start testing these jobs with it, and if it works
>>>>>>> fine I can use it just for this job, also in production. Later, if I
>>>>>>> can, I will try to track down the problem that stops the agent.
>>>>>>>
>>>>>>> I'll take advantage of this discussion to ask whether multi-value
>>>>>>> extraction from a DB is being considered as possible future work.
>>>>>>> I used the Generic Connector to work around this gap in the JDBC
>>>>>>> Connector. With Manifold 1.8 I had modified the connector to support
>>>>>>> this behavior (in addition to parsing blob files), but when upgrading
>>>>>>> Manifold, rather than rewrite the new connector, I decided to use the
>>>>>>> Generic Connector with an application that extracts the data from
>>>>>>> the DB.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> L. Alicata
>>>>>>>
>>>>>>> 2016-05-06 14:42 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>
>>>>>>>> Hi Luca,
>>>>>>>>
>>>>>>>> If you do a lock clean and the process still stops, then the locks
>>>>>>>> are not the problem.
>>>>>>>>
>>>>>>>> One way we can drill down into the problem is to get a thread dump
>>>>>>>> of the agents process after it stops.  The thread dump must be of
>>>>>>>> the agents process, not any of the others.
>>>>>>>>
>>>>>>>> FWIW, the generic connector is not well supported; the person who
>>>>>>>> wrote it is still a committer but is not actively involved in MCF
>>>>>>>> development at this time.  I suspect that the problem may have to do
>>>>>>>> with how that connector deals with exceptions or errors, but I am
>>>>>>>> not sure.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 8:38 AM, Luca Alicata <alicataluca@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>> I've just tried lock-clean after the agents stopped working,
>>>>>>>>> obviously after stopping the process. After that, the job started
>>>>>>>>> correctly, but the second time I started a job with a lot of data
>>>>>>>>> (or sometimes the third time), the agents stopped again.
>>>>>>>>>
>>>>>>>>> Unfortunately, it is difficult for the moment to start using
>>>>>>>>> Zookeeper in this environment. Would it fix the agents stopping
>>>>>>>>> while they are working, or would it only help with cleaning agent
>>>>>>>>> locks when I restart the process?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> L. Alicata
>>>>>>>>>
>>>>>>>>> 2016-05-06 14:15 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi Luca,
>>>>>>>>>>
>>>>>>>>>> With file-based synchronization, if you kill any of the processes
>>>>>>>>>> involved, you will need to execute the lock-clean procedure to make
>>>>>>>>>> sure you have no dangling locks in the file system.
>>>>>>>>>>
>>>>>>>>>> - shut down all MCF processes (except the database)
>>>>>>>>>> - run the lock-clean script
>>>>>>>>>> - start your MCF processes back up
>>>>>>>>>>
>>>>>>>>>> I suspect what you are seeing is related to this.
>>>>>>>>>>
>>>>>>>>>> Also, please consider using Zookeeper instead, since it is more
>>>>>>>>>> robust about cleaning out dangling locks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 8:06 AM, Luca Alicata <
>>>>>>>>>> alicataluca@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>> thanks for the help.
>>>>>>>>>>> In my case I have only one instance of MCF running, with both types
>>>>>>>>>>> of job (SP and Generic), so I have only one properties file (which
>>>>>>>>>>> I have attached).
>>>>>>>>>>> For information, I use the multiprocess-file configuration with
>>>>>>>>>>> Postgres.
>>>>>>>>>>>
>>>>>>>>>>> Do you have other suggestions? Do you need any more information
>>>>>>>>>>> that I can give you?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> L.Alicata
>>>>>>>>>>>
>>>>>>>>>>> 2016-05-06 12:55 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Luca,
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have multiple independent MCF clusters running at the
>>>>>>>>>>>> same time?  It sounds like you do: you have SP on one, and Generic
>>>>>>>>>>>> on another.  If so, you will need to be sure that the
>>>>>>>>>>>> synchronization you are using (either zookeeper or file-based)
>>>>>>>>>>>> does not overlap.  Each cluster needs its own synchronization.  If
>>>>>>>>>>>> there is overlap, then doing things with one cluster may cause the
>>>>>>>>>>>> other cluster to hang.  This also means you have to have different
>>>>>>>>>>>> properties files for the two clusters, of course.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 4:32 AM, Luca Alicata <
>>>>>>>>>>>> alicataluca@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I'm using Manifold 2.2 with a multi-process configuration in a
>>>>>>>>>>>>> JBoss instance on Windows Server 2012, and I have a set of jobs
>>>>>>>>>>>>> that work with SharePoint (SP) or the Generic Connector (GC),
>>>>>>>>>>>>> which gets files from a DB.
>>>>>>>>>>>>> With SP I have no problems, while with GC and a lot of documents
>>>>>>>>>>>>> (one job with 47k and another with 60k), the seeding process
>>>>>>>>>>>>> sometimes does not finish, because the agents seem to stop
>>>>>>>>>>>>> (although the Java process is still alive).
>>>>>>>>>>>>> After this, if I try to start any other job, it does not start,
>>>>>>>>>>>>> as if the agents were stopped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other times these jobs work correctly, and once they both worked
>>>>>>>>>>>>> correctly while running at the same time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For information:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - On JBoss there are only the Manifold and Generic Repository
>>>>>>>>>>>>>    applications.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - On the same virtual server there is another JBoss instance,
>>>>>>>>>>>>>    with a Solr instance and a web application.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - I have checked whether it was some kind of memory problem,
>>>>>>>>>>>>>    but that is not the case.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - GC with almost 23k seeds always works, at least in the tests
>>>>>>>>>>>>>    that I have done.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - In a local instance of JBoss with Manifold and the Generic
>>>>>>>>>>>>>    Repository application, I have not seen this problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the only recurring information that I have seen in
>>>>>>>>>>>>> manifold.log:
>>>>>>>>>>>>> ---------------
>>>>>>>>>>>>> Connection 0.0.0.0:62755<-><ip-address>:<port> shut down
>>>>>>>>>>>>> Releasing connection
>>>>>>>>>>>>> org.apache.http.impl.conn.ManagedClientConnectionImpl@6c98c1bd
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> L. Alicata
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
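
Editor's sketch: the thread contrasts two ways of getting multi-valued metadata out of a one-to-many join (the persons and eyes example). The snippet below illustrates only the idea Karl describes, a primary query that yields one row per document plus a separate query for the multi-valued field. It is not the CONNECTORS-1313 code and does not use any ManifoldCF API; the table names, column names, and the `crawl` and `demo` functions are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


def crawl(conn):
    """Build one document per person, with the multi-valued "eye" field
    filled by a separate query instead of a join in the primary query.
    All table and column names here are hypothetical."""
    docs = {}
    # Primary query: exactly one row per document, so each document
    # (and, in a real crawl, its binary content) is read only once.
    for person_id, name in conn.execute(
        "SELECT id, name FROM person ORDER BY id"
    ):
        docs[person_id] = {"name": name, "eye": []}
    # Separate query for the multi-valued field, keyed by document ID.
    for person_id, side in conn.execute(
        "SELECT person_id, side FROM eye ORDER BY person_id, side"
    ):
        if person_id in docs:
            docs[person_id]["eye"].append(side)
    return docs


def demo():
    """Run the sketch against an in-memory database holding the
    'person 1, eye left / eye right' example from the thread."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(
            """
            CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE eye (person_id INTEGER, side TEXT);
            INSERT INTO person VALUES (1, 'person 1');
            INSERT INTO eye VALUES (1, 'left');
            INSERT INTO eye VALUES (1, 'right');
            """
        )
        return crawl(conn)
    finally:
        conn.close()
```

Calling `demo()` returns a single document for person 1 whose `eye` field holds both values, rather than one row per join pair.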
