manifoldcf-user mailing list archives

From Luca Alicata <alicatal...@gmail.com>
Subject Re: Job with Generic Connector stop to work
Date Mon, 09 May 2016 16:31:02 GMT
Hi Karl,
unfortunately I'm busy these days, but I will try to test it and let you know.

Thanks,
L. Alicata

2016-05-09 18:04 GMT+02:00 Karl Wright <daddywri@gmail.com>:

> Hi Luca,
>
> I've put together code that should allow multivalued attributes to be
> crawled.  In order to try it, you will need to check out the
> CONNECTORS-1313 branch:
>
> svn checkout
> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1313
>
> Then, build:
>
> ant make-core-deps
> ant build
>
> Please give this a try and see if it works for you.
>
> Thanks,
> Karl
>
>
> On Fri, May 6, 2016 at 10:15 AM, Luca Alicata <alicataluca@gmail.com>
> wrote:
>
>> Hi Karl,
>> I can confirm that it is a little expensive, but at the time I didn't have
>> much time, and I stopped working on it once I had found a solution.
>> Thanks for creating the ticket. For the moment, I will try to use the
>> Generic Connector.
>>
>> Another question: is there another connector that can use an application
>> to receive data, like the Generic Connector?
>>
>> Thanks,
>> L. Alicata
>>
>> 2016-05-06 16:02 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>
>>> Hi Luca,
>>>
>>> This approach causes each document's binary data to be read more than
>>> once.  I think that is expensive, especially if there are a lot of values
>>> for a row.
>>>
>>> Instead I think something more like ACLs will be needed -- that is, a
>>> separate query for each multi-valued field.  This is more work but it would
>>> work much better.
>>>
>>> I will create a ticket to add this to the JDBC connector, but it won't
>>> happen for a while.
>>>
>>> Karl
>>>
>>>
>>> On Fri, May 6, 2016 at 9:40 AM, Luca Alicata <alicataluca@gmail.com>
>>> wrote:
>>>
>>>> I decompiled the Java connector and modified the code in this way:
>>>>
>>>> In the document-processing code, I saw that all rows of the query
>>>> result already arrive (including the multi-value rows), but in the loop that
>>>> parses documents, after the first document with a given ID, all the others
>>>> with the same ID are skipped.
>>>> So I removed the check that prevents other documents with the same ID
>>>> from being processed, and I modified the method that stores metadata so
>>>> that multi-value data is stored as an array in the metadata mapping.
>>>>
>>>> I attached the code to this e-mail. You can find comments starting
>>>> with "---", which I inserted just now for you.
>>>>
>>>> Thanks,
>>>> L. Alicata
>>>>
>>>> 2016-05-06 15:25 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>
>>>>> Ok, it's now clear what you are looking for, but it is still not clear
>>>>> how we'd integrate that in the JDBC connector.  How did you do this when
>>>>> you modified the connector for 1.8?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, May 6, 2016 at 9:21 AM, Luca Alicata <alicataluca@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>> sorry for my English :).
>>>>>> I mean that when I have to extract values with a query that joins
>>>>>> two tables in a one-to-many relationship, the dataset returned
>>>>>> by the connector contains only one pair from the two tables.
>>>>>>
>>>>>> For example:
>>>>>> Table A with persons
>>>>>> Table B with eyes
>>>>>>
>>>>>> As the result of the join, I expect to have two rows like:
>>>>>> person 1, eye left
>>>>>> person 1, eye right
>>>>>>
>>>>>> but the connector returns only one row:
>>>>>> person 1, eye left
>>>>>>
>>>>>> I hope it's clearer now.
>>>>>>
>>>>>> PS: here is the sentence from the ManifoldCF documentation that explains this (
>>>>>> https://manifoldcf.apache.org/release/release-2.3/en_US/end-user-documentation.html#jdbcrepository
>>>>>> ):
>>>>>> ------
>>>>>> There is currently no support in the JDBC connection type for
>>>>>> natively handling multi-valued metadata.
>>>>>> ------
>>>>>>
>>>>>> Thanks,
>>>>>> L. Alicata
>>>>>>
>>>>>>
>>>>>> 2016-05-06 15:10 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>
>>>>>>> Hi Luca,
>>>>>>>
>>>>>>> It is not clear what you mean by "multi value extraction" using the
>>>>>>> JDBC connector.  The JDBC connector allows collection of primary binary
>>>>>>> content as well as metadata from a database row.  So maybe if you can
>>>>>>> explain what you need beyond that it would help.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 9:04 AM, Luca Alicata <alicataluca@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>> thanks for the information. Fortunately, in another JBoss instance I have
>>>>>>>> an old ManifoldCF configuration with a single process, which I had
>>>>>>>> retired. I am now starting to test these jobs with it, and if it works
>>>>>>>> fine, I can use it only for these jobs, also in production. Maybe later,
>>>>>>>> if I can, I will try to investigate the problem that stops the agent.
>>>>>>>>
>>>>>>>> I will take advantage of this discussion to ask whether multi-value
>>>>>>>> extraction from a DB is considered possible future work or not, because
>>>>>>>> I used the Generic Connector to work around this gap in the JDBC
>>>>>>>> connector. In fact, with ManifoldCF 1.8 I modified the connector to
>>>>>>>> support this behavior (in addition to parsing blob files), but when
>>>>>>>> upgrading the ManifoldCF version, to avoid rewriting the connector, I
>>>>>>>> decided to use the Generic Connector with an application that does the
>>>>>>>> work of extracting data from the DB.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> L. Alicata
>>>>>>>>
>>>>>>>> 2016-05-06 14:42 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi Luca,
>>>>>>>>>
>>>>>>>>> If you do a lock clean and the process still stops, then the locks
>>>>>>>>> are not the problem.
>>>>>>>>>
>>>>>>>>> One way we can drill down into the problem is to get a thread dump
>>>>>>>>> of the agents process after it stops.  The thread dump must be of the
>>>>>>>>> agents process, not any of the others.
>>>>>>>>>
>>>>>>>>> FWIW, the generic connector is not well supported; the person who
>>>>>>>>> wrote it is still a committer but is not actively involved in MCF
>>>>>>>>> development at this time.  I suspect that the problem may have to do with
>>>>>>>>> how that connector deals with exceptions or errors, but I am not sure.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 8:38 AM, Luca Alicata <
>>>>>>>>> alicataluca@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>> I've just tried lock-clean after the agents stopped working,
>>>>>>>>>> obviously after stopping the process. After this, the job started
>>>>>>>>>> correctly, but the second time I started a job with a lot of data (or
>>>>>>>>>> sometimes the third time), the agent stopped again.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, it is difficult for the moment to start using
>>>>>>>>>> Zookeeper in this environment. Would it resolve the fact that the agents
>>>>>>>>>> stop working while running, or does it only help with cleaning up agent
>>>>>>>>>> locks when I restart the process?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> L. Alicata
>>>>>>>>>>
>>>>>>>>>> 2016-05-06 14:15 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Luca,
>>>>>>>>>>>
>>>>>>>>>>> With file-based synchronization, if you kill any of the
>>>>>>>>>>> processes involved, you will need to execute the lock-clean procedure to
>>>>>>>>>>> make sure you have no dangling locks in the file system.
>>>>>>>>>>>
>>>>>>>>>>> - shut down all MCF processes (except the database)
>>>>>>>>>>> - run the lock-clean script
>>>>>>>>>>> - start your MCF processes back up
>>>>>>>>>>>
>>>>>>>>>>> I suspect what you are seeing is related to this.
>>>>>>>>>>>
>>>>>>>>>>> Also, please consider using Zookeeper instead, since it is more
>>>>>>>>>>> robust about cleaning out dangling locks.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 8:06 AM, Luca Alicata <
>>>>>>>>>>> alicataluca@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>> thanks for the help.
>>>>>>>>>>>> In my case I have only one instance of MCF running, with both
>>>>>>>>>>>> types of job (SP and Generic), so I have only one properties file
>>>>>>>>>>>> (which I have attached).
>>>>>>>>>>>> For information, I use the multiprocess-file configuration with
>>>>>>>>>>>> Postgres.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have other suggestions? Do you need more information
>>>>>>>>>>>> that I can give you?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> L.Alicata
>>>>>>>>>>>>
>>>>>>>>>>>> 2016-05-06 12:55 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Luca,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have multiple independent MCF clusters running at the
>>>>>>>>>>>>> same time?  It sounds like you do: you have SP on one, and Generic on
>>>>>>>>>>>>> another.  If so, you will need to be sure that the synchronization you are
>>>>>>>>>>>>> using (either zookeeper or file-based) does not overlap.  Each cluster
>>>>>>>>>>>>> needs its own synchronization.  If there is overlap, then doing things with
>>>>>>>>>>>>> one cluster may cause the other cluster to hang.  This also means you have
>>>>>>>>>>>>> to have different properties files for the two clusters, of course.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 4:32 AM, Luca Alicata <
>>>>>>>>>>>>> alicataluca@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I'm using ManifoldCF 2.2 with the multi-process configuration in a
>>>>>>>>>>>>>> JBoss instance on Windows Server 2012, and I have a set of jobs that
>>>>>>>>>>>>>> work with SharePoint (SP) or the Generic Connector (GC), which gets
>>>>>>>>>>>>>> files from a DB.
>>>>>>>>>>>>>> With SP I have no problems, while with GC and a lot of documents
>>>>>>>>>>>>>> (one job with 47k and another with 60k), the seeding process sometimes
>>>>>>>>>>>>>> does not finish, because the agents seem to stop (although the Java
>>>>>>>>>>>>>> process is still alive).
>>>>>>>>>>>>>> After this, if I try to start any other job, it does not start,
>>>>>>>>>>>>>> as if the agents were stopped.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Other times these jobs work correctly, and once they even worked
>>>>>>>>>>>>>> correctly together, running at the same time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For information:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - On JBoss there are only the ManifoldCF and Generic Repository
>>>>>>>>>>>>>>    applications.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - On the same virtual server, there is another JBoss
>>>>>>>>>>>>>>    instance, with a Solr instance and a web application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - I've checked whether it was some kind of memory problem, but
>>>>>>>>>>>>>>    that is not the case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - GC with almost 23k seeds always works, at least in the tests
>>>>>>>>>>>>>>    that I've done.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - In a local instance of JBoss with ManifoldCF and the Generic
>>>>>>>>>>>>>>    Repository application, I have not seen this problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is the only recurring information that I've seen in
>>>>>>>>>>>>>> manifold.log:
>>>>>>>>>>>>>> ---------------
>>>>>>>>>>>>>> Connection 0.0.0.0:62755<-><ip-address>:<port> shut down
>>>>>>>>>>>>>> Releasing connection
>>>>>>>>>>>>>> org.apache.http.impl.conn.ManagedClientConnectionImpl@6c98c1bd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> L. Alicata
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
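
For readers following the multi-value discussion in the thread above, here is a minimal sketch of the difference between keeping only the first joined row per ID and running a separate query for the multi-valued field, as Karl suggests. It is not the JDBC connector's code: it uses Python with an in-memory SQLite database as a stand-in, and the table names (person, eye), column names, and function names are all hypothetical, chosen to mirror Luca's persons/eyes example.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical one-to-many schema: one person, two eyes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE eye (person_id INTEGER, side TEXT);
        INSERT INTO person VALUES (1, 'person 1');
        INSERT INTO eye VALUES (1, 'left');
        INSERT INTO eye VALUES (1, 'right');
        """
    )
    return conn


def first_row_per_id(conn):
    """Mimic the behavior described in the thread: the join returns one
    row per (person, eye) pair, but rows after the first one with the
    same ID are skipped, so only one value survives."""
    docs = {}
    rows = conn.execute(
        "SELECT p.id, e.side FROM person p "
        "JOIN eye e ON e.person_id = p.id ORDER BY p.id, e.rowid"
    ).fetchall()
    for person_id, side in rows:
        if person_id in docs:
            continue  # duplicate ID: skipped, so 'right' is lost
        docs[person_id] = [side]
    return docs


def separate_query_per_field(conn):
    """Sketch of the alternative: one query for the documents, then a
    separate query per document for the multi-valued field."""
    docs = {}
    ids = conn.execute("SELECT id FROM person ORDER BY id").fetchall()
    for (person_id,) in ids:
        sides = conn.execute(
            "SELECT side FROM eye WHERE person_id = ? ORDER BY rowid",
            (person_id,),
        ).fetchall()
        docs[person_id] = [row[0] for row in sides]
    return docs
```

With the demo data, first_row_per_id keeps only "left" for person 1, while separate_query_per_field returns both "left" and "right". The second approach costs an extra query per document but reads the document row itself only once.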

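
Karl's request for a thread dump of the agents process can be met with the JDK's jstack tool. The sketch below is only one way to do it, assuming jstack is on the PATH and that you have already found the process ID of the agents process yourself; the function names and the output path are hypothetical.

```python
import subprocess


def thread_dump_command(pid):
    """Build the jstack command line; -l adds lock information to the dump."""
    return ["jstack", "-l", str(pid)]


def save_thread_dump(pid, out_path):
    """Run jstack against the given process ID and write the dump to a file.

    Raises subprocess.CalledProcessError if jstack exits with an error,
    for example when the PID does not belong to a running JVM.
    """
    result = subprocess.run(
        thread_dump_command(pid),
        capture_output=True,
        text=True,
        check=True,
    )
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
    return out_path
```

For example, save_thread_dump(1234, "agents-threads.txt") would write the dump of process 1234 to that file; the PID must be that of the agents process, not of any other MCF process.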