manifoldcf-user mailing list archives

From Anupam Bhattacharya <anupam...@gmail.com>
Subject Re: Need Help on setting up ManifoldCF
Date Thu, 23 Feb 2012 19:41:47 GMT
Thanks Karl,

I was just curious: can the Documentum connector in ManifoldCF also index
binary documents, in addition to the document types defined in the content
model and their metadata?

I ask because configuring the Documentum repository connection once in
ManifoldCF for the crawler, and then again in Solr to fetch the actual
document, would repeat the work of fetching the metadata for each document.

Regards
Anupam

On Fri, Feb 24, 2012 at 12:44 AM, Karl Wright <daddywri@gmail.com> wrote:

> Glad it is working for you!
>
> Solr is almost infinitely flexible, so you have many options.
>
> In my opinion the best way to convert binary documents to indexable
> text is indeed to use Solr Cell.  Solr Cell is built on Tika, so
> you won't need to bring in Tika separately because it should already be
> there. Tika has a pipeline architecture which should suit your use
> case well.  It should thus be possible to configure the existing
> update handler to use Solr Cell, and configure Solr Cell's Tika
> instance to perform whatever transformations you need.
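
As a rough illustration of what a Solr Cell request looks like, the Python
sketch below only builds the URL for the extracting handler and sends
nothing. The host, port, document id, and target field name are assumptions;
/update/extract, literal.* and fmap.* follow the Solr Cell conventions used
in Solr's example configuration.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, field_map=None, commit=True):
    """Build a Solr Cell (ExtractingRequestHandler) request URL.

    base_url  -- Solr base URL (assumed here, e.g. http://localhost:8983/solr)
    doc_id    -- value passed as literal.id, the unique key of the document
    field_map -- optional {tika_field: solr_field} pairs sent as fmap.* params
    """
    params = [("literal.id", doc_id)]
    for source, target in (field_map or {}).items():
        # fmap.<source>=<target> renames an extracted field before indexing
        params.append(("fmap." + source, target))
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical values, for illustration only
    print(extract_url("http://localhost:8983/solr", "doc1", {"content": "text"}))
```

The document body itself would be sent as the POST payload to this URL, for
example with curl as mentioned later in the thread.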
>
> Hope this helps.  For further Solr questions, you can always ask on
> the Solr user list.  A Tika user list is also available.
>
> Thanks,
> Karl
>
> On Thu, Feb 23, 2012 at 2:04 PM, Anupam Bhattacharya
> <anupamb82@gmail.com> wrote:
> > Hello Karl,
> >
> > Finally, I was able to index all the metadata for the defined document
> > types with different content types. Everything went well.
> > However, I was not able to index the full-text content of the files (such
> > as PDF and XML). I read about Solr Cell, which lets us upload documents
> > using curl, but unfortunately our XML files contain tags and values that
> > also need to be indexed.
> > For example, a typical XML structure:
> >
> > <doc>
> > <object_id>111</object_id>
> > <abstract>Abstract Text</abstract>
> > <citation>Citation Text</citation>
> > <publication>News Source</publication>
> > </doc>
> >
> > I found that if we add a new request handler to Solr that extends
> > ExtractingRequestHandler, we can parse the documents, extract the
> > information, and add it as fields in the Solr index.
> >
> > What is the ideal approach for indexing tag values from XML in Lucene,
> > going from ManifoldCF to Solr? Is it necessary to integrate Tika for this?
> > I found a good post here: https://community.emc.com/docs/DOC-6520
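
To make the tag-to-field question concrete, here is a minimal Python sketch
that rewrites the sample document quoted above into Solr's XML update format
(<add><doc><field name="...">). It assumes each child tag maps to a Solr
field of the same name that already exists in the schema; that mapping is an
illustration only, not something ManifoldCF or Solr Cell is known to do
automatically.

```python
import xml.etree.ElementTree as ET

# The sample structure from the message above
SAMPLE = """<doc>
<object_id>111</object_id>
<abstract>Abstract Text</abstract>
<citation>Citation Text</citation>
<publication>News Source</publication>
</doc>"""


def to_solr_add_xml(source_xml):
    """Turn each child tag of the source <doc> into a Solr <field> element."""
    source = ET.fromstring(source_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for child in source:
        # Assumption: the Solr field name is simply the XML tag name
        field = ET.SubElement(doc, "field", name=child.tag)
        field.text = (child.text or "").strip()
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_add_xml(SAMPLE))
```

The resulting message could be posted to Solr's regular XML update handler;
whether that fits a ManifoldCF-driven crawl is a separate question.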
> >
> > Appreciate your advice on this.
> >
> > Regards
> > Anupam
> >
> >
> >
> >
> > On Thu, Feb 16, 2012 at 12:17 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> On Wed, Feb 15, 2012 at 1:13 PM, Anupam Bhattacharya
> >> <anupamb82@gmail.com> wrote:
> >> > Hello Karl,
> >> >
> >> > Thanks for adding this to the JIRA system.
> >> >
> >> > The dfc.properties file was introduced in Documentum 6.0, and
> >> > according to the ManifoldCF connector documentation
> >> > (http://incubator.apache.org/connectors/en_US/included-connectors.html)
> >> > the out-of-the-box connector classes were tested against DFC 5.3 SP5,
> >> > which needed the dmcl.ini file. Thus run.bat must have been configured
> >> > for that dmcl.ini.
> >>
> >> Right - so does DFC 6.0 on Windows require the DOCUMENTUM environment
> >> variable to be set to point at the directory where dfc.properties is
> >> found?  Or perhaps it doesn't require the DOCUMENTUM environment
> >> variable at all anymore?
> >>
> >> >
> >> > As I am trying to connect to DFC 6.5 SP3, I need to look for the
> >> > dfc.properties file. I hope the out-of-the-box Documentum connector
> >> > will work with version 6.5.
> >>
> >> It was tried and worked.  The script was developed later with only the
> >> 5.3 version available.
> >>
> >> >
> >> > I am confused: why do we have client and server versions for all
> >> > connectors? Can you please explain?
> >> >
> >>
> >> Do you mean "why is there a documentum-connector-server" process?  If
> >> that's the question, it was created for two reasons:
> >> (1) We had problems with stability of DFC.  It segfaults occasionally,
> >> somewhere in its native code.  We did not want that to bring down
> >> ManifoldCF, and we wanted to be able to restart the part of the
> >> connector that depended on DFC transparently when it crashed.
> >> (2) DFC has dependencies on many older open-source jars that conflict
> >> with the rest of ManifoldCF.  If (1) was not a problem we might have
> >> used a classloader to fix this, but since we had to fix both we
> >> created a separate process.
> >>
> >> FWIW, we do the same thing for FileNet because of its dependency on
> >> Wasp.
> >>
> >> Karl
> >>
> >> > Again, Thanks for all the help.
> >> >
> >> > Regards
> >> > Anupam
> >> >
> >> >
> >> > On Wed, Feb 15, 2012 at 8:42 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >>
> >> >> Hi Anupam,
> >> >>
> >> >> I did not see a ticket from you about the DOCUMENTUM environment
> >> >> variable and the dmcl.ini vs. dfc.properties file.  I've created an
> >> >> issue at https://issues.apache.org/jira/browse/CONNECTORS-410 to
> >> >> track this problem.  It would be great if you could confirm that: (a)
> >> >> the DOCUMENTUM environment variable is still needed at all by DFC, and
> >> >> (b) when it is set properly, the file dfc.properties can be found at
> >> >> $DOCUMENTUM\dfc.properties (on Windows, at least).
> >> >>
> >> >> Thanks,
> >> >> Karl
> >> >>
> >> >> On Tue, Feb 14, 2012 at 3:23 PM, Karl Wright <daddywri@gmail.com>
> >> >> wrote:
> >> >> > Hi Anupam,
> >> >> >
> >> >> > Please post emails like this directly to
> >> >> > connectors-user@incubator.apache.org.  See below for responses.
> >> >> >
> >> >> > On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya
> >> >> > <anupamb82@gmail.com> wrote:
> >> >> >>
> >> >> >> Hello Karl,
> >> >> >>
> >> >> >> I am a software programmer at DuPont, Gurgaon, India. Recently, due
> >> >> >> to economic instability all over the world, the company has decided
> >> >> >> to move to cheaper search engine applications. We are therefore
> >> >> >> getting rid of many costly proprietary search applications and
> >> >> >> replacing them with FAST.
> >> >> >>
> >> >> >> However, I recently came across the Solr search engine and the
> >> >> >> ManifoldCF connector framework, and I am now driving this effort
> >> >> >> within my company, as I am a big supporter of open source
> >> >> >> technologies. I started my career with Alfresco CMS and now work on
> >> >> >> search technologies.
> >> >> >>
> >> >> >> Currently I am facing a lot of initial build, deployment, and
> >> >> >> installation issues. I have already consulted
> >> >> >> http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html
> >> >> >> and read it multiple times, but I still face many issues. I
> >> >> >> downloaded the latest 0.4 version, and it seems the documentation at
> >> >> >> the above link is not up to date.
> >> >> >>
> >> >> >
> >> >> > The online documentation is pertinent to trunk.  The documentation
> >> >> > you want to use is contained within the 0.4-incubating release.  Go
> >> >> > to dist/doc and you will see it there.
> >> >> >
> >> >> >> A few issues that took me a long time to resolve, and that could be
> >> >> >> added to the ManifoldCF wiki as lessons for others, are listed below:
> >> >> >> a. No example is given of running executecommand.bat with proper
> >> >> >> arguments. Only a list of commands with their parameters is given.
> >> >> >
> >> >> > I'm not entirely sure I get this.  Do you just want an example in
> >> >> > the documentation?
> >> >> >
> >> >> >> b. Where to set the manifoldcf.configfile property, and which file
> >> >> >> it should point to, when deploying the war on Tomcat with a
> >> >> >> PostgreSQL database.
> >> >> >
> >> >> > The documentation already tells you that you need to add an
> >> >> > appropriate -D to your tomcat invocation to point to your
> >> >> > properties.xml file.  Tomcat documentation differs from version to
> >> >> > version and platform to platform on how best to do that, and if you
> >> >> > run under Windows there's even a service wrapper with a
> >> >> > configuration UI that allows you to set these parameters.  So it's
> >> >> > way beyond ManifoldCF's mission to describe all that, I think.
> >> >> >
> >> >> >> c. I am trying to build the Documentum connector but found that an
> >> >> >> additional environment variable, "DOCUMENTUM", needs to be set.
> >> >> >> Additionally, the latest version of Documentum uses the
> >> >> >> dfc.properties file, while run.bat looks for the dmcl.ini file.
> >> >> >
> >> >> > Could you open a ticket in Jira for this issue?
> >> >> > https://issues.apache.org/jira. It should not be a problem if you
> >> >> > modify the script temporarily, but we can readily make the script
> >> >> > look for either of these.
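
The "look for either file" idea could be sketched as follows in Python. The
real launcher is run.bat, so this only illustrates the logic. It assumes both
files would sit directly under the directory named by DOCUMENTUM, which is
exactly the point still awaiting confirmation in this thread;
find_dfc_config and its is_file parameter are hypothetical names.

```python
import os

# Checked in order: DFC 6.x uses dfc.properties, DFC 5.3 used dmcl.ini
CANDIDATES = ("dfc.properties", "dmcl.ini")


def find_dfc_config(documentum_dir, is_file=os.path.isfile):
    """Return the first DFC config file found under documentum_dir, or None.

    Assumption: the files live directly in the DOCUMENTUM directory.
    is_file is injectable so the lookup can be exercised without real files.
    """
    if not documentum_dir:
        return None
    for name in CANDIDATES:
        path = os.path.join(documentum_dir, name)
        if is_file(path):
            return path
    return None


if __name__ == "__main__":
    # Reads the DOCUMENTUM environment variable if it is set
    print(find_dfc_config(os.environ.get("DOCUMENTUM")))
```

Preferring dfc.properties over dmcl.ini is a design choice of this sketch, not
a documented behaviour of the connector scripts.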
> >> >> >
> >> >> >> d. The PostgreSQL driver is JDBC3, so it causes problems with JVM 6
> >> >> >> or above.
> >> >> >
> >> >> > We use JDK 6 all the time without problems, so I don't know what you
> >> >> > are talking about here.
> >> >> >
> >> >> >> e. I was getting errors during the ant build when it tries to delete
> >> >> >> jar files from the lib directory. I don't have the source code with
> >> >> >> me right now, so I can't provide the full path.
> >> >> >
> >> >> > It sounds like you were trying to run ant while you still had
> >> >> > ManifoldCF processes running from the same tree.
> >> >> >
> >> >> >> f. The documentation advises setting MCF_Home for the
> >> >> >> example_multiprocess project, but the Documentum connector build
> >> >> >> seems to refer to this property differently from run.bat.
> >> >> >
> >> >> > Yes, this was noticed and fixed on trunk recently.
> >> >> >
> >> >> >>
> >> >> >> Can you please update the Apache ManifoldCF website with the latest
> >> >> >> installation procedures? Also, in the meantime, it would be very
> >> >> >> kind of you to send me a few notes to give me a head start on
> >> >> >> configuring ManifoldCF with Solr and the Documentum connector.
> >> >> >>
> >> >> >
> >> >> > The documentation online has been updated to be consistent with
> >> >> > trunk, so if you want to use the trunk version this might be a good
> >> >> > opportunity to help clarify the documentation.  Either that or you
> >> >> > will need to stick with the 0.4-incubating release and the
> >> >> > 0.4-incubating documentation that is part of it; we cannot at this
> >> >> > time update documentation that has already been released.
> >> >> >
> >> >> > Thanks,
> >> >> > Karl
> >> >> >
> >> >> >> Looking forward to your help.
> >> >> >>
> >> >> >> Thanks & Regards
> >> >> >> Anupam Bhattacharya
> >> >> >>
> >> >> >>
> >> >> >>
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks & Regards
> >> > Anupam Bhattacharya
> >> >
> >> >
> >
> >
> >
> >
> > --
> > Thanks & Regards
> > Anupam Bhattacharya
> >
> >
>



-- 
Thanks & Regards
Anupam Bhattacharya
