manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR
Date Mon, 27 Feb 2012 18:22:02 GMT
If you've got a mix of data and only some of it comes through
ManifoldCF, you can still use the ManifoldCF-generated URL for those
that originate with ManifoldCF.  This should even work for documents
from the JCIFS connector - even though the default URLs from this
connector are "file:" style, there's a mapping you can set up for
documents from that connector that maps to a URL format of your
choice.  Similarly, most JDBC document URLs can readily be constructed
as part of the database queries that you provide for the job.  So it
does not sound like your servlet would have to do anything custom for
any of the data that comes from ManifoldCF at this time, as long as
you define your connections and jobs with some care as to the URLs
they will produce.
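
As a rough illustration of the kind of mapping described above, here is a minimal sketch of a regular-expression rewrite from a "file:" style URL to an HTTP URL. This is plain Java, not ManifoldCF's own API or configuration syntax. The class name UrlMapping, the pattern, the "file://///server/share/path" input shape, and the http://search.example.com/docs/ target are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlMapping {
    // Hypothetical match pattern: "file:" followed by any number of slashes,
    // then the server name (group 1), then the remaining path (group 2).
    public static final Pattern FILE_URL =
        Pattern.compile("^file:/+([^/]+)/(.*)$");

    // Hypothetical replacement: the servlet location is an assumption.
    public static final String REPLACEMENT =
        "http://search.example.com/docs/$1/$2";

    // Rewrites the URL if it matches the pattern; otherwise returns it unchanged,
    // so URLs from other connectors (e.g. SharePoint) pass straight through.
    public static String map(String url, Pattern match, String replacement) {
        Matcher m = match.matcher(url);
        return m.matches() ? m.replaceFirst(replacement) : url;
    }

    public static void main(String[] args) {
        System.out.println(map("file://///fileserver/share/reports/q1.pdf",
                               FILE_URL, REPLACEMENT));
        System.out.println(map("http://sharepoint.example.com/sites/a.docx",
                               FILE_URL, REPLACEMENT));
    }
}
```

Note that this sketch does no URL encoding, so paths containing spaces or other reserved characters would need extra handling.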

Thanks,
Karl


On Mon, Feb 27, 2012 at 11:25 AM, Matthew Parker
<mparker@apogeeintegration.com> wrote:
> Karl,
>
> I'm importing data from a number of sources, including SharePoint, file
> shares, and an ORACLE database. The files/records are indexed by SOLR.
>
> Right now, some of the import is done through custom SOLR Data Import
> Handler facilities. I'm hoping to move away from that in the future.
>
> We are also aggregating some of the file share data into custom views on the
> web client. Lots of preprocessing.
>
> All of this is stored in the SOLR index with metadata describing how to
> display it within our custom web client. If the result is a certain type,
> we have custom templates that are displayed as a result of that.
>
> Manifold is a good solution for the SharePoint data. We don't really do any
> custom processing on it other than strip HTML from the text.
> It's the database and file share information that adds some challenges. I'm
> hoping to get SOLR out of the text processing pipeline, and just
> let it index data. We are moving to Pentaho at some point, and we'll
> probably handle most of the custom metadata processing there.
> At some point, we'll possibly integrate Pentaho as an output connection in
> Manifold.
>
> Thanks,
>
> Matt
>
> On Mon, Feb 27, 2012 at 10:04 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Please see my response interleaved below.
>>
>> On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
>> <mparker@apogeeintegration.com> wrote:
>> > I'm trying to push data into SOLR.
>> >
>> > Is there a way to transform the metadata coming in from different data
>> > sources like SharePoint, and the File Share, prior to posting it into
>> > SOLR?
>> >
>>
>> In general, ManifoldCF does not have data transformation abilities.
>> With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
>> extract content from documents and to perform transformations to
>> document metadata etc.  At some point ManifoldCF may be able to do
>> more transformations itself, in order to support search engines that
>> don't have a pipeline, but that is currently not available.
>>
>> > For instance, documents have metadata specifying their file path. I need
>> > to transform that to a URL I can use within SOLR to retrieve that
>> > document through a servlet that I wrote.
>> >
>>
>> The ManifoldCF model is that a connector creates a URL for each
>> document that it indexes, using whatever makes sense for that
>> particular repository to get you back to the document in question.
>> So, for instance, Documentum documents will use URLs that point at
>> Documentum's Webtop web application.
>>
>> It would be helpful to understand more precisely what you are trying
>> to do.  You could, for instance, modify your servlet to redirect to
>> the ManifoldCF-generated URL.  It gets indexed into Solr as the "id"
>> field.
>>
>> > Also, based on specific metadata that I'm seeing in the documents, I
>> > might want to conditionally populate other fields in the SOLR index.
>> >
>>
>> That sounds like a job for the Tika pipeline to me.
>>
>> Thanks,
>> Karl
>>
>> > ------------------------------
>> > This e-mail and any files transmitted with it may be proprietary.
>> > Please note that any views or opinions presented in this e-mail are
>> > solely those of the author and do not necessarily represent those of
>> > Apogee Integration.
>> >
>
