manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Does manifoldcf supports sub entities/queries ?
Date Thu, 24 Mar 2016 12:43:54 GMT
Hi Victor,

It looks like you are configuring Solr DIH so that each individual document
is defined by a unique (email, destination) tuple.

If this is correct, some comments.

First, ManifoldCF only ever indexes a single, whole document at a time.  It
never indexes pieces of documents, and I don't believe that Solr has that
capacity either.  So everything depends on what your definition of a document is.

Second, you would want to use the MCF JDBC connector.  The JDBC connector
is relatively primitive and uses a flat model.  This requires you to
construct queries so that you define a document "ID", which in this case
would consist of the email ID PLUS the destination, plus whatever data is
required by the query.  Obviously you'd need a join in your seeding query
to produce the proper set of IDs in question.  In fact, ALL of your queries
will have to have a join in them in order to produce the information for
the email + destination ID.  Because the JDBC connector is this simple, you
may find that your underlying schema is a poor match for the queries that
would be necessary to plug into the JDBC connector and cannot support the
needed indexes.  In this case you can either modify your schema, use a
view, or write a custom connector that does more precisely what you need.
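
To make the flat model and the view option concrete, here is a minimal sketch
in Python. It uses an in-memory SQLite database as a stand-in for PostgreSQL.
Only email_details, email_id, utilisateur_id and adresse_email come from your
message; the email_recipients table, the subject column, the doc_id column and
the email_documents view are hypothetical names chosen for illustration. The
two functions only mimic the seeding and data steps; in a real job the queries
would be entered in the JDBC connector's job configuration using its own
substitution variables, which are not shown here.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL. Names not taken from the thread
# (email_recipients, subject, doc_id, email_documents) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email_details (email_id TEXT PRIMARY KEY, subject TEXT);
    CREATE TABLE email_recipients (
        email_id TEXT, utilisateur_id TEXT, adresse_email TEXT);

    INSERT INTO email_details VALUES ('m1', 'Hello'), ('m2', 'Report');
    INSERT INTO email_recipients VALUES
        ('m1', 'u1', 'a@example.org'),
        ('m1', 'u2', 'b@example.org'),
        ('m2', 'u1', 'a@example.org');

    -- One row per (email, recipient), exposed under a single flat document ID.
    CREATE VIEW email_documents AS
    SELECT d.email_id || ':' || r.utilisateur_id AS doc_id,
           d.email_id, d.subject, r.utilisateur_id, r.adresse_email
    FROM email_details d
    JOIN email_recipients r ON r.email_id = d.email_id;
""")


def seed_ids(conn):
    """Seeding step: list every document ID the crawler should know about."""
    rows = conn.execute("SELECT doc_id FROM email_documents ORDER BY doc_id")
    return [row[0] for row in rows]


def fetch_documents(conn, ids):
    """Data step: fetch the flat rows for a batch of document IDs."""
    placeholders = ",".join("?" for _ in ids)
    sql = ("SELECT doc_id, subject, adresse_email FROM email_documents "
           f"WHERE doc_id IN ({placeholders}) ORDER BY doc_id")
    return conn.execute(sql, list(ids)).fetchall()


ids = seed_ids(conn)
docs = fetch_documents(conn, ids[:2])
```

Because the join lives in the view, every query can stay a plain single-table
SELECT keyed on the composite ID, which is about all a flat connector model
needs.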

Hope this helps,
Karl


2016-03-24 7:56 GMT-04:00 Victor D'agostino <victor.d.agostino@fiducial.net>:

> Hi guys
>
> I'm testing ManifoldCF 1.10 to crawl data from a PostgreSQL database into a
> SolrCloud ensemble.
> My database is used to store emails. For each email there is one entry in a
> details table and one or more recipient entries in another table.
>
> I need help setting the data query in my crawling job.
>
> How can I avoid crawling the details once for each recipient?
> In Solr DIH this is called a sub-entity:
>
> <dataSource type="JdbcDataSource" driver="org.postgresql.Driver" [...] />
>     <document>
>         <entity name="mail"
>             query="SELECT email_id, [...] emetteur_budget FROM email_details WHERE [...]">
>             <field column="email_id" name="mail_id" />
>             [...]
>             <field column="emetteur_budget" name="emetteur_budget" />
>
>             <entity name="destinataires"
>                 query="select utilisateur_id, adresse_email, [...] where email_id='${mail.email_id}' and date='${mail.date}'">
>                 <field column="utilisateur_id" name="destinataire_ids" />
>                 <field column="adresse_email" name="destinataire_mails" />
>                 [...]
>             </entity>
>
>         </entity>
>     </document>
>
>
> Does ManifoldCF support sub-entities?
>
> Regards
> Victor
