manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: [jira] [Commented] (CONNECTORS-840) Job - Solr Mapping Improvement
Date Fri, 24 Jul 2015 13:52:21 GMT
Hi Ramanan,
MCF handles documents in a fully atomic manner. You cannot index or
track partial documents. If more than one document with the same ID is
handled by the same repository connection, only one of them will be
indexed.

From: Ramanan Sathiyanarayanan (JIRA)
Sent: 7/24/2015 8:19 AM
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-840) Job - Solr Mapping
Improvement

    [ https://issues.apache.org/jira/browse/CONNECTORS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640382#comment-14640382
]

Ramanan Sathiyanarayanan commented on CONNECTORS-840:
-----------------------------------------------------

Hi - We don't have metadata and content in one place, so we have to
write our own connector to consolidate the data from two different
sources. This works fine so far. But we need some more data from a
third source (e.g. usage data), and we want to use this new data point
in our scoring logic in Solr. This data is generated daily in a
database, and we need to use JDBCConnector to update a few fields in
Solr. Since we need to update only once a day, we don't want to make
it look like a RepositoryDocument changed and create unnecessary load
for our original connector and its backends.

1. For both of these jobs, the ID of the document will be the same.
2. Can MCF support two different jobs that use the same document ID?
3. Since the ID will be the same, can I configure the Solr output
connector to do partial updates? We could then skip the Tika endpoint,
since we are updating only a few fields directly.
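
For reference, the partial update asked about in point 3 corresponds to what Solr itself calls an atomic update, where a request names a document ID and uses a modifier such as "set" on only the fields to change. The sketch below only builds such a JSON body. It is not something the ManifoldCF Solr output connector is known to do, and the unique key name "id", the document ID "doc-123", and the field "usage_count" are hypothetical placeholders.

```python
import json


def build_atomic_update(doc_id, fields):
    """Build one Solr atomic-update document that sets only the given fields.

    Assumes the schema's unique key field is called "id" (hypothetical).
    """
    doc = {"id": doc_id}
    for name, value in fields.items():
        # "set" replaces the field value and leaves other fields untouched.
        doc[name] = {"set": value}
    return doc


# Hypothetical daily usage figure for one document.
update_docs = [build_atomic_update("doc-123", {"usage_count": 42})]
payload = json.dumps(update_docs)
print(payload)
```

A body like this would be posted to the Solr collection's update handler by a separate process, so it would run outside the crawl that owns the document.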

> Job - Solr Mapping Improvement
> ------------------------------
>
>                 Key: CONNECTORS-840
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-840
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.4.1
>            Reporter: Alessandro Benedetti
>            Assignee: Karl Wright
>            Priority: Minor
>              Labels: field, mapping, request, solr, update
>             Fix For: ManifoldCF 1.5
>
>         Attachments: CONNECTORS-840.patch
>
>
> "When you configure a job to use a Solr-type output connection, the Solr connection type
> provides a tab called "Field Mapping". The purpose of this tab is to allow you to map metadata
> fields as fetched by the job's connection type to fields that Solr is set up to receive. This
> is necessary because the names of the metadata items are often determined by the repository,
> with no alignment to fields defined in the Solr schema. You may also suppress specific metadata
> items from being sent to the index using this tab.
> Add a new mapping by filling in the "source" with the name of the metadata item from
> the repository, and "target" as the name of the output field in Solr, and click the "Add"
> button. Leaving the "target" field blank will result in all metadata items of that name not
> being sent to Solr."
> In my opinion we should change the way a metadata field is suppressed.
> The most natural way is that we express only the mappings of the metadata fields we want
> to keep.
> All the missing params will not be sent to Solr.
> The improvement will be:
> - same interface with a boolean flag in addition, this flag will specify if the missing
> metadata fields not expressed should be sent to Solr with the original names or not sent at
> all.
> In this way if we want to keep 3/100 metadata fields, we don't have to write 100 mapping
> entries, 97 empty, but simply 3 entries and activate the flag.
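
The mapping semantics proposed in the issue description can be illustrated with a small sketch. It is an illustration only, not the connector's actual code: the function name, the "keep_unmapped" flag name, and the sample field names are all hypothetical, and metadata values are simplified to single values.

```python
def apply_field_mapping(metadata, mapping, keep_unmapped):
    """Return the fields that would be sent to Solr.

    metadata:      source field name -> value, as fetched from the repository
    mapping:       source field name -> target field name; a blank target
                   suppresses the field (the existing behaviour)
    keep_unmapped: the proposed flag; True sends unmapped fields under their
                   original names, False drops them
    """
    out = {}
    for source, value in metadata.items():
        if source in mapping:
            target = mapping[source]
            if target:  # a blank target means "do not send"
                out[target] = value
        elif keep_unmapped:
            out[source] = value
    return out


# Hypothetical repository metadata and a two-entry mapping.
metadata = {"dc_title": "Report", "dc_author": "Ann", "size": 1024}
mapping = {"dc_title": "title", "dc_author": "author"}

print(apply_field_mapping(metadata, mapping, keep_unmapped=False))
print(apply_field_mapping(metadata, mapping, keep_unmapped=True))
```

With the flag off, only the mapped fields reach the index, so keeping 3 of 100 fields needs 3 entries rather than 97 blank ones.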



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)