manifoldcf-dev mailing list archives

From Dileepa Jayakody <djayak...@zaizi.com>
Subject Indexing Solr documents with atomic updates using manifoldcf solr connector
Date Mon, 10 Aug 2015 13:05:20 GMT
Hi All,

We have a requirement to extract some meta-data from content documents and
index those meta-data as separate documents into a Solr index.
I'm writing a transformation connector where I construct a new repository
document adding the meta-data extracted by the connector and hand it over
to mcf-solr-connector to index in Solr.
Currently I face some difficulties with indexing these new documents in
Solr properly using solr-connector.

The new Solr document should contain atomic updates for certain
fields. So in my connector I create a JSON string representing the Solr
atomic update request and set it as the binary stream of the repository
document. The JSON string for the new Solr document is as follows:

String jsonString = "[{\"id\":\"http://dbpedia.org/resource/Africa\","
    + "\"label\":\"Africa\",\"documents\":{\"add\":\"sample2.txt\"}}]";
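
Hand-escaping the quotes in a payload like the one above is easy to get wrong, so here is a minimal sketch of building the same atomic update JSON programmatically. It uses only the Java standard library; the class name AtomicUpdateJson and the methods quote and build are hypothetical helpers made up for this illustration, not part of ManifoldCF or Solr. In a real connector a JSON library would probably be a better choice.

```java
public class AtomicUpdateJson {

    // Wrap a value in double quotes and escape the characters JSON requires.
    public static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters go out as \\uXXXX escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Build a one-document update array: two plain fields (id, label) plus one
    // field whose value is an atomic-update object of the form {modifier: value}.
    public static String build(String id, String label,
                               String fieldName, String modifier, String value) {
        return "[{" + quote("id") + ":" + quote(id) + ","
                + quote("label") + ":" + quote(label) + ","
                + quote(fieldName) + ":{" + quote(modifier) + ":" + quote(value) + "}}]";
    }

    public static void main(String[] args) {
        System.out.println(build("http://dbpedia.org/resource/Africa",
                "Africa", "documents", "add", "sample2.txt"));
    }
}
```

Called with the values from this message, build should return the same string as jsonString above.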


Then I add an id and set the above jsonString as the binary input stream of
the repository document, as follows:

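// Encode once with an explicit charset so the stream and its length agree.
byte[] jsonBytes = jsonString.getBytes( StandardCharsets.UTF_8 );
repoDoc.addField( "id", idString );
InputStream inputStream = IOUtils.toInputStream( jsonString, StandardCharsets.UTF_8 );
repoDoc.setBinary( inputStream, jsonBytes.length );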

I expected the Solr connector to send a SolrInputDocument built from the
input stream I set on the repository document. Instead, it puts the JSON
string into the 'content' field of the Solr document and sends that to
Solr.

When I monitored the HTTP request from ManifoldCF to Solr, I saw the following:

POST /solr/core1/update?wt=xml&version=2.2 HTTP/1.1
<add>
   <doc boost="1.0">
      <field name="id">http://dbpedia.org/resource/Africa</field>
      <field name="_root_">[{"id":"http://dbpedia.org/resource/Africa","label":"Africa","documents":{"add":"sample2.txt"}}]</field>
      <field name="lcf_metadata_id">http://dbpedia.org/resource/Africa</field>
   </doc>
</add>0

Please note that the 'content' field configured in ManifoldCF is *_root_*.

But the expected Solr update request from the Solr connector is as follows:

<add>
   <doc boost="1.0">
      <field name="id">http://dbpedia.org/resource/Africa</field>
      <field name="label">Africa</field>
      <field name="documents" update="add">sample2.txt</field>
      <field name="lcf_metadata_id">http://dbpedia.org/resource/Africa</field>
   </doc>
</add>0
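
To make the intended mapping concrete, here is a rough sketch of how the fields of the JSON atomic update correspond to the XML update message above: a plain value becomes an ordinary field element, and a {modifier: value} object becomes a field element with an update attribute. This is only an illustration in plain Java. The class AtomicUpdateXml and its methods are hypothetical names invented here, and nothing in it reflects how the ManifoldCF Solr connector actually builds its requests. The boost attribute and the lcf_metadata_id field are left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicUpdateXml {

    // Escape the characters that are special in XML text and attribute values.
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // fields maps a field name either to a plain String value or to a
    // Map of atomic-update modifier -> value, e.g. {"add": "sample2.txt"}.
    public static String render(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("<add>\n  <doc>\n");
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            String name = escape(field.getKey());
            Object value = field.getValue();
            if (value instanceof Map) {
                // Atomic update: one <field update="..."> element per modifier.
                for (Map.Entry<?, ?> mod : ((Map<?, ?>) value).entrySet()) {
                    sb.append("    <field name=\"").append(name)
                      .append("\" update=\"").append(escape(String.valueOf(mod.getKey())))
                      .append("\">").append(escape(String.valueOf(mod.getValue())))
                      .append("</field>\n");
                }
            } else {
                // Plain value: an ordinary field element.
                sb.append("    <field name=\"").append(name).append("\">")
                  .append(escape(String.valueOf(value))).append("</field>\n");
            }
        }
        return sb.append("  </doc>\n</add>").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", "http://dbpedia.org/resource/Africa");
        fields.put("label", "Africa");
        Map<String, String> addDoc = new LinkedHashMap<>();
        addDoc.put("add", "sample2.txt");
        fields.put("documents", addDoc);
        System.out.println(render(fields));
    }
}
```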


Can someone please advise on how to use Solr atomic updates with the
ManifoldCF Solr connector? Have I missed some configuration or arguments?

Thanks,
Dileepa
