Date: Tue, 11 Aug 2015 10:14:31 +0530
Subject: Re: Indexing Solr documents with atomic updates using manifoldcf solr connector
From: Dileepa Jayakody
To: dev@manifoldcf.apache.org

Hi Karl,

Thanks for your response.
My requirement is to index child documents, constructed from the content repository document, as separate Solr documents.
So adding meta-data fields to the original repository document wouldn't help in my scenario, AFAIU. My transformation connector is somewhat similar to the Stanbol transformation connector proposed in the ManifoldCF JIRA [1].

What I referred to as meta-data is the Named Entity Recognition (NER) data extracted from the repository document. Each content repository document will have multiple NER child documents. These NERs are expected to be indexed as separate Solr documents, each with a mapping to the parent content repository document it was extracted from. So apart from indexing the content repository document in Solr, I need to index all NER child documents, with their attributes, as separate documents in Solr.

The example in my earlier message (quoted below) shows how I create a child repository document for an NER. I set the entire NER document as the binary stream of the child repository document, which is then sent to the mcf-solr connector.

In the mcf-solr connector (in the HttpPoster class), when the Solr document is built from the repository document's input stream, the stream's contents are added as a string to the content field of the Solr document, as configured in the solr-connector:

    buildSolrDocument(long length, InputStream is) {
      if (contentAttributeName != null) {
        // Decode the whole binary stream as UTF-8 text.
        Reader r = new InputStreamReader(is, Consts.UTF_8);
        StringBuilder sb = new StringBuilder((int)length);
        char[] buffer = new char[65536];
        while (true) {
          int amt = r.read(buffer, 0, buffer.length);
          if (amt == -1)
            break;
          sb.append(buffer, 0, amt);
        }
        // The decoded text becomes the value of one field (the content field).
        outputDoc.addField(contentAttributeName, sb.toString());
      }
      ....
    }

Therefore the solr-connector sends the JSON update request I constructed in my connector as a field value of the Solr document, not as the whole Solr document.

Can you please give me some advice on how to index nested child documents in Solr using ManifoldCF?
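To make the behaviour described above concrete, here is a minimal sketch that uses only the JDK. It is not connector code: the class name ContentFieldSketch, the method buildDocument, and the Map standing in for the Solr document are all assumptions made for illustration. It only mirrors the read loop quoted above, to show that whatever is placed in the binary stream ends up as the value of a single field.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ContentFieldSketch {

  // Hypothetical stand-in for the connector logic: the Map plays the role
  // of the Solr document, and the whole stream becomes one field value.
  static Map<String, Object> buildDocument(String id, String contentAttributeName,
      InputStream is, long length) throws IOException {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("id", id);
    if (contentAttributeName != null) {
      Reader r = new InputStreamReader(is, StandardCharsets.UTF_8);
      StringBuilder sb = new StringBuilder((int) length);
      char[] buffer = new char[65536];
      int amt;
      while ((amt = r.read(buffer, 0, buffer.length)) != -1) {
        sb.append(buffer, 0, amt);
      }
      // The stream text is stored as-is; it is never parsed as JSON.
      doc.put(contentAttributeName, sb.toString());
    }
    return doc;
  }

  public static void main(String[] args) throws IOException {
    String json = "[{\"id\":\"http://dbpedia.org/resource/Africa\","
        + "\"label\":\"Africa\",\"documents\":{\"add\":\"sample2.txt\"}}]";
    byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
    Map<String, Object> doc = buildDocument("http://dbpedia.org/resource/Africa",
        "_root_", new ByteArrayInputStream(bytes), bytes.length);
    // Only "id" and the content field exist; "label" and "documents" do not.
    System.out.println(doc.keySet());
    System.out.println(json.equals(doc.get("_root_")));
  }
}
```

In this sketch the keys "label" and "documents" never become fields of their own, which matches the request I observed on the wire.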
Thanks,
Dileepa

[1] https://issues.apache.org/jira/browse/CONNECTORS-1181

On Mon, Aug 10, 2015 at 6:47 PM, Karl Wright wrote:

> Hi Dileepa,
>
> In order for ManifoldCF to index metadata, you need to set metadata field
> values in the RepositoryDocument object, not send Solr JSON as the
> document's content. In fact from your example it looks like you want zero
> content.
>
> Please read the RepositoryDocument Javadoc to see how you set metadata.
>
> Karl
>
>
> On Mon, Aug 10, 2015 at 9:05 AM, Dileepa Jayakody wrote:
>
> > Hi All,
> >
> > We have a requirement to extract some meta-data from content documents
> > and index that meta-data as separate documents in a Solr index.
> > I'm writing a transformation connector where I construct a new repository
> > document, adding the meta-data extracted by the connector, and hand it
> > over to mcf-solr-connector to index in Solr.
> > Currently I face some difficulties with indexing these new documents in
> > Solr properly using solr-connector.
> >
> > The new Solr document should contain some atomic updates for certain
> > fields. So in my connector I create a JSON string to represent the Solr
> > atomic update request and set it as the binaryStream of the repository
> > document. The JSON string for the new Solr document is as below:
> >
> > String jsonString = "[{\"id\":\"http://dbpedia.org/resource/Africa\",\"label\":\"Africa\",\"documents\":{\"add\":\"sample2.txt\"}}]";
> >
> > Then I add an id and set the above jsonString as the binary input stream
> > of the repo-document as follows:
> >
> > repoDoc.addField( "id", idString );
> > InputStream inputStream = IOUtils.toInputStream( jsonString );
> > repoDoc.setBinary(inputStream, jsonString.getBytes().length);
> >
> > The expected behavior is that the Solr connector sends the
> > SolrInputDocument constructed from the inputStream I added to the
> > repo-document in my connector. But instead it adds the JSON string to the
> > 'content' field of the solr-document and sends that to Solr.
> >
> > When I monitored the HTTP request from manifold to Solr I see below;
> >
> > POST /solr/core1/update?wt=xml&version=2.2 HTTP/1.1
> >
> > http://dbpedia.org/resource/Africa
> > [{"id":"http://dbpedia.org/resource/Africa","label":"Africa","documents":{"add":"sample2.txt"}}]
> > http://dbpedia.org/resource/Africa
> >
> > 0
> >
> > Please note that the 'content' field configured in manifoldcf is *_root_*.
> >
> > But the expected Solr update request from solr-connector should be as
> > below;
> >
> > http://dbpedia.org/resource/Africa
> > Africa
> > sample2.txt
> > http://dbpedia.org/resource/Africa
> >
> > 0
> >
> > Can someone please give some advice on how to use solr atomic updates
> > with manifoldcf solr-connector? Have I missed some
> > configurations/arguments?
> >
> > Thanks,
> > Dileepa
> >
> > --
> >
> > ------------------------------
> > This message should be regarded as confidential. If you have received
> > this email in error please notify the sender and destroy it immediately.
> > Statements of intent shall only become binding when confirmed in hard
> > copy by an authorised signatory.
> >
> > Zaizi Ltd is registered in England and Wales with the registration number
> > 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
> > London W6 7AN.
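A side note on the snippet in the quoted message: assuming IOUtils there is the Commons IO class, both IOUtils.toInputStream(jsonString) and jsonString.getBytes() use the platform default charset, whereas the connector code decodes the stream as UTF-8. The following is a minimal sketch, using only the JDK, of building the same JSON with escaped quotes and encoding it once as UTF-8 so that the stream and the length passed alongside it always agree. The class name AtomicUpdatePayload and its helper methods are hypothetical, and the escaping covers only backslashes and double quotes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class AtomicUpdatePayload {

  // Escapes backslashes and double quotes only; control characters are
  // not handled, so this is not a general-purpose JSON encoder.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  // Builds the JSON from the quoted message: one document whose
  // "documents" field carries an "add" modifier.
  static String toJson(String id, String label, String documentName) {
    return "[{\"id\":\"" + escape(id) + "\",\"label\":\"" + escape(label)
        + "\",\"documents\":{\"add\":\"" + escape(documentName) + "\"}}]";
  }

  public static void main(String[] args) throws Exception {
    String json = toJson("http://dbpedia.org/resource/Africa", "Africa", "sample2.txt");

    // Encode once, then derive both the stream and its length from the
    // same byte array so the two can never disagree.
    byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
    InputStream in = new ByteArrayInputStream(bytes);

    System.out.println(json);
    System.out.println(bytes.length == in.available());
  }
}
```

With these bytes, the call from the quoted message would become repoDoc.setBinary(in, bytes.length). This changes only how the stream is encoded and measured, not how the connector treats its contents.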