From: Flavio Pompermaier
Date: Wed, 24 Jul 2013 00:20:00 +0200
Subject: Re: SolrCell help!
To: user
Cc: Tom White

Unfortunately I'm not at work right now; I'll try as soon as possible!

On Tue, Jul 23, 2013 at 7:48 PM, Wolfgang Hoschek <whoschek@cloudera.com> wrote:
Seems like a transient mvn repo problem. Can you try again?

Wolfgang.

On Jul 23, 2013, at 1:36 AM, Flavio Pompermaier wrote:

> Still having problems building CDK Data Core Module 0.4.2-SNAPSHOT. Maven hangs at:
>
> Downloading: https://repository.cloudera.com/artifactory/cloudera-repos/com/twitter/parquet-avro/1.0.0-SNAPSHOT/maven-metadata.xml
> Downloading: https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-avro/1.0.0-SNAPSHOT/maven-metadata.xml
> Jul 23, 2013 10:35:41 AM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
> INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out
> Jul 23, 2013 10:35:41 AM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
> INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out
> Jul 23, 2013 10:35:41 AM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
> INFO: Retrying request
> Jul 23, 2013 10:35:41 AM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
> INFO: Retrying request
>
>
>
> On Tue, Jul 23, 2013 at 10:33 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> Sorry, this is caused by our mirror. I'll remove it and retry.
>
>
> On Tue, Jul 23, 2013 at 10:31 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>
> I still get this error:
>
>  Failed to read artifact descriptor for commons-daemon:commons-daemon:jar:1.0.3: Could not transfer artifact commons-daemon:commons-daemon:pom:1.0.3 from/to repo (http://dev.okkam.it/artifactory/repo): Failed to transfer file: http://dev.okkam.it/artifactory/repo/commons-daemon/commons-daemon/1.0.3/commons-daemon-1.0.3.pom. Return code is: 409 -> [Help 1]
>
>
> On Tue, Jul 23, 2013 at 10:22 AM, Wolfgang Hoschek <whoschek@cloudera.com> wrote:
> Tests pass on Java 6 but fail on Java 7. Accordingly, I have filed https://issues.cloudera.org/browse/CDK-80. We'll fix it. Meanwhile, please try Java 6.
>
> Wolfgang.
>
> On Jul 23, 2013, at 12:51 AM, Flavio Pompermaier wrote:
>
> > I tried to download the current trunk but it doesn't compile. For example, it hangs on
> > https://repository.cloudera.com/artifactory/cloudera-repos/com/twitter/parquet-avro/1.0.0-SNAPSHOT/maven-metadata.xml
> > which doesn't exist anymore.
> >
> >
> > On Mon, Jul 22, 2013 at 11:14 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > You couldn't be more precise ;)
> >
> > Thanks,
> > Flavio
> >
> > On Mon, Jul 22, 2013 at 11:02 PM, Wolfgang Hoschek <whoschek@cloudera.com> wrote:
> > Docs for the xquery and xslt morphline commands are here (look for "xquery"): https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
> >
> > Example morphlines for the new xquery and xslt commands are here: https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-morphlines
> >
> > Sample input data is here: https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-documents
> >
> > Unit tests are here: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-saxon/src/test/java/com/cloudera/cdk/morphline/saxon/SaxonMorphlineTest.java
> >
> > Wolfgang.
> >
> > On Jul 22, 2013, at 1:41 PM, Flavio Pompermaier wrote:
> >
> > > OK, I'll try to follow the code! Just one last thing: for morphine-neon I managed to find the tests (in the cdk repository), but for the new xslt and xquery commands I can't find the test code. Could you give me a pointer?
> > >
> > > On Mon, Jul 22, 2013 at 9:21 PM, Wolfgang Hoschek <whoschek@cloudera.com> wrote:
> > > There are many tests for this in the morphlines repo.
> > >
> > > Wolfgang.
> > >
> > > On Jul 22, 2013, at 11:43 AM, Flavio Pompermaier wrote:
> > >
> > > >
> > > > Thank you for the great support Wolfgang!
> > > > Flume + Morphlines is undoubtedly an exciting road but it's taking me too much time :(
> > > > Do you think you could add some more tests including readJson and the new xquery and xslt in trunk?
> > > >
> > > > Best,
> > > > Flavio
> > > > On Mon, Jul 22, 2013 at 8:12 PM, Wolfgang Hoschek <whoschek@cloudera.com> wrote:
> > > > Looks like the DcXMLParser spits out a metadata field called "title" and another title as part of the Tika XML stream. That metadata field is then added to the solr document by solrcell. If you add "title" to the captures, the title from the XML stream gets added as well by solrcell.
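The effect described above (one "title" from the parser metadata plus a second one captured from the XML stream) can be sketched in plain Python. This is only an illustration of the idea, not SolrCell or morphline code: the function names `merge_sources` and `collapse_single_valued` are hypothetical, and the sample values are assumed from the values reported in this thread (the real title and one empty string).

```python
# Hypothetical illustration only: plain Python, not SolrCell or morphline code.
from collections import defaultdict


def merge_sources(metadata, captured):
    """Combine parser metadata and captured XML-stream content into one multi-valued record."""
    doc = defaultdict(list)
    for source in (metadata, captured):
        for field, values in source.items():
            doc[field].extend(values)
    return dict(doc)


def collapse_single_valued(doc, single_valued):
    """Drop empty and duplicate values; fail if a single-valued field still has several."""
    result = {}
    for field, values in doc.items():
        cleaned = []
        for value in values:
            value = value.strip()
            if value and value not in cleaned:
                cleaned.append(value)
        if field in single_valued:
            if len(cleaned) > 1:
                raise ValueError(
                    f"multiple values for single-valued field {field!r}: {cleaned}"
                )
            if cleaned:
                result[field] = cleaned[0]
        else:
            result[field] = cleaned
    return result


# Assumed sample values: the real title from the metadata, an empty one from the capture.
metadata = {"title": ["Tika test document"]}
captured = {"title": [""]}

merged = merge_sources(metadata, captured)
collapsed = collapse_single_valued(merged, single_valued={"title"})
print(merged)
print(collapsed)
```

The merged record carries two values for "title", which is what a field declared with multiValued="false" rejects. Filtering empty and duplicate values before indexing is one possible workaround under these assumptions; whether it fits depends on where in the pipeline the values can be intercepted.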
> > > >
> > > > JSON support has been released in morphlines-0.4.1 (which flume trunk is now depending on): http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
> > > >
> > > > Note that Tika XML doesn't really support/capture XPath extraction with SolrCell. We have added proper support for reading, extracting and transforming XML and HTML with XPath, XQuery and XSLT on the current morphlines trunk (not yet released), similar to the way we already support JSON and Avro. This should make XML handling a lot more straightforward, and make the very limited XML SolrCell approach obsolete. Look for the new "xquery" and "xslt" commands in https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
> > > >
> > > > Meanwhile, consider using these new commands, or use JSON or Avro, or write your own custom morphline commands that extract whatever you want from your XML data.
> > > >
> > > > Wolfgang.
> > > >
> > > > On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:
> > > >
> > > > > Hi to all,
> > > > > I'm trying to understand how to "master" Morphline configuration files in order to put some data into Solr, but I'm facing some problems with TestMorphlineSolrSink. This is what I have done:
> > > > >
> > > > > 1) Since I want to index the title of testXML.xml (i.e. "Tika test document"), I commented out all the parsers except org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core metadata)
> > > > > 2) In schema.xml I added the following field:
> > > > >     <field name="title" type="text_en" indexed="true" stored="true" multiValued="false" />
> > > > >
> > > > > But:
> > > > >  - If I don't add anything to fmap or capture, everything works fine, but I don't understand why (who fills that field?). If instead I add title to capture and/or title: title (or dc_title:title) to fmap, Solr complains that 2 values are retrieved for 'title' (debugging the values, I see the title and one empty value in the 'title' metadata array...).
> > > > > Thus, the problem is that everything works magically if the field is named title, but if I change its name to something like doc_title there's no way to make it non-multivalued. Am I right? How can I fix this problem?
> > > > > - I'd like to manage JSON files. How can I map JSON fields to Solr fields? Could someone give a simple example?
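The JSON question above comes down to a mapping from paths in the JSON input to flat Solr field names. A minimal Python sketch of that idea follows; it is not morphline syntax (the readJson command mentioned elsewhere in this thread is the morphlines route), and the names `FIELD_PATHS`, `extract_path` and `to_solr_fields`, as well as the sample record, are hypothetical.

```python
import json

# Hypothetical mapping: Solr field name -> slash-separated path in the JSON input.
FIELD_PATHS = {
    "id": "/id",
    "doc_title": "/metadata/title",
    "tags": "/tags",
}


def extract_path(node, path):
    """Walk a slash-separated path through nested dicts; return None if a step is missing."""
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def to_solr_fields(raw_json, field_paths):
    """Parse one JSON record and keep only the mapped fields, renamed to Solr field names."""
    record = json.loads(raw_json)
    doc = {}
    for solr_field, path in field_paths.items():
        value = extract_path(record, path)
        if value is not None:
            doc[solr_field] = value
    return doc


# Assumed sample input, made up for illustration.
sample = (
    '{"id": "42", "metadata": {"title": "Tika test document"}, '
    '"tags": ["xml", "test"], "ignored": true}'
)
solr_doc = to_solr_fields(sample, FIELD_PATHS)
print(solr_doc)
```

Keeping the mapping in one table makes it easy to see which Solr fields are single-valued strings (here "id" and "doc_title") and which receive a list (here "tags"), which is the same distinction that caused the multi-valued error with 'title'.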
> > > > >
> > > > > Best,
> > > > > Flavio
> > > >
> --
>
> Flavio Pompermaier
> Development Department
> _______________________________________________
> OKKAMSrl - www.okkam.it
>
> Phone: +(39) 0461 283 702
> Fax: + (39) 0461 186 6433
> Email: f.pompermaier@okkam.it
> Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> Registered office: Trento (Italy), via Segantini 23
>
> Confidentiality notice. This e-mail transmission may contain legally privileged and/or confidential information. Please do not read it if you are not the intended recipient(s). Any use, distribution, reproduction or disclosure by any other person is strictly prohibited. If you have received this e-mail in error, please notify the sender and destroy the original transmission and its attachments without reading or saving it in any manner.
