lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
Date Sun, 08 Dec 2013 17:06:38 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842555#comment-13842555
] 

Steve Rowe commented on SOLR-1301:
----------------------------------

The Maven Jenkins build on trunk has been failing for a while because {{com.sun.jersey:jersey-bundle:1.8}},
a morphlines-core dependency, causes {{ant validate-maven-dependencies}} to fail - here's
a log excerpt from the most recent failure [https://builds.apache.org/job/Lucene-Solr-Maven-trunk/1046/console]:

{noformat}
     [echo] Building solr-map-reduce...

-validate-maven-dependencies.init:

-validate-maven-dependencies:
[artifact:dependencies] [INFO] snapshot org.apache.solr:solr-cell:5.0-SNAPSHOT: checking for
updates from maven-restlet
[artifact:dependencies] [INFO] snapshot org.apache.solr:solr-cell:5.0-SNAPSHOT: checking for
updates from releases.cloudera.com
[artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-cell:5.0-SNAPSHOT:
checking for updates from maven-restlet
[artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-cell:5.0-SNAPSHOT:
checking for updates from releases.cloudera.com
[artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-core:5.0-SNAPSHOT:
checking for updates from maven-restlet
[artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-core:5.0-SNAPSHOT:
checking for updates from releases.cloudera.com
[artifact:dependencies] An error has occurred while processing the Maven artifact tasks.
[artifact:dependencies]  Diagnosis:
[artifact:dependencies] 
[artifact:dependencies] Unable to resolve artifact: Unable to get dependency information:
Unable to read the metadata file for artifact 'com.sun.jersey:jersey-bundle:jar': Cannot find
parent: com.sun.jersey:jersey-project for project: null:jersey-bundle:jar:null for project
null:jersey-bundle:jar:null
[artifact:dependencies]   com.sun.jersey:jersey-bundle:jar:1.8
[artifact:dependencies] 
[artifact:dependencies] from the specified remote repositories:
[artifact:dependencies]   central (http://repo1.maven.org/maven2),
[artifact:dependencies]   releases.cloudera.com (https://repository.cloudera.com/artifactory/libs-release),
[artifact:dependencies]   maven-restlet (http://maven.restlet.org),
[artifact:dependencies]   Nexus (http://repository.apache.org/snapshots)
[artifact:dependencies] 
[artifact:dependencies] Path to dependency: 
[artifact:dependencies] 	1) org.apache.solr:solr-map-reduce:jar:5.0-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies] 
[artifact:dependencies] Not a v4.0.0 POM. for project com.sun.jersey:jersey-project at /home/hudson/.m2/repository/com/sun/jersey/jersey-project/1.8/jersey-project-1.8.pom
{noformat}

I couldn't reproduce the failure locally.

Turns out the parent POM in question, at {{/home/hudson/.m2/repository/com/sun/jersey/jersey-project/1.8/jersey-project-1.8.pom}},
has the wrong contents:

{noformat}
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/0.6.39</center>
</body>
</html>
{noformat}

I replaced it by manually downloading the correct POM and its checksum file from Maven
Central and putting them in the hudson user's local Maven repository.

[~markrmiller@gmail.com]: While investigating this failure, I tried dropping the triggering
Ivy dependency com.sun.jersey:jersey-bundle, and all enabled tests still pass.  Are you okay
with dropping this dependency?  The description from the POM says:

{code:xml}
<description>
A bundle containing code of all jar-based modules that provide JAX-RS and Jersey-related features.
Such a bundle is *only intended* for developers that do not use Maven's dependency system.
The bundle does not include code for contributes, tests and samples.
</description>
{code}

Sounds like it's a sneaky replacement for transitive dependencies?  IMHO, if we need some
of the classes this jar provides, we should declare direct dependencies on the appropriate
artifacts.
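
For illustration, here's a rough sketch of what declaring the individual artifacts in the morphlines-core {{ivy.xml}} might look like. Which Jersey jars are actually needed (and what conf mappings they should get) would have to be verified against the code, so the artifact list below is an assumption, not a recommendation:

{code:xml}
<!-- Hypothetical replacement for com.sun.jersey:jersey-bundle:1.8 in the
     <dependencies> section: declare only the individual Jersey artifacts
     we actually use. The exact set (core/server/client) is a guess and
     needs to be confirmed against the morphlines-core sources. -->
<dependency org="com.sun.jersey" name="jersey-core" rev="1.8" transitive="false"/>
<dependency org="com.sun.jersey" name="jersey-server" rev="1.8" transitive="false"/>
<dependency org="com.sun.jersey" name="jersey-client" rev="1.8" transitive="false"/>
{code}

(Whether that also sidesteps the corrupt-parent-POM problem is a separate question - the individual artifacts may well declare the same {{jersey-project}} parent.)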

> Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Andrzej Bialecki 
>            Assignee: Mark Miller
>             Fix For: 5.0, 4.7
>
>         Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch,
SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar,
hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch,
log4j-1.2.15.jar
>
>
> This patch contains a contrib module that provides distributed indexing (using Hadoop)
to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat
consumes data produced by reduce tasks directly, without storing it in intermediate files.
Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts
as there are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn
uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer,
and it also instantiates an implementation of SolrDocumentConverter, which is responsible
for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch,
which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes and the
OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from
which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there
were reduce tasks. The output shards are placed in the output directory on the default filesystem
(e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally,
users can specify the number of reduce tasks, in particular 1 reduce task, in which case the
output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API.
It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should
put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved
for release under the Apache License.



