mahout-dev mailing list archives

From "Sergey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie
Date Sun, 30 Mar 2014 16:29:14 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954737#comment-13954737 ]

Sergey commented on MAHOUT-1498:
--------------------------------

So I've replaced all occurrences of
{code}
DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
{code}
with
{code}
DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf);
{code}
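
To illustrate why the replacement above keeps the jars, here is a minimal, Hadoop-free sketch of the two behaviours. It assumes that addCacheFile appends to the existing "mapred.cache.files" value with a comma, while setCacheFiles replaces it (the latter is what the source quoted in the issue description shows). The class name CacheFilesSketch, the plain Map standing in for Configuration, and the HDFS paths are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheFilesSketch {
    // Property name taken from the setCacheFiles source quoted in this issue.
    static final String CACHE_FILES = "mapred.cache.files";

    // Mimics setCacheFiles: the property is replaced wholesale,
    // so anything registered earlier (e.g. jars pushed by oozie) is lost.
    static void setCacheFiles(String[] uris, Map<String, String> conf) {
        conf.put(CACHE_FILES, String.join(",", uris));
    }

    // Mimics addCacheFile (assumed behaviour): the new URI is appended
    // to whatever is already registered.
    static void addCacheFile(String uri, Map<String, String> conf) {
        String existing = conf.get(CACHE_FILES);
        conf.put(CACHE_FILES, existing == null ? uri : existing + "," + uri);
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        String jar = "hdfs:///libs/mahout-math.jar";
        String dictionary = "hdfs:///tmp/dictionary.file-0";

        Map<String, String> appended = new HashMap<>();
        appended.put(CACHE_FILES, jar);
        addCacheFile(dictionary, appended);
        System.out.println("addCacheFile:  " + appended.get(CACHE_FILES));

        Map<String, String> replaced = new HashMap<>();
        replaced.put(CACHE_FILES, jar);
        setCacheFiles(new String[] {dictionary}, replaced);
        System.out.println("setCacheFiles: " + replaced.get(CACHE_FILES));
    }
}
```

With the appending variant the jar entry survives next to the dictionary file; with the replacing variant only the dictionary file is left.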

Now my jars are no longer thrown out of the distributed cache. These jars are needed by the
subsequent MR job submissions.
I've also modified several reducers, because they expected to find a single file in the
distributed cache. Here is an example:
{code}
//TFPartialVectorReducer
@Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        URI[] localFiles = DistributedCache.getCacheFiles(conf);
        Preconditions.checkArgument(localFiles != null && localFiles.length >= 1,
                "missing paths from the DistributedCache");

        dimension = conf.getInt(PartialVectorMerger.DIMENSION, Integer.MAX_VALUE);
        sequentialAccess = conf.getBoolean(PartialVectorMerger.SEQUENTIAL_ACCESS, false);
        namedVector = conf.getBoolean(PartialVectorMerger.NAMED_VECTOR, false);
        maxNGramSize = conf.getInt(DictionaryVectorizer.MAX_NGRAMS, maxNGramSize);

        //Path dictionaryFile = new Path(localFiles[0].getPath());
        Path dictionaryFile = getPathToDictionaryFile(localFiles);
        // key is word value is id
        for (Pair<Writable, IntWritable> record
                : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
            dictionary.put(record.getFirst().toString(), record.getSecond().get());
        }
    }

    private Path getPathToDictionaryFile(URI[] localFiles) {
        for (URI distCacheFile : localFiles) {
            System.out.println("getPathToDictionaryFile ::: "
                    + (distCacheFile == null ? null : distCacheFile.toString()));
            if (distCacheFile != null && distCacheFile.toString().contains("dictionary.file")) {
                System.out.println("getPathToDictionaryFile ::: looks like [" + distCacheFile
                        + "] is a dictionary we need");
                return new Path(distCacheFile.getPath());
            }
        }
        URI lastUri = localFiles[localFiles.length - 1];
        System.out.println("getPathToDictionaryFile ::: didn't find dict file. "
                + "Trying to return the last one [" + lastUri.toString() + "]");
        return new Path(lastUri.getPath());
    }
{code}
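
For comparison, here is a stricter variant of the lookup as a standalone sketch. This is not what the patch above does: it matches only on the file name and throws instead of falling back to the last entry, so a missing dictionary fails fast rather than reading some unrelated jar as a sequence file. The class name DictionaryPathSketch and the method findDictionary are hypothetical; only the "dictionary.file" marker comes from the reducer code above.

```java
import java.net.URI;

public class DictionaryPathSketch {
    // Name fragment the reducer above looks for.
    static final String MARKER = "dictionary.file";

    // Returns the first cache entry whose file name contains MARKER.
    // Throws if there is none, instead of guessing.
    static URI findDictionary(URI[] cacheFiles) {
        if (cacheFiles == null) {
            throw new IllegalArgumentException("missing paths from the DistributedCache");
        }
        for (URI candidate : cacheFiles) {
            if (candidate == null || candidate.getPath() == null) {
                continue; // skip empty slots and opaque URIs
            }
            String path = candidate.getPath();
            String fileName = path.substring(path.lastIndexOf('/') + 1);
            if (fileName.contains(MARKER)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no " + MARKER + " entry in the DistributedCache");
    }
}
```

In the reducer this could be used as `new Path(findDictionary(localFiles).getPath())`, assuming the dictionary chunks keep the "dictionary.file" name.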

I'm not sure whether this is a good or a bad approach, but my oozie action now runs without
any problems. Here is the workflow action:
{code}
<action name="run-mahout-item_info_catalog_category_id">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/staging/working/mahout/run-mahout-item_info_catalog_category_id/out"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>org.apache.mahout.vectorizer.SparseVectorsFromSequenceFilesDirtyHack</main-class>

            <arg>-Ddfs.blocksize=1m</arg>

            <arg>--input</arg>
            <arg>${nameNode}/staging/working/mahout/prepare-item_info_catalog_category_id/out</arg>

            <arg>--output</arg>
            <arg>${nameNode}/staging/working/mahout/run-mahout-item_info_catalog_category_id/out</arg>

            <arg>-ow</arg>

            <arg>-x</arg>
            <arg>70</arg>

            <arg>-ng</arg>
            <arg>4</arg>

            <arg>-n</arg>
            <arg>2</arg>

            <arg>-seq</arg>

            <arg>-wt</arg>
            <arg>TFIDF</arg>
        </java>
        <ok to="mahout-join-node"/>
        <error to="kill"/>
    </action>
{code}



> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie
> -----------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1498
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1498
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: mahout-core-0.7-cdh4.4.0.jar
>            Reporter: Sergey
>
> Hi, I get this exception:
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> {code}
> It looks like this happens because of the DictionaryVectorizer.makePartialVectors method.
> It contains this code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overwrites the jars that oozie pushed with the job, because setCacheFiles simply replaces the property:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>          String sfiles = StringUtils.uriToString(files);
>          conf.set("mapred.cache.files", sfiles);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
