mahout-dev mailing list archives

From "Mat Kelcey (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly
Date Wed, 28 Dec 2011 06:34:30 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-937:
------------------------------

    Description: 
The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations)
uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner. 

This partitioner has an instance variable, offset, that is supposed to be set by a call to setOffsets(),
but this call is never made (not sure why; is this method expected to be called by the Hadoop
framework itself?).

Because the offset is never set, getPartition always returns 0, and so all intermediate
data is sent to a single reducer.
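
To make the effect concrete, here is a minimal, Hadoop-free sketch. The original GramKeyPartitioner source is not shown in this message, so constantPartition below is only a hypothetical stand-in for a partitioner that always returns 0; SkewSketch, hashPartition, countByPartition and the sample keys are likewise made up for illustration.

```java
import java.util.Arrays;

final class SkewSketch {

  // Hypothetical stand-in for a partitioner whose offset was never set:
  // every key maps to partition 0.
  static int constantPartition(String key, int numPartitions) {
    return 0;
  }

  // A plain hash partitioner, for comparison. The sign bit is cleared so
  // the modulo result is never negative.
  static int hashPartition(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Counts how many keys each partition (i.e. each reducer) would receive.
  static int[] countByPartition(String[] keys, int numPartitions, boolean useHash) {
    int[] counts = new int[numPartitions];
    for (String key : keys) {
      int p = useHash ? hashPartition(key, numPartitions)
                      : constantPartition(key, numPartitions);
      counts[p]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    String[] keys = {"best", "of", "times", "worst", "age", "wisdom", "foolishness", "epoch"};
    // With the constant partitioner, all 8 keys land in partition 0.
    System.out.println(Arrays.toString(countByPartition(keys, 4, false)));
    // With the hash partitioner, keys can spread over the 4 partitions.
    System.out.println(Arrays.toString(countByPartition(keys, 4, true)));
  }
}
```

With the constant partitioner the job still produces correct output, but only one reducer does any work, which matches the behaviour reported here.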

I couldn't quite understand what this partitioning was meant to be doing, but simply hashing
the Gram's primary string representation (i.e. without the leading 'type' byte) does what is
required:

{code}
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {

  @Override
  public int getPartition(GramKey key, Gram value, int numPartitions) {
    // Exclude the first byte, which is the key type, so that only the
    // primary string bytes contribute to the hash.
    byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength() - 1];
    System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);

    int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
    // Clear the sign bit so the modulo result is never negative.
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

}
{code}
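
As a rough illustration of what skipping the type byte buys, the following Hadoop-free sketch hashes plain byte arrays. Everything in it is an assumption made for the example: the key layout (one type byte followed by the UTF-8 bytes of the gram), the type byte values 1 and 2, and the names TypeByteSketch, encode and partition. The loop is a simple 31-multiplier rolling hash that starts at index 1; it is a stand-in for WritableComparator.hashBytes, not a claim about that method's implementation.

```java
import java.nio.charset.StandardCharsets;

final class TypeByteSketch {

  // Assumed key layout for this sketch: [type byte][UTF-8 bytes of the gram].
  static byte[] encode(byte type, String gram) {
    byte[] text = gram.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[text.length + 1];
    out[0] = type;
    System.arraycopy(text, 0, out, 1, text.length);
    return out;
  }

  // Hashes bytes 1..primaryLength-1, skipping the type byte at index 0,
  // then maps the hash onto a non-negative partition number.
  static int partition(byte[] keyBytes, int primaryLength, int numPartitions) {
    int hash = 1;
    for (int i = 1; i < primaryLength; i++) {
      hash = 31 * hash + keyBytes[i];
    }
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    byte[] a = encode((byte) 1, "new york"); // hypothetical type value
    byte[] b = encode((byte) 2, "new york"); // same gram, different type
    int pa = partition(a, a.length, 7);
    int pb = partition(b, b.length, 7);
    // Both keys go to the same partition because the type byte is ignored.
    System.out.println(pa == pb);
  }
}
```

Under these assumptions, two keys that share a primary string but differ in type are routed to the same reducer, while different strings can still spread across reducers.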




    
> Collocations Job Partitioner not being configured properly
> ----------------------------------------------------------
>
>                 Key: MAHOUT-937
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-937
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: GramKeyPartitioner.java
>
>


        
