apex-users mailing list archives

From Vlad Rozov <v.ro...@datatorrent.com>
Subject Re: balanced of Stream Codec
Date Tue, 18 Oct 2016 00:02:21 GMT
A different hash function will help only if the data is evenly 
distributed across categories. In many cases the data is skewed: some 
categories occur far more frequently than others, and a generic hash 
function will not fix that. Can you sample the data and check whether 
it is evenly distributed across categories?
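
One way to run that check, as a minimal sketch: count the tuples per category in a sample and look at the share taken by the most frequent category. The class name, method names, and sample values here are hypothetical; only the idea of sampling comes from the suggestion above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CategorySkewSample {

  // Count how many sampled tuples fall into each category.
  static Map<String, Integer> countByCategory(List<String> sampledCategories) {
    Map<String, Integer> counts = new HashMap<>();
    for (String category : sampledCategories) {
      counts.merge(category, 1, Integer::sum);
    }
    return counts;
  }

  // Fraction of the sample taken by the single most frequent category.
  static double topCategoryShare(Map<String, Integer> counts) {
    long total = 0;
    int max = 0;
    for (int count : counts.values()) {
      total += count;
      max = Math.max(max, count);
    }
    return total == 0 ? 0.0 : (double) max / total;
  }

  public static void main(String[] args) {
    // Hypothetical sample of category names taken from the input stream.
    List<String> sample = Arrays.asList(
        "login", "login", "login", "login", "login", "login",
        "payment", "payment", "signup", "search");
    Map<String, Integer> counts = countByCategory(sample);
    System.out.println("Tuples per category: " + counts);
    System.out.println("Share of top category: " + topCategoryShare(counts));
  }
}
```

If one category alone accounts for a large share of the sample, whichever partition receives it will be overloaded no matter which hash function is used.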

Vlad


On 10/16/16 10:40, Pramod Immaneni wrote:
> Hi Sunil,
>
> Have you tried a hashing function other than Java's hashCode that 
> might distribute your data more uniformly? The Google Guava library 
> provides a set of hashing strategies, such as murmur hash, that are 
> reported to produce fewer hash collisions in various cases. The link 
> below explains them:
>
> https://github.com/google/guava/wiki/HashingExplained
>
> Here is a link where someone has done a comparative study of different 
> hashing functions:
> http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
>
> If you end up choosing a hashing function from the Google Guava 
> library, make sure you use the documentation for Guava version 11.0, 
> as that version is already included in the Hadoop classpath.
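
To illustrate what a better-mixing hash changes, here is a minimal sketch that uses only the JDK rather than Guava: it passes String.hashCode() through the 32-bit finalizer step of MurmurHash3 (fmix32) so that every input bit influences the low-order bits. This is a stand-in for illustration, not the Guava API; the class and method names are hypothetical.

```java
final class MixedHash {

  // MurmurHash3 32-bit finalizer (fmix32): spreads the influence of
  // all input bits across the whole result, including the low bits.
  static int fmix32(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  // Hypothetical helper: the value a stream codec could return from
  // getPartition instead of the raw String.hashCode().
  static int partitionHash(String key) {
    return key == null ? 0 : fmix32(key.hashCode());
  }

  public static void main(String[] args) {
    // Hypothetical category names.
    for (String key : new String[] {"login", "payment", "signup"}) {
      System.out.println(key + " -> raw " + key.hashCode()
          + ", mixed " + partitionHash(key));
    }
  }
}
```

The mixing step is a one-to-one mapping on int values, so two keys with the same hashCode still collide, and it does nothing about categories that are simply more frequent than others.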
>
> Thanks
>
> On Fri, Oct 14, 2016 at 1:17 PM, Sunil Parmar 
> <sparmar@threatmetrix.com> wrote:
>
>     We’re using a stream codec to partition data consistently for
>     parallel processing across the operator partitions. Our
>     requirement is to serialize processing of the data based on a
>     particular tuple attribute; let’s call it ‘catagory_name’. To
>     process different category names in parallel, we’ve written our
>     stream codec as follows.
>
>     public class CatagoryStreamCodec extends
>         KryoSerializableStreamCodec<Object> {
>
>       private static final long serialVersionUID = -687991492884005033L;
>
>       // Partition on the hash of the tuple's name attribute.
>       @Override
>       public int getPartition(Object in) {
>         InputTuple tuple = (InputTuple) in;
>         String partitionKey = tuple.getName();
>         if (partitionKey != null) {
>           return partitionKey.hashCode();
>         }
>         // The posted snippet opened a try block with no catch and had
>         // no fallback return; returning 0 here is an assumption.
>         return 0;
>       }
>     }
>
>     It works as expected, *but* we observed uneven partitioning
>     when we ran it in our production environment with 20 partitions
>     of the operator that follows the codec in the DAG:
>
>       * Some operator instances didn’t process any data
>       * Some operator instances processed as many tuples as all the
>         others combined
>
>
>     Questions:
>
>       * Is the getPartition method supposed to return the actual
>         partition, or just a value whose lower bits are used to
>         decide the partition?
>       * The number of partitions is set in the application properties
>         and can vary between deployments or environments. Is it best
>         practice to use that property in the stream codec?
>       * Is there a recommended hash function that gives consistent
>         variation in the lower bits when the data has little variety?
>         We have 100+ categories and I’m thinking of having 10+
>         operator partitions.
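
On the first question, a minimal sketch of what routing on the lower bits would look like, assuming the platform ANDs the value returned by getPartition with a power-of-two mask. The mask width, class name, and method names are assumptions for illustration and are not taken from the Apex source.

```java
import java.util.Arrays;
import java.util.List;

final class LowerBitsRouting {

  // Keep only the lowest maskBits bits of the value returned by getPartition.
  static int bucket(int partitionValue, int maskBits) {
    int mask = (1 << maskBits) - 1;
    return partitionValue & mask;
  }

  // Count how many keys land in each of the 2^maskBits buckets.
  static int[] keysPerBucket(List<String> keys, int maskBits) {
    int[] load = new int[1 << maskBits];
    for (String key : keys) {
      load[bucket(key.hashCode(), maskBits)]++;
    }
    return load;
  }

  public static void main(String[] args) {
    // Hypothetical category names. 5 bits gives 32 buckets, the smallest
    // power of two that covers 20 partitions.
    List<String> categories =
        Arrays.asList("login", "payment", "signup", "search", "refund");
    System.out.println(Arrays.toString(keysPerBucket(categories, 5)));
  }
}
```

Running this over the real category names shows how many categories share each bucket. With only about 100 distinct keys, some buckets can end up with several categories and others with none, before any difference in category frequency is taken into account.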
>
>
>     Thanks,
>     Sunil
>
>

