My use case is simple: count the unique users in a sliding window. The common solution I found
on the Internet is to use a HashSet to filter out duplicate users, like this:
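// Imports assumed (Storm 0.9.x package names); the snippet I found omitted them.
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

public class Distinct extends BaseFilter {
    private static final long serialVersionUID = 1L;
    // Every id seen so far is kept here; nothing is ever removed.
    private Set<String> distincter = Collections.synchronizedSet(new HashSet<String>());

    @Override
    public boolean isKeep(TridentTuple tuple) {
        String id = this.getId(tuple);
        // add() returns false for an id that is already in the set, so duplicates are dropped.
        return distincter.add(id);
    }

    // Builds the id by concatenating all fields of the tuple (assumes every field is a String).
    public String getId(TridentTuple t) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            sb.append(t.getString(i));
        }
        return sb.toString();
    }
}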
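To make the behaviour concrete, here is a minimal standalone sketch of the same idea without any Storm dependency. The class name DistinctSketch, the trackedIds() helper and the sample ids are made up for illustration; only the Set.add() behaviour is taken from the filter above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the Trident filter above (not part of Storm).
public class DistinctSketch {
    private final Set<String> seen = Collections.synchronizedSet(new HashSet<String>());

    // Returns true only the first time an id is offered.
    public boolean isKeep(String id) {
        return seen.add(id);
    }

    // Number of ids currently held in memory.
    public int trackedIds() {
        return seen.size();
    }

    public static void main(String[] args) {
        DistinctSketch d = new DistinctSketch();
        String[] ids = {"alice", "bob", "alice", "carol", "bob"}; // made-up sample data
        int unique = 0;
        for (String id : ids) {
            if (d.isKeep(id)) {
                unique++;
            }
        }
        System.out.println("unique users: " + unique);             // prints "unique users: 3"
        System.out.println("ids held in memory: " + d.trackedIds()); // prints "ids held in memory: 3"
    }
}
```

The count is correct, but the set holds one entry per distinct id and never shrinks.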
However, the HashSet is kept in memory, so when the data grows very large I think
it will cause an OutOfMemoryError (OOM).
So is there a scalable solution?
2014-07-14
唐思成