asterixdb-dev mailing list archives

From Jianfeng Jia <jianfeng....@gmail.com>
Subject Limitation of the current TweetParser
Date Tue, 23 Feb 2016 00:46:50 GMT
Dear devs,

TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message fields. I need the place
field, and there are other fields that other applications may be interested in as well.
2. The text fields always go through getNormalizedString(), which filters out the non-ASCII
characters. That is a big loss of information; even English text contains emojis, which are not
"normal". (A quick sketch of the effect follows this list.)
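
To illustrate the information loss, here is a small Java sketch of an ASCII-only filter. This is just my assumption of what the normalization effectively does, not the actual getNormalizedString() code:

public class NormalizeSketch {
    // Keep only printable ASCII; everything else (accents, CJK, emoji) is dropped.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Great game 🎉 café" becomes "Great game  caf": the emoji and the
        // accented character are silently lost.
        System.out.println(normalize("Great game \uD83C\uDF89 caf\u00E9"));
    }
}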

Obviously we could add the entire Twitter structure to this parser, but I'm wondering whether the
current one-to-one mapping between Adapter and Parser is the best design. The Twitter data format
itself changes over time, and there are many other interesting open data sources, e.g. Instagram,
Facebook, Weibo, Reddit, etc. Could we have a general approach that covers all of these data sources?

I'm thinking of having a set of field-level JSON-to-ADM parsers (int, double, string, binary, point, time, polygon, …).
Then, given a schema option through the Adapter, we could easily assemble the fields into one
record. The schema option could be a mapping from the original JSON field names to ADM types,
e.g. { "id": int64, "user": { "userid": int64, … } }. That way we wouldn't have to
write a specific parser for each data source. A minimal sketch of the idea follows.
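
Here is a rough Java sketch of what I mean. All names (FieldParser, SchemaDrivenSketch, etc.) are made up for illustration, not existing AsterixDB classes, and nested schemas are omitted for brevity:

import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaDrivenSketch {
    interface FieldParser { String toAdm(Object jsonValue); }

    // One reusable parser per ADM type; point, time, polygon, binary, ...
    // would be registered the same way.
    static final Map<String, FieldParser> PARSERS = Map.of(
        "int64",  v -> String.valueOf(((Number) v).longValue()),
        "double", v -> String.valueOf(((Number) v).doubleValue()),
        "string", v -> "\"" + v + "\""
    );

    // Assemble one record from an already-decoded JSON object, keeping only
    // the fields named in the schema option.
    static String toAdmRecord(Map<String, Object> json, Map<String, String> schema) {
        StringBuilder out = new StringBuilder("{ ");
        boolean first = true;
        for (Map.Entry<String, String> field : schema.entrySet()) {
            Object value = json.get(field.getKey());
            if (value == null) continue; // optional field absent in this record
            if (!first) out.append(", ");
            out.append("\"").append(field.getKey()).append("\": ")
               .append(PARSERS.get(field.getValue()).toAdm(value));
            first = false;
        }
        return out.append(" }").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> tweet = new LinkedHashMap<>();
        tweet.put("id", 1234567890123L);
        tweet.put("text", "hello");
        tweet.put("retweet_count", 42); // not in the schema, so skipped

        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int64");
        schema.put("text", "string");

        // Prints: { "id": 1234567890123, "text": "hello" }
        System.out.println(toAdmRecord(tweet, schema));
    }
}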

Another thought is to hand over the JSON object as-is and rely on the user's UDF to parse the
data. Even in this case, the user could selectively override the few field parsers whose behavior
should differ from ours. A sketch of this alternative follows.
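
A rough sketch of that alternative (again, all names are made up for illustration, and the string handling is deliberately naive; a real UDF would use a JSON library):

public class RawJsonUdfSketch {
    // The adapter hands the raw JSON through unchanged; a user-supplied UDF
    // decides how to turn it into a record.
    interface TweetUdf { String toAdm(String rawJson); }

    public static void main(String[] args) {
        // Default behavior: pass the record through as-is.
        TweetUdf identity = raw -> raw;

        // A user override that keeps only the "text" field.
        TweetUdf textOnly = raw -> {
            int i = raw.indexOf("\"text\":");
            int start = raw.indexOf('"', i + 7) + 1;
            int end = raw.indexOf('"', start);
            return "{ \"text\": \"" + raw.substring(start, end) + "\" }";
        };

        String tweet = "{ \"id\": 42, \"text\": \"hello\" }";
        System.out.println(identity.toAdm(tweet));   // the whole record
        System.out.println(textOnly.toAdm(tweet));   // { "text": "hello" }
    }
}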

Any thoughts?


Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

