asterixdb-dev mailing list archives

From abdullah alamoudi <>
Subject Re: Limitation of the current TweetParser
Date Tue, 23 Feb 2016 08:00:51 GMT
We are using the twitter4j API to get tweets as Status objects. I believe
that twitter4j itself discards the original JSON when creating Status
objects. They provide a method to get the full JSON:

String rawJSON = DataObjectFactory.getRawJSON(status);

This method, however, sends another request to Twitter to get the original
JSON.
We have a few choices:
1. Be okay with what twitter4j keeps: {CreatedAt, Id, Text, Source,
isTruncated, InReplyToStatusId, InReplyToUserId, InReplyToScreenName,
GeoLocation, Place, isFavorited, isRetweeted, FavoriteCount, User,
isRetweet, RetweetedStatus, Contributors, RetweetCount, isRetweetedByMe,
CurrentUserRetweetId, PossiblySensitive, Lang, Scopes, WithheldInCountries}.
However, this means we will not get additional fields if the actual
data structure changes. We could convert this into a JSON object using
the method above and then use our ADM parser to parse it.

2. Instead of relying on twitter4j, we should be able to get the JSON
objects directly using HTTP requests to Twitter. This way we always get
the complete JSON object as it comes from Twitter, and we will pick up new
fields the moment they are added.
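For illustration, a minimal sketch of the HTTP route (the endpoint URL and bearer token here are placeholders; real access to the streaming API requires proper OAuth credentials, which this sketch omits):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class StreamRequestSketch {
    // Hypothetical endpoint; the real streaming API requires OAuth signing.
    static final String SAMPLE_STREAM =
            "https://stream.twitter.com/1.1/statuses/sample.json";

    static HttpRequest buildRequest(String bearerToken) {
        return HttpRequest.newBuilder(URI.create(SAMPLE_STREAM))
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("PLACEHOLDER_TOKEN");
        System.out.println(req.uri().getHost()); // stream.twitter.com
        // An HttpClient would then read the response body line by line;
        // each line is one complete tweet JSON object as Twitter sent it.
    }
}
```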

I think either way should be fine. I actually think we should stick
with twitter4j for now and still use a specialized tweet parser which
simply transforms the object's fields into ADM fields, unless there is a
strong need for fields that are not covered by the list in (1).
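As a rough illustration of what such a specialized parser would do (the toAdm helper and the field names below are hypothetical, not AsterixDB's actual parser API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TweetToAdmSketch {
    // Toy sketch: turn a handful of tweet fields (as they would come from
    // Status getters) into an ADM-style record constructor string.
    static String toAdm(long id, String text, String lang) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("tweetid", "int64(\"" + id + "\")");
        fields.put("message-text", "\"" + text.replace("\"", "\\\"") + "\"");
        fields.put("lang", "\"" + lang + "\"");
        StringBuilder sb = new StringBuilder("{ ");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(", ");
            sb.append("\"").append(e.getKey()).append("\": ").append(e.getValue());
            first = false;
        }
        return sb.append(" }").toString();
    }

    public static void main(String[] args) {
        System.out.println(toAdm(42L, "hello", "en"));
        // { "tweetid": int64("42"), "message-text": "hello", "lang": "en" }
    }
}
```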

My 2c,

On Tue, Feb 23, 2016 at 3:46 AM, Jianfeng Jia <> wrote:

> Dear devs,
> TwitterFeedAdapter is nice, but the internal TweetParser has some
> limitations.
> 1. We only pick a few JSON fields, e.g. the user, geolocation, and message
> fields. I need the place field, and there are other fields that other
> applications may also be interested in.
> 2. The text fields always go through getNormalizedString(), which filters
> out non-ASCII characters, a big loss of information. Even for English
> text there are emojis which are not “normal”.
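To make the loss concrete, here is a toy reimplementation of an ASCII-only filter of the kind described (not the actual getNormalizedString() code):

```java
public class NormalizeDemo {
    // Hypothetical ASCII-only filter, mimicking the behavior described
    // above: any character outside the ASCII range is silently dropped.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tweet = "good game 👍 café";
        // The emoji (a surrogate pair) and the accented character vanish.
        System.out.println(normalize(tweet));
    }
}
```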
> Apparently we can add the entire Twitter structure into this parser. I’m
> wondering whether the current one-to-one mapping between Adapter and
> Parser is the best design. The Twitter data itself changes. Also, there
> are a lot of interesting open data sources, e.g. Instagram, Facebook,
> Weibo, Reddit…. Could we have a general approach for all these data
> sources?
> I’m thinking of having some field-level JSON-to-ADM parsers
> (int, double, string, binary, point, time, polygon…). Then, given the
> schema option through the Adapter, we can easily assemble the fields into
> one record. The schema option could be a mapping between the original
> JSON field and the ADM type, e.g. { “id”: int64, “user”: { “userid”:
> int64, … } }. As such, we don’t have to write a specific parser for each
> data source.
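The idea could be sketched roughly like this (all names are made up for illustration; an int64 schema entry here maps a raw string field to a Java long):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class SchemaDrivenParserSketch {
    // Registry of field-level parsers keyed by ADM type name.
    static final Map<String, Function<String, Object>> PARSERS = new LinkedHashMap<>();
    static {
        PARSERS.put("int64", Long::parseLong);
        PARSERS.put("double", Double::parseDouble);
        PARSERS.put("string", s -> s);
    }

    // schema: JSON field name -> ADM type; raw: field name -> raw text value.
    static Map<String, Object> assemble(Map<String, String> schema,
                                        Map<String, String> raw) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : schema.entrySet()) {
            String value = raw.get(e.getKey());
            if (value != null) {
                record.put(e.getKey(), PARSERS.get(e.getValue()).apply(value));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int64");
        schema.put("text", "string");
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("id", "123");
        raw.put("text", "hello");
        raw.put("extra", "ignored"); // fields not in the schema are skipped
        System.out.println(assemble(schema, raw)); // {id=123, text=hello}
    }
}
```

Nested records (like the “user” sub-object above) would need a recursive variant of assemble, but the flat case already shows how one registry could serve many data sources.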
> Another thought is to just pass the JSON object through as it is and rely
> on the user’s UDF to parse the data. Even in this case, the user can
> selectively override the field parsers that differ from ours.
> Any thoughts?
> Best,
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
