asterixdb-dev mailing list archives

From abdullah alamoudi <bamou...@gmail.com>
Subject Re: Limitation of the current TweetParser
Date Tue, 23 Feb 2016 17:37:32 GMT
+1

On Tue, Feb 23, 2016 at 8:26 PM, Chen Li <chenli@gmail.com> wrote:

> If the fields provided by twitter4j are good enough, I prefer option 1.  It
> would be good to avoid a separate request to Twitter due to the overhead.
>
> Chen
>
> On Tue, Feb 23, 2016 at 12:13 AM, Jianfeng Jia <jianfeng.jia@gmail.com>
> wrote:
>
> > Good to know there is another request inside twitter4j.
> > I think given the popularity of twitter4j, if we can parse all the fields
> > in list 1 to ADM then it will be good enough.
> >
> > > On Feb 23, 2016, at 12:00 AM, abdullah alamoudi <bamousaa@gmail.com>
> > wrote:
> > >
> > > Jianfeng,
> > > > We are using the twitter4j API to get tweets as Status objects. I
> > > > believe that twitter4j itself discards the original JSON when creating
> > > > Status objects. They provide a method to get the full JSON:
> > > >
> > > > String rawJSON = DataObjectFactory.getRawJSON(status);
> > > >
> > > > This method, however, sends another request to Twitter to get the
> > > > original JSON.
> > > > We have a few choices:
> > > > 1. Be okay with what twitter4j keeps {CreatedAt, Id, Text, Source,
> > > > isTruncated, InReplyToStatusId, InReplyToUserId, InReplyToScreenName,
> > > > GeoLocation, Place, isFavorited, isRetweeted, FavoriteCount, User,
> > > > isRetweet, RetweetedStatus, Contributors, RetweetCount, isRetweetedByMe,
> > > > CurrentUserRetweetId, PossiblySensitive, Lang, Scopes,
> > > > WithheldInCountries}. However, this means that we will not get
> > > > additional fields if the actual data structure changes. We can actually
> > > > turn the Status into a JSON object using the method above and then use
> > > > our ADM parser to parse it.
> > >
> > > > 2. Instead of relying on twitter4j, we should be able to get the JSON
> > > > objects directly using HTTP requests to Twitter. This way always gives
> > > > us the complete JSON object as it comes from twitter.com, and we will
> > > > get new fields the moment they are added.
> > >
> > > > I think either way should be fine, and I actually think that we should
> > > > stick to twitter4j for now and still use a specialized tweet parser
> > > > which simply transforms the object's fields into ADM fields, unless
> > > > there is a strong need for fields that are not covered by the list in (1).
> > >
> > > My 2c,
> > > Abdullah.
> > >
> > >
> > >
> > > On Tue, Feb 23, 2016 at 3:46 AM, Jianfeng Jia <jianfeng.jia@gmail.com>
> > > wrote:
> > >
> > >> Dear devs,
> > >>
> > > >> TwitterFeedAdapter is nice, but the internal TweetParser has some
> > > >> limitations.
> > > >> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
> > > >> message fields. I need the place field. There are also other fields
> > > >> that other applications may be interested in.
> > > >> 2. The text fields always call getNormalizedString() to filter out the
> > > >> non-ASCII chars, which is a big loss of information. Even in English
> > > >> text there are emojis, which are not “normal”.
> > > >> Apparently we can add the entire Twitter structure to this parser. I’m
> > > >> wondering whether the current one-to-one mapping between Adapter and
> > > >> Parser is the best design. The Twitter data itself changes. Also, there
> > > >> are a lot of interesting open data sources, e.g. Instagram, Facebook,
> > > >> Weibo, Reddit …. Could we have a general approach for all these data
> > > >> sources?
> > >>
> > > >> I’m thinking of having some field-level JSON-to-ADM parsers
> > > >> (int, double, string, binary, point, time, polygon…). Then, given a
> > > >> schema option passed through the Adapter, we can easily assemble the
> > > >> fields into one record. The schema option could be a field mapping
> > > >> between the original JSON id and the ADM type, e.g.
> > > >> { “id”: int64, “user”: { “userid”: int64,..} }. As such, we don’t have
> > > >> to write a specific parser for each data source.
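
A rough sketch of how such a schema-driven assembly might look, to make the proposal easier to discuss. Everything here is an assumption for illustration: the class `SchemaRecordBuilder`, the `assemble` method, and the type names in `FIELD_PARSERS` are hypothetical, the JSON is assumed to be already decoded into nested `Map` objects by some JSON library, and a plain `Map` stands in for an ADM record. It is not AsterixDB's actual parser API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class SchemaRecordBuilder {
    // Field-level converters keyed by the type name used in the schema option.
    // The type names are illustrative; a real list would cover more ADM types.
    static final Map<String, Function<Object, Object>> FIELD_PARSERS = new HashMap<>();
    static {
        FIELD_PARSERS.put("int64", v -> ((Number) v).longValue());
        FIELD_PARSERS.put("double", v -> ((Number) v).doubleValue());
        FIELD_PARSERS.put("string", v -> String.valueOf(v));
    }

    // Walks the schema and copies only the listed fields from the decoded
    // JSON object, converting each value with the parser for its type.
    @SuppressWarnings("unchecked")
    static Map<String, Object> assemble(Map<String, Object> json, Map<String, Object> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            Object raw = json.get(field.getKey());
            if (raw == null) {
                continue; // field absent in this object: leave it out of the record
            }
            Object type = field.getValue();
            if (type instanceof Map) {
                // Nested schema: recurse into the nested JSON object.
                record.put(field.getKey(),
                        assemble((Map<String, Object>) raw, (Map<String, Object>) type));
            } else {
                Function<Object, Object> parser = FIELD_PARSERS.get((String) type);
                if (parser == null) {
                    throw new IllegalArgumentException("No parser for type: " + type);
                }
                record.put(field.getKey(), parser.apply(raw));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> userSchema = new HashMap<>();
        userSchema.put("userid", "int64");
        Map<String, Object> schema = new HashMap<>();
        schema.put("id", "int64");
        schema.put("user", userSchema);

        Map<String, Object> user = new HashMap<>();
        user.put("userid", 7);
        user.put("name", "someone");
        Map<String, Object> json = new HashMap<>();
        json.put("id", 42);
        json.put("text", "hello");
        json.put("user", user);

        // Only "id" and "user.userid" are listed in the schema, so "text"
        // and "user.name" are not copied into the record.
        System.out.println(assemble(json, schema));
    }
}
```

With this shape, supporting another data source would mean supplying a different schema map rather than writing a new parser class, and a user could replace a single entry in `FIELD_PARSERS` to override one field-level conversion.
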
> > >>
> > > >> Another thought is to just pass the JSON object through as it is and
> > > >> rely on the user’s UDF to parse the data. Again, even in this case, the
> > > >> user can selectively override the field parsers that differ from ours.
> > >>
> > >> Any thoughts?
> > >>
> > >>
> > >> Best,
> > >>
> > >> Jianfeng Jia
> > >> PhD Candidate of Computer Science
> > >> University of California, Irvine
> > >>
> > >>
> >
> >
> >
> > Best,
> >
> > Jianfeng Jia
> > PhD Candidate of Computer Science
> > University of California, Irvine
> >
> >
>
