asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Limitation of the current TweetParser
Date Tue, 23 Feb 2016 18:01:48 GMT
+1

On 2/23/16 9:26 AM, Chen Li wrote:
> If the fields provided by twitter4j are good enough, I prefer option 1.  It
> would be good to avoid a separate request to Twitter due to the overhead.
>
> Chen
>
> On Tue, Feb 23, 2016 at 12:13 AM, Jianfeng Jia <jianfeng.jia@gmail.com>
> wrote:
>
>> Good to know there is another request inside twitter4j.
>> I think given the popularity of twitter4j, if we can parse all the fields
>> in list 1 to ADM then it will be good enough.
>>
>>> On Feb 23, 2016, at 12:00 AM, abdullah alamoudi <bamousaa@gmail.com>
>> wrote:
>>> Jianfeng,
>>> We are using the twitter4j API to get tweets as Status objects. I believe
>>> that twitter4j itself discards the original JSON when creating Status
>>> objects. They provide a method to get the full JSON:
>>>
>>> String rawJSON = DataObjectFactory.getRawJSON(status);
>>>
>>> This method however sends another request to Twitter to get the original
>>> JSON.
>>> We have a few choices:
>>> 1. Be okay with what twitter4j keeps {CreatedAt, Id, Text, Source,
>>> isTruncated, InReplyToStatusId, InReplyToUserId, InReplyToScreenName,
>>> GeoLocation, Place, isFavorited, isRetweeted, FavoriteCount, User,
>>> isRetweet, RetweetedStatus, Contributors, RetweetCount, isRetweetedByMe,
>>> CurrentUserRetweetId, PossiblySensitive, Lang, Scopes,
>>> WithheldInCountries}.
>>> However, this means that we will not get additional fields if the actual
>>> data structure changes. We can also turn this into a JSON object using
>>> the method above and then use our ADM parser to parse it.
>>>
>>> 2. Instead of relying on twitter4j, we should be able to get the JSON
>>> objects directly using HTTP requests to Twitter. This way always gives us
>>> the complete JSON object as it comes from twitter.com, and we will get
>>> new fields the moment they are added.
>>>
>>> I think either way should be fine, and I actually think that we should
>>> stick to twitter4j for now and still use a specialized tweet parser which
>>> will simply transform the object's fields into ADM fields, unless there
>>> is a strong need for fields that are not covered by the list in (1).
>>>
>>> My 2c,
>>> Abdullah.
>>>
>>>
>>>
>>> On Tue, Feb 23, 2016 at 3:46 AM, Jianfeng Jia <jianfeng.jia@gmail.com>
>>> wrote:
>>>
>>>> Dear devs,
>>>>
>>>> TwitterFeedAdapter is nice, but the internal TweetParser has some
>>>> limitations.
>>>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
>>>> message fields. I need the place field. There are also other fields that
>>>> other applications may be interested in.
>>>> 2. The text fields always call getNormalizedString() to filter out the
>>>> non-ASCII chars, which is a big loss of information. Even in English
>>>> text there are emojis, which are not “normal”.
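
To make the information loss in point 2 concrete, here is a minimal sketch of an ASCII-only filter. `stripNonAscii` is a hypothetical stand-in written for illustration; it is not the actual getNormalizedString() implementation, whose exact behavior is only assumed from the description above.

```java
public class AsciiLossDemo {
    // Hypothetical stand-in for an ASCII-only normalizer: keeps only code
    // points up to 0x7F. This is an assumption, not AsterixDB's code.
    public static String stripNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp <= 0x7F)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // An English tweet with an emoji (U+1F600) and an accented letter.
        String tweet = "Great game \uD83D\uDE00 caf\u00E9";
        System.out.println(stripNonAscii(tweet));
    }
}
```

Both the emoji and the accented letter disappear from the output, even though the tweet is otherwise plain English.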
>>>>
>>>> Apparently we can add the entire Twitter structure into this parser. I’m
>>>> wondering whether the current one-to-one mapping between Adapter and
>>>> Parser is the best design. The Twitter data itself changes. There are
>>>> also a lot of interesting open data sources, e.g. Instagram, Facebook,
>>>> Weibo, Reddit ….  Could we have a general approach for all these data
>>>> sources?
>>>>
>>>> I’m thinking of having some field-level JSON-to-ADM parsers
>>>> (int, double, string, binary, point, time, polygon…). Then, given a schema
>>>> option passed through the Adapter, we can easily assemble the fields into
>>>> one record. The schema option could be a mapping from each original JSON
>>>> field name to its ADM type, e.g. { “id”: int64, “user”: { “userid”: int64,..} }.
>>>> As such, we don’t have to write a specific parser for each data source.
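
The schema-driven idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class, the `FIELD_PARSERS` table, and `assemble` are hypothetical names, the raw input is simplified to a flat map of field name to string value (nested fields such as “user” are not handled), and the output is a plain Java map rather than a real ADM record.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class SchemaDrivenParser {
    // Hypothetical field-level parsers, keyed by the type name used in the schema.
    static final Map<String, Function<String, Object>> FIELD_PARSERS = new HashMap<>();
    static {
        FIELD_PARSERS.put("int64", s -> Long.parseLong(s));
        FIELD_PARSERS.put("double", s -> Double.parseDouble(s));
        FIELD_PARSERS.put("string", s -> s);
    }

    // Assembles one record by applying the parser named in the schema to each
    // raw field. Fields not listed in the schema are ignored; schema fields
    // missing from the input are skipped.
    public static Map<String, Object> assemble(Map<String, String> schema,
                                               Map<String, String> rawFields) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : schema.entrySet()) {
            String raw = rawFields.get(entry.getKey());
            if (raw == null) {
                continue;
            }
            Function<String, Object> parser = FIELD_PARSERS.get(entry.getValue());
            if (parser == null) {
                throw new IllegalArgumentException("No parser for type: " + entry.getValue());
            }
            record.put(entry.getKey(), parser.apply(raw));
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int64");
        schema.put("text", "string");

        Map<String, String> raw = new HashMap<>();
        raw.put("id", "42");
        raw.put("text", "hello");
        raw.put("lang", "en"); // not in the schema, so it is dropped

        System.out.println(assemble(schema, raw));
    }
}
```

Supporting a new data source would then mean supplying a different schema map, not writing a new parser class.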
>>>>
>>>> Another thought is to just pass the JSON object through as it is and
>>>> rely on the user’s UDF to parse the data. Even in this case, users can
>>>> selectively override the field parsers whose behavior should differ from
>>>> ours.
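
One way to picture selective overriding is a table of default per-field parsers that user-supplied entries replace. The sketch below is hypothetical throughout: `defaults`, `withOverrides`, and the field names are illustrative and do not correspond to any existing AsterixDB API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ParserOverrides {
    // Hypothetical built-in parsers, keyed by JSON field name.
    public static Map<String, Function<String, Object>> defaults() {
        Map<String, Function<String, Object>> parsers = new HashMap<>();
        parsers.put("id", s -> Long.parseLong(s));
        parsers.put("text", s -> s);
        return parsers;
    }

    // Returns the defaults with every user-supplied parser taking precedence.
    public static Map<String, Function<String, Object>> withOverrides(
            Map<String, Function<String, Object>> userParsers) {
        Map<String, Function<String, Object>> merged = new HashMap<>(defaults());
        merged.putAll(userParsers);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Function<String, Object>> user = new HashMap<>();
        user.put("text", s -> s.trim()); // user wants whitespace trimmed

        Map<String, Function<String, Object>> parsers = withOverrides(user);
        System.out.println(parsers.get("text").apply("  hello  "));
        System.out.println(parsers.get("id").apply("7"));
    }
}
```

Fields without an override keep the built-in behavior, so a user only writes code for the fields they care about.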
>>>>
>>>> Any thoughts?
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>>
>>>>
>>
>>
>> Best,
>>
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>>
>>

