streams-dev mailing list archives

From "Matthew Hager (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (STREAMS-79) RegEx Extractor Module
Date Tue, 13 May 2014 16:27:15 GMT

    [ https://issues.apache.org/jira/browse/STREAMS-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996564#comment-13996564
] 

Matthew Hager commented on STREAMS-79:
--------------------------------------

This is a very well-known problem. I would suggest looking into Lucene's library to extract
these tokens. While this is very straightforward for a language like English, Spanish, or
even Russian, it gets much more complicated when working with languages like Chinese, Japanese,
and Hindi.

Twitter had this exact same problem, used Lucene to solve it, and saw an 8x improvement
in performance. I can point you to some examples if that would be helpful.
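
For reference, here is a minimal sketch of the plain regex approach the issue describes, written
in Python only for brevity. The patterns and the extract_entities name are hypothetical and are
not taken from the Streams code base or from Lucene. It is meant to show where simple patterns
work and where they start to break down, not to be a definitive implementation.

```python
import re

# Hypothetical patterns for illustration; not from Streams or Lucene.
URL_RE = re.compile(r"https?://\S+")
# (?<!\w) keeps "a@b.com" or "abc#def" from being read as a mention or hashtag.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def extract_entities(text):
    """Return the links, hashtags and @mentions found in a piece of text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so a fragment such as "#frag" is not read as a hashtag.
    remainder = URL_RE.sub(" ", text)
    return {
        "urls": urls,
        "hashtags": HASHTAG_RE.findall(remainder),
        "mentions": MENTION_RE.findall(remainder),
    }


if __name__ == "__main__":
    print(extract_entities("Reading https://example.org/a#frag with @matt #streams"))
    # In text without spaces the whole run after "#" is captured,
    # because the pattern has no notion of where the word ends.
    print(extract_entities("#東京に行く"))
```

The second call shows the limitation mentioned above: for Japanese the pattern captures the
entire unsegmented run after the "#", which is the kind of case where a language-aware
tokenizer is likely to do better than a fixed regular expression.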

> RegEx Extractor Module
> ----------------------
>
>                 Key: STREAMS-79
>                 URL: https://issues.apache.org/jira/browse/STREAMS-79
>             Project: Streams
>          Issue Type: New Feature
>            Reporter: Matt Franklin
>
> Some data sources do not separate out shared links, hashtags and @mentions.  This module
> will use predefined regular expressions to parse the content of an Activity object to extract
> these entities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
