beam-commits mailing list archives

From "SungJunyoung (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-1439) Beam Example(s) exploring public document datasets
Date Mon, 20 Mar 2017 15:47:41 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932063#comment-15932063 ]

SungJunyoung edited comment on BEAM-1439 at 3/20/17 3:47 PM:
-------------------------------------------------------------

The current Beam example counts the occurrences of each word in Shakespeare's works.
This is, of course, a good illustration of how basic pipeline construction works in Beam. However,
the data is static, so the example does not show how Beam handles streaming
data. What about examples that read from streaming sources such as Kafka or Spark? For example, you
could write your computer's input log to Kafka, read it into a Beam pipeline, and then compute statistics
on your input habits. What do you think?

Of course, ideas for large-scale pipelines can continue to be processed in parallel, like **Beam** :).


was (Author: wnsdud1861):
The current Beam example counts the number of occurrences of a word for Shakespeare's work.
This, of course, is a good indication of how Beam's basic pipeline construction works. However,
this data is static, and does not show the characteristics of Beam that handles streaming
data. What about example sources with streaming data like Kafka or Spark? For example, you
could save your computer's input log to Kafka, convert it to a Beam, and then perform statistics
on your input habits. What do you think about this?

Of course, ideas for large-scale pipelines will continue in processing in parallel like **Beam**
:).

> Beam Example(s) exploring public document datasets
> --------------------------------------------------
>
>                 Key: BEAM-1439
>                 URL: https://issues.apache.org/jira/browse/BEAM-1439
>             Project: Beam
>          Issue Type: Wish
>          Components: examples-java
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Minor
>              Labels: gsoc2017, java, mentor, python
>
> In Beam, we have examples illustrating counting the occurrences of words and performing
a basic TF-IDF analysis on the works of Shakespeare (or whatever you point it at). It would
be even cooler to do these analyses, and more, on a much larger data set that is really the
subject of current investigations.
> In chatting with professors at the University of Washington, I've learned that scholars
in many fields would really like to explore new and highly customized ways of processing the
growing body of publicly available scholarly documents, such as PubMed Central. One example is a query like
"show me documents where chemical compounds X and Y were both used in the 'method' section".
> So I propose a Google Summer of Code project wherein a student writes some large-scale
Beam pipelines to perform analyses such as term frequency, bigram frequency, etc.
> Skills required:
>  - Java or Python
>  - (nice to have) Working through the Beam getting started materials
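
For illustration only, the term-frequency and bigram-frequency analyses named in the proposal can be sketched in plain Python. This is not Beam code and not part of the issue: the function names, the tokenization rule, and the sample documents are all assumptions made for this sketch. In an actual Beam pipeline, the per-document tokenization and the counting would presumably be expressed as parallel transforms instead of loops.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (hypothetical rule)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(docs):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def bigram_frequency(docs):
    """Count adjacent word pairs; pairs never span two documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up sample documents, standing in for a real corpus.
docs = ["to be or not to be", "to be is to do"]
tf = term_frequency(docs)
bf = bigram_frequency(docs)
print(tf.most_common(2))
print(bf.most_common(1))
```

Both functions process each document independently and then merge counts, which is the property that would let the same analysis be split across workers on a larger corpus.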



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
