streams-dev mailing list archives

From Steve Blackmon <st...@blackmon.org>
Subject Re: [DISCUSS] Continuing the Momentum
Date Fri, 18 Apr 2014 21:30:36 GMT
Basically yes - as well as sequentially applying processors such as
Lucene and Tika to enrich the data prior to indexing.  In our
implementation, filtering happens primarily at query time.  Streams
collects the data from multiple sources, applies enrichments, writes
to our data repositories, and lets us replay specific portions of the
workflow in batch mode through Hadoop if the pipeline changes.
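
As a rough illustration of that flow (a minimal sketch, not Streams code: it is
written in Python for brevity, and every name in it, such as run_pipeline,
add_word_count and tag_source, is hypothetical), data is collected from several
sources, processors are applied to each datum in order, and filtering is
deferred to query time:

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence

Datum = Dict[str, object]          # hypothetical: any serializable document
Processor = Callable[[Datum], Datum]


def add_word_count(datum: Datum) -> Datum:
    """Hypothetical enrichment standing in for a Lucene/Tika-style step."""
    return {**datum, "word_count": len(str(datum.get("text", "")).split())}


def tag_source(datum: Datum) -> Datum:
    """Hypothetical enrichment that normalizes the source label."""
    return {**datum, "source": str(datum.get("source", "unknown")).lower()}


def run_pipeline(sources: Iterable[Iterable[Datum]],
                 processors: Sequence[Processor]) -> Iterator[Datum]:
    """Collect data from every source and apply each processor in order."""
    for source in sources:
        for datum in source:
            for process in processors:
                datum = process(datum)
            yield datum


# Two made-up sources with one document each.
twitter = [{"source": "Twitter", "text": "hello streams"}]
email = [{"source": "Gmail", "text": "one two three"}]

# Everything is enriched and written to the repository unfiltered.
repository = list(run_pipeline([twitter, email], [add_word_count, tag_source]))

# Filtering happens later, at query time, against the stored data.
long_docs = [d for d in repository if d["word_count"] > 2]
```

Because the processors are plain functions over a datum, the same list can be
re-run over stored data when the pipeline changes, which is the idea behind the
batch replay mentioned above.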

I'm very interested in developing a simple pattern to encapsulate the
contributed modules into a web/API runtime (like tomcat / camel) as
well, which can serve as a runtime container for sourcing data into
our streams via webhooks, and enable message passing with loose
coupling between multiple streams installations.

Steve

On Fri, Apr 18, 2014 at 3:55 PM, Danny Sullivan <dsullivan7@hotmail.com> wrote:
> If we're not assuming all data coming in is activitystrea.ms formatted data, will the
> flow through the application be:
> 1. take in any connectors (Twitter, Facebook, Gmail...)
> 2. format data into activitystrea.ms formatted objects
> 3. filter for the most relevant of these activity objects
> 4. output the most relevant in a single stream
> ?
>> Date: Fri, 18 Apr 2014 09:32:43 -0400
>> Subject: Re: [DISCUSS] Continuing the Momentum
>> From: jletourneau80@gmail.com
>> To: dev@streams.incubator.apache.org
>>
>> FWIW - I think there are some really interesting use cases in the
>> enterprise that follow the "Real-time Processing for Activity Data
>> Streams" idea.  Centralized logging, like Splunk or LogStash, also seems to
>> be a very compelling use of Streams.  It could be more focused on user
>> generated activity than those solutions, i.e. processing all user activity
>> on a host (processes started, login attempts, etc.), but I think it can
>> play in the same space very well based on the direction it's heading.
>>
>> Jason
>>
>>
>> On Thu, Apr 17, 2014 at 11:27 PM, Steve Blackmon <steve@blackmon.org> wrote:
>>
>> > Chris, I think you are right that the group should focus our efforts,
>> > and that online activities (broadly defined) are the sweet spot.  I
>> > just wouldn't want to give potential users or contributors the idea
>> > that Streams is just for ActivityStreams - which I at least associate
>> > with small data sets.  At least they look small viewed through Jira,
>> > Jive, and similar tools.  Streams is also a big data processing engine
>> > which can take advantage of the best features of storm or yarn while
>> > significantly reducing the learning curve and code complexity of those
>> > frameworks.
>> >
>> > So long as the website makes it clear that activity data is a concept
>> > and Streams can work regardless of how the data and metadata are
>> > shaped, I'm cool with "Real-time Processing for Activity Data Streams"
>> > as a tagline.
>> >
>> > Steve
>> >
>> > On Thu, Apr 17, 2014 at 8:04 PM, Chris Geer <chris@cxtsoftware.com> wrote:
>> > > On Thu, Apr 17, 2014 at 9:32 AM, Steve Blackmon <steve@blackmon.org> wrote:
>> > >
>> > >> >> Target audience is our potential users.  Technical in nature, but it
>> > >> >> still needs to be succinct.
>> > >> >>
>> > >> >
>> > >> > Ok, with that said, I think the tag-line should be more feature focused
>> > >> > because that can hook both the tech guys and business guys.
>> > >>
>> > >> Agreed
>> > >>
>> > >> > We also need to be careful just using the term "streams" because
>> > >> > really this isn't a generic stream processor (aka storm), our focus
>> > >> > is on Activity Streams. Maybe activity streams is a bad descriptor as
>> > >> > well and Activity Data might be better. "Real-time Processing for
>> > >> > Activity Data Streams"???
>> > >> >
>> > >>
>> > >> The engine actually doesn't care whether documents being processed are
>> > >> activity-related or not:
>> > >> any JVM object that Jackson can serialize and deserialize works just
>> > >> fine as a datum.
>> > >>
>> > >> I think we can acknowledge that the community has a bias toward
>> > >> ActivityStreams, but we shouldn't
>> > >> downplay the flexibility Streams provides.  Focusing only on activity
>> > >> data in project messaging
>> > >> undercuts the fact that Streams is a powerful, flexible ESB/ETL
>> > >> replacement.
>> > >>
>> > >
>> > > My 2-cents for what it's worth. If we don't focus on a niche this won't
>> > > take off. ESB/ETL systems are a dime-a-dozen and to be really good in that
>> > > space is a big endeavor. I'm not saying this system couldn't fill some of
>> > > those needs but I think it's a bad idea to be that broad.
>> > >
>> > >>
>> > >> >>
>> > >> >>
>> > >> >> >
>> > >> >> > >
>> > >> >> > > ?
>> > >> >> > >
>> > >> >> > > On Thu, Apr 17, 2014 at 8:26 AM, Matt Franklin <m.ben.franklin@gmail.com> wrote:
>> > >> >> > > > On Mon, Apr 14, 2014 at 5:22 PM, Renato Marroquín Mogrovejo <renatoj.marroquin@gmail.com> wrote:
>> > >> >> > > >
>> > >> >> > > >> Hi devs,
>> > >> >> > > >>
>> > >> >> > > >> Yeah the title was indeed compelling. You got me on that one lol
>> > >> >> > > >> I think that you guys are right saying that for attracting new people
>> > >> >> > > >> maybe we should try making the project's goal something more applicable
>> > >> >> > > >> in real life than just being "a Lightweight server for ActivityStreams".
>> > >> >> > > >> I liked the simple explanation I heard, maybe it was the pisco but please
>> > >> >> > > >> correct me if I am wrong, "it's an abstraction layer for stream processing
>> > >> >> > > >> engines". IMHO we have two things defined:
>> > >> >> > > >>
>> > >> >> > > >> MISSION:
>> > >> >> > > >> 1)  A flexible data processing framework that can run in multiple different
>> > >> >> > > >> runtimes.  The goal being to abstract platform complexity and allow for
>> > >> >> > > >> business logic reuse across real-time, enterprise, web and stand-alone
>> > >> >> > > >> executions.
>> > >> >> > > >>
>> > >> >> > > >> This is what needs to be done.
>> > >> >> > > >
>> > >> >> > > >
>> > >> >> > > >> VISION:
>> > >> >> > > >> 2)  As a proving ground for the adoption of data format standards,
>> > >> >> > > >> specifically ActivityStreams to start.  The community would work to drive
>> > >> >> > > >> the adoption and evolution of such standards through real-world experience.
>> > >> >> > > >>
>> > >> >> > > >> This is where we would like to get to at some point. But to get more of
>> > >> >> > > >> the community engaged, things also have to be simple. That is a big issue
>> > >> >> > > >> we still have over in Gora, and we are trying to solve it through talks,
>> > >> >> > > >> better tutorials, integration with other projects, and so forth.
>> > >> >> > > >> Just my 2cents guys.
>> > >> >> > > >>
>> > >> >> > > >
>> > >> >> > > > So what is the tag line that sums up both the mission and the vision?
>> > >> >> > > >
>> > >> >> > > >
>> > >> >> > > >>
>> > >> >> > > >>
>> > >> >> > > >> Renato M.
>> > >> >> > > >>
>> > >> >> > > >>
>> > >> >> > > >> 2014-04-14 16:31 GMT+02:00 Matt Franklin <m.ben.franklin@gmail.com>:
>> > >> >> > > >>
>> > >> >> > > >> > On Fri, Apr 11, 2014 at 5:01 PM, Steve Blackmon <sblackmon@apache.org> wrote:
>> > >> >> > > >> >
>> > >> >> > > >> > > On Thu, Apr 10, 2014 at 4:11 PM, Matt Franklin <m.ben.franklin@gmail.com> wrote:
>> > >> >> > > >> > > > tl;dr version:
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > We need to discuss things on the list more and work to define streams,
>> > >> >> > > >> > > > update our public presence to support this definition and encourage
>> > >> >> > > >> > > > additional engagement.
>> > >> >> > > >> > > >
>> > >> >> > > >> > > +1, +1, +1
>> > >> >> > > >> > >
>> > >> >> > > >> > > > Long version:
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > For those of you unaware, Steve Blackmon gave a nice talk on the work he
>> > >> >> > > >> > > > has been committing to Streams at ApacheCon.  As part of that talk and
>> > >> >> > > >> > > > follow on discussions, it became clear that we as a community need to do
>> > >> >> > > >> > > > some serious work to define ourselves, what we are building and why it is
>> > >> >> > > >> > > > valuable to the industry.
>> > >> >> > > >> > > >
>> > >> >> > > >> > > If anyone who missed the presentation wants to see it, I'm happy to
>> > >> >> > > >> > > host a google hangout to run through it.
>> > >> >> > > >> > >
>> > >> >> > > >> >
>> > >> >> > > >> > Can you post it, or a link to it, on the website too?
>> > >> >> > > >> >
>> > >> >> > > >> >
>> > >> >> > > >> > >
>> > >> >> > > >> > > > Our website says we are a Lightweight server for ActivityStreams.  While
>> > >> >> > > >> > > > this is true to some degree, I think recent contributions should refine
>> > >> >> > > >> > > > this.  The new code is really about supporting flexible processing,
>> > >> >> > > >> > > > persistence and retrieval of data in multiple runtimes using strongly
>> > >> >> > > >> > > > typed, normalized data formats like ActivityStreams.  Personally, I think
>> > >> >> > > >> > > > this slightly new direction is extremely compelling, and the reaction to
>> > >> >> > > >> > > > Steve's talk seems to support that.  The question remains how does the
>> > >> >> > > >> > > > community as a whole see the project?  What value is everyone wanting to
>> > >> >> > > >> > > > get out of this effort?
>> > >> >> > > >> > > >
>> > >> >> > > >> > > The session tag-line which attracted ~20 attendees was 'Simplifying
>> > >> >> > > >> > > Real-Time data integration with Apache Streams.' From talking to
>> > >> >> > > >> > > coders and data scientists I always hear frustration with how much
>> > >> >> > > >> > > time they spend writing code and workflow to move bytes around and
>> > >> >> > > >> > > keep track of their data assets. I'd wager any survey of prominent
>> > >> >> > > >> > > open-source libraries and popular commercial APIs would have to
>> > >> >> > > >> > > conclude that schema and interface standards are completely absent
>> > >> >> > > >> > > or sparsely adopted at many layers.
>> > >> >> > > >> > >
>> > >> >> > > >> > > Standards in hardware, operating systems, networks, and relational
>> > >> >> > > >> > > databases brought about flourishing ecosystems. I believe standards in
>> > >> >> > > >> > > data interchange such as ActivityStreams can do the same for the
>> > >> >> > > >> > > social web, but not everyone will embrace standards for the sake of
>> > >> >> > > >> > > standards. If we can offer integration points to the data sources and
>> > >> >> > > >> > > repositories businesses want to work with, and demonstrate that
>> > >> >> > > >> > > Streams can handle 'fire-hose' scale data volumes with arbitrarily
>> > >> >> > > >> > > many intermediate hand-offs and processing steps on messages in
>> > >> >> > > >> > > flight, I think we will see adoption from enterprises looking to
>> > >> >> > > >> > > replace ESB-type systems that can't keep up with the volume of data
>> > >> >> > > >> > > generated (both inside and outside their networks) that they want to
>> > >> >> > > >> > > track.  Streams is pretty decent at ETL as well - a function that is
>> > >> >> > > >> > > never going away, even as the underlying tools best suited to
>> > >> >> > > >> > > performing it at scale constantly change.
>> > >> >> > > >> > >
>> > >> >> > > >> > > This future-state I'm attempting to describe will be a better one for
>> > >> >> > > >> > > researchers, hobbyists, entrepreneurs, and consumers of web products
>> > >> >> > > >> > > and services.  Configuration-driven, runtime-platform agnostic,
>> > >> >> > > >> > > software for real-time data exchange:  where community-driven
>> > >> >> > > >> > > standards such as Activity Streams can codify and evolve
>> > >> >> > > >> > > best-practices via running code.  That is a vision that I think will
>> > >> >> > > >> > > help us generate significant traction going forward.
>> > >> >> > > >> > >
>> > >> >> > > >> >
>> > >> >> > > >> > Just to make sure I am understanding you correctly, you are proposing we
>> > >> >> > > >> > update the mission of the project to the following:
>> > >> >> > > >> >
>> > >> >> > > >> > 1)  A flexible data processing framework that can run in multiple different
>> > >> >> > > >> > runtimes.  The goal being to abstract platform complexity and allow for
>> > >> >> > > >> > business logic reuse across real-time, enterprise, web and stand-alone
>> > >> >> > > >> > executions.
>> > >> >> > > >> > 2)  As a proving ground for the adoption of data format standards,
>> > >> >> > > >> > specifically ActivityStreams to start.  The community would work to drive
>> > >> >> > > >> > the adoption and evolution of such standards through real-world experience.
>> > >> >> > > >> >
>> > >> >> > > >> > This sounds great, though it is slightly different than the initially
>> > >> >> > > >> > proposed functionality.  Personally, I have no objection to that, as what
>> > >> >> > > >> > you describe encompasses the original goals and expands on them; but, it
>> > >> >> > > >> > would be good for the rest of the community to weigh in.
>> > >> >> > > >> >
>> > >> >> > > >> >
>> > >> >> > > >> > >
>> > >> >> > > >> > > > The fact that there are not clear answers (and corresponding documented
>> > >> >> > > >> > > > statements on the website) to these questions already means we are not
>> > >> >> > > >> > > > doing a great job of following the Apache Way.  The Apache Way is about the
>> > >> >> > > >> > > > community and meritocratic, community-based decision making.  The ASF
>> > >> >> > > >> > > > defines it as follows:
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > While there is not an official list, these six principles have been cited
>> > >> >> > > >> > > > as the core beliefs of philosophy behind the foundation, which is normally
>> > >> >> > > >> > > > referred to as "The Apache Way":
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > collaborative software development
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > commercial-friendly standard license
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > consistently high quality software
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > respectful, honest, technical-based interaction
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > faithful implementation of standards
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > security as a mandatory feature
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > All of the ASF projects share these principles.
>> > >> >> > > >> > > >
>> > >> >> > > >> > > > Let's make sure we propose changes to the list, create tickets that support
>> > >> >> > > >> > > > wider efforts and leverage principles like lazy consensus to keep moving
>> > >> >> > > >> > > > forward in a way that supports the community.
>> > >> >> > > >> > > +1, +1, +1
>> > >> >> > > >> > >
>> > >> >> > > >> > >
>> > >> >> > > >> > >
>> > >> >> > > >> > >
>> > >> >> > > >> > > --
>> > >> >> > > >> > > Steve Blackmon
>> > >> >> > > >> > > sblackmon@apache.org
>> > >> >> > > >> > >
>> > >> >> > > >> >
>> > >> >> > > >>
>> > >> >> > >
>> > >> >> >
>> > >> >>
>> > >>
>> >
>
