any23-dev mailing list archives

From Michele Mostarda <michele.mosta...@gmail.com>
Subject Re: Context Aware Extraction
Date Tue, 10 Jun 2014 08:30:03 GMT
Hi Folks,
  Any23 is already prepared for extraction context; see
org.apache.any23.extractor.ExtractionResult
org.apache.any23.extractor.ExtractionContext
The main extractors already add information about the position of
extracted triples in pages and the nesting relationship of extracted
subgraphs.
Most of this information can currently be accessed by activating the
Annotate flag, which includes it in the output as comments.
The missing part is including this information in the rendered RDF itself.
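
As a rough illustration of what carrying this information in the RDF output could look like, the sketch below serializes one extracted triple plus a small context description as N-Quads. It is plain Python, not Any23 code: the vocabulary namespace and the helper names (`nquad`, `literal`, `with_context`) are hypothetical, and the quad layout simply follows option B from Szymon's message quoted below.

```python
# Hypothetical vocabulary namespace; not part of Any23 or any published ontology.
VOCAB = "http://example.org/vocab#"


def literal(value):
    """Format a plain string as an N-Quads literal, escaping \\, ", LF and CR."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def nquad(subject, predicate, obj, graph):
    """Serialize one statement as an N-Quads line.

    `subject`, `predicate` and `graph` are bare IRIs; `obj` must already be
    formatted as either <iri> or a quoted literal.
    """
    return f"<{subject}> <{predicate}> {obj} <{graph}> ."


def with_context(subject, predicate, obj, page, found_inside):
    """Return the original statement plus two statements describing its context.

    The page URL is used as the graph, and the context node is minted as
    <page#context>, as in option B of the quoted message.
    """
    context = f"{page}#context"
    return [
        nquad(subject, predicate, obj, page),
        nquad(subject, VOCAB + "hasContext", f"<{context}>", page),
        nquad(context, VOCAB + "foundInside", literal(found_inside), page),
    ]


if __name__ == "__main__":
    for line in with_context(
        "http://example.org/s",
        "http://example.org/p",
        "<http://example.org/o>",
        "http://example.org/path",
        "html/head",
    ):
        print(line)
```

Note that, as in option B, the context is attached to the subject rather than to the individual statement, so two triples with the same subject found in different parts of the page would need distinct context nodes.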
Best
Michele



On 6 June 2014 15:30, Giovanni Tummarello <g.tummarello@gmail.com> wrote:

> the main motivation for this is to make sure the data is really relevant and
> to put HTML elements (e.g. as in scraping) together with metadata.
>
> Sometimes one has a metadata description (e.g. name) but not the phone
> number, which is only in the HTML.
>
> how about a JSON output that includes a configurably large "surrounding" HTML
> fragment but also the triples, e.g. in JSON-LD, standardized/normalized as
> much as possible?
>
> I think this could be useful to better understand web pages, but you're
> right with your point 3): I personally don't have any specific need just now,
> so I wouldn't feel like pushing for development in this direction just yet
>
> Gio
>
>
> On Fri, Jun 6, 2014 at 11:51 AM, Szymon Danielczyk <
> danielczyk.szymon@gmail.com> wrote:
>
> > Hi Lewis, Guys
> >
> > Just to understand this better. Does this mean that if some info was
> > extracted from
> >
> > http://example.org/path  let's say from the head section of the page
> >
> >
> > A)
> >
> > the graph part becomes
> >
> > <http://example.org/path#head>
> >
> > but if it came from, let's say, an HTML5 "article" tag it will be
> >
> > <http://example.org/path#article>
> >
> > B)
> > Or it is more like
> >
> > <s> <p> <o> <http://example.org/path> .
> > <s> <hasContext> <http://example.org/path#context> <http://example.org/path> .
> > <http://example.org/path#context> <foundInside> "html/head" <http://example.org/path> .
> > <http://example.org/path#context> <foundAtDate> "01-May-2014" <http://example.org/path> .
> > <http://example.org/path#context> <foundBy> "...." <http://example.org/path> .
> > etc ..
> >
> >
> > I would like to ask:
> >
> > 1) Were you thinking more of approach A or B?
> >
> > 2) What tags will this feature support? Maybe some subset like body and head
> > plus some of the new HTML5 ones (article, aside, header, footer, etc.)?
> > Or maybe you thought of giving the full XPath to the section, like
> > "html/body/article/div[1]"?
> >
> > 3) Have you guys thought about a practical use case already? How could this
> > information be useful to someone?
> >
> > Cheers
> > Szymon
> >
> > On 6 June 2014 00:35, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
> > wrote:
> >
> > > Hi Folks,
> > > Giovanni and I were recently discussing the concept of context-aware
> > > triples extraction. An example of this would be recording 'where' the
> > > triples came from (within the web page) as well as the triple itself.
> > > This of course bears a close resemblance to N-Quads, except that we
> > > substitute the additional graph constituent with the 'context' one
> > > suggested above.
> > > Does anyone have comments and/or suggestions on how we could implement a
> > > context-aware extractor model/API on top of what we currently have?
> > > Lewis
> > >
> > > --
> > > *Lewis*
> > >
> >
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com
