annotator-dev mailing list archives

From Sasha Goodman <em...@sashagoodman.com>
Subject Re: DOM Iteration (was Re: Just a simple example?)
Date Tue, 16 May 2017 21:26:16 GMT
Thanks, TB Dinesh, for finding the bug. I just fixed this, at least on my
Firefox, so hopefully the GitHub Pages content provider has updated the
files. You might need to hit refresh or wait a few minutes. The issue was
that Firefox has a stricter implementation of document.evaluate, in which no
default value is provided for the fifth argument. The code also seems to
work on my Safari.
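
For anyone hitting the same Firefox error, here is a minimal sketch of a call that avoids it, assuming the fix amounts to passing all five arguments to document.evaluate explicitly. The helper name evaluateXPath is hypothetical and is not the demo's actual code. The numeric constant stands in for XPathResult.ORDERED_NODE_SNAPSHOT_TYPE so the function does not rely on browser globals.

```javascript
// Hypothetical helper (not taken from the demo): call document.evaluate
// with every argument spelled out, so a browser that supplies no defaults
// for the trailing arguments still accepts the call.
function evaluateXPath(doc, expression, contextNode = doc) {
  const ORDERED_NODE_SNAPSHOT_TYPE = 7; // XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
  const snapshot = doc.evaluate(
    expression,                 // 1: the XPath expression
    contextNode,                // 2: node the expression is evaluated against
    null,                       // 3: namespace resolver (none needed here)
    ORDERED_NODE_SNAPSHOT_TYPE, // 4: requested result type
    null                        // 5: existing XPathResult to reuse, or null
  );
  // Copy the snapshot into a plain array for easier use.
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

In a browser, evaluateXPath(document, '//p') would return the page's p elements as an array.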

On Tue, May 16, 2017 at 10:46 AM TB Dinesh <dinesh@servelots.com> wrote:

> Sasha. Thanks.
> Fyi. Demo works on Chrome. Not on Firefox.
>
> On Tue, May 16, 2017 at 12:49 AM, Sasha Goodman <email@sashagoodman.com>
> wrote:
> > Here is a demo of simple annotation, thanks to Benjamin:
> >
> > https://predict-r.github.io/annotation-model/
> >
> >
> > On Fri, May 12, 2017 at 12:19 PM Sasha Goodman <email@sashagoodman.com>
> > wrote:
> >
> >> I would be delighted if my efforts were useful in this project!!!
> >> Regarding that code, if any parts are used it would make my week. The class
> >> structure is sorta self-documented by the standard, and combined with
> >> builders, the classes can accommodate a variety of motives.
> >>
> >> Highlighting is the most common motive now (correct me if I'm wrong). My
> >> gut-feeling is that to get the support and time of hard core annotators,
> >> the code needs to accommodate the idiosyncrasies of highlighting first.
> For
> >> example, if there are thousands of highlights on a page, an annotation
> >> builder might iterate/walk the document just once and fill in the
> thousands
> >> of highlights in one pass. Also, a highlighting app would probably need
> to
> >> modify the source document by inserting spans and such.
> >>
> >> If Randall needs familiar code for node iteration, tree walking, range
> >> splitting, string similarity and normalization, that's cool! Custom
> code,
> >> *especially* Polyfill type implementations, could smooth over browser
> >> idiosyncrasies. Also, I saw a Jsperf.com microbenchmark that put custom
> >> walkers on par with the native browser based ones.
> >>
> >> On a personal note, I do archival work and did not initially see the
> value
> >> in modifying the source document by inserting spans (however, a
> highlight
> >> app would need that). The main reason I'm excited about annotation is
> its
> >> value for labeling data for text analysis and machine learning. A lot of
> >> the advancements in machine learning are because of large bodies of data
> >> that have been tagged. The most common examples are usually of images
> that
> >> have regions selected and then labeled, but annotation could also help
> turn
> >> semi-structured text into more structured text data (e.g. for labeling
> >> parts of government documents). For archival work on mostly static
> >> documents, there does not seem to be a need to modify source document.
> On
> >> the other hand, for dynamically changing documents, inserting spans with
> >> unique IDs seems appropriate because it's more robust to document changes.
> >> Yet, it is also vulnerable to turf battles with other extensions and the
> >> page's own javascript, so I hope it's not a requirement of the Apache
> >> library but rather a feature.
> >>
> >>
> >> On Thu, May 11, 2017 at 1:43 PM Benjamin Young <byoung@bigbluehat.com>
> >> wrote:
> >>
> >>> Exciting to see this conversation happening. ^_^
> >>>
> >>>
> >>> Randall, how feasible would it be to bring (soon) your libraries (even
> >>> via copy/paste) into the Apache Annotator repo. I believe (according to
> >>> GitHub) you're author/owner of 90%+ of the code in them, and
> (consequently)
> >>> able to do that if you believe that's the right step.
> >>>
> >>>
> >>> Sasha, your classes modeled around the selector and a "builder" sound
> >>> very similar to the hopes I wrote up in
> >>> https://cwiki.apache.org/confluence/display/ANNO/Planning
> >>>
> >>>
> >>> I'd very much like to combine these efforts in some way.
> >>>
> >>>
> >>> Additionally--and the thing driving me personally at the moment--I have
> >>> to present on Apache Annotator next Wednesday!
> >>>
> >>> https://apachecon2017.sched.com/event/AbBW
> >>>
> >>>
> >>> Consequently, I'd very much love it if we (collectively) could build a
> >>> demo together! There's plenty to talk about wrt annotation, community
> >>> building, Web Annotation Data Model & Protocol, as well as why (those
> of us
> >>> that are here at least) have chosen to start collaborating at the ASF.
> >>>
> >>>
> >>> At any rate, I plan to be coding on all the things leading up to
> >>> Wednesday, so any help, input, pointers, and code (hehe) that anyone
> wants
> >>> to toss in ahead of my codez, I'd be most grateful to code together!
> >>>
> >>>
> >>> Thanks, all!
> >>>
> >>> Benjamin
> >>>
> >>> --
> >>>
> >>> http://bigbluehat.com/
> >>>
> >>> http://linkedin.com/in/benjaminyoung
> >>>
> >>> ________________________________
> >>> From: Randall Leeds <randall@apache.org>
> >>> Sent: Thursday, May 11, 2017 3:34:24 PM
> >>> To: dev@annotator.incubator.apache.org
> >>> Subject: DOM Iteration (was Re: Just a simple example?)
> >>>
> >>> Great to see you here, Sasha!
> >>>
> >>> On Wed, May 10, 2017 at 5:39 PM Sasha Goodman <email@sashagoodman.com>
> >>> wrote:
> >>>
> >>> >
> >>> > P.S. This afternoon I streamlined the TextQuoteSelector and
> >>> > TextPositionSelector to work (in principle) consistently with Randall
> >>> > Leeds's implementation that used NodeIterator and textContent.
> >>> >
> >>> >
> >>> Neat :).
> >>>
> >>> I think my takeaway from the simple example thread, and something of
> which
> >>> many of us were likely already well aware, is that there's a desire
> for a
> >>> good highlighter implementation. A way to highlight text is often the
> >>> first
> >>> example people want to see.
> >>>
> >>> While I hope to see experimentation with implementations that try to
> limit
> >>> the impact on the DOM, I think <mark> or <span> wrapping of text nodes is
> >>> still the easiest to understand. In this approach, the actual wrapping
> is
> >>> easy. The difficult part is iteration.
> >>>
> >>> Now, some quick background on node iteration.
> >>>
> >>> I chose to use NodeIterator rather than TreeWalker for my dom-seek
> library
> >>> because it meant that the seek function could be stateless, support
> >>> seeking
> >>> forward and backward, and still be able to return the number of
> characters
> >>> consumed by a seek. The desire to know whether to include the current
> >>> node's content in the seek count is fulfilled by NodeIterator's
> >>> "pointerBeforeReferenceNode". Essentially, a NodeIterator stores a
> point
> >>> before or after a node, rather than simply a current node.
> >>>
> >>> However, using NodeIterator to traverse a Range is not really great. Since
> >>> its referenceNode is read-only, the best that can be done is to start with
> >>> the commonAncestorContainer of the Range. Range has compareNode,
> >>> comparePoint, and isPointInRange. I have no idea how expensive these
> are.
> >>> Iterating all the nodes under the commonAncestorContainer doesn't feel
> >>> great to begin with. TreeWalker might be more appropriate since its
> >>> currentNode could be set to startContainer directly. TreeWalker also
> >>> appears to have consistent platform support.
> >>>
> >>> All of this is complicated by the Range being able to point to offsets
> >>> within text nodes. For the purposes of highlighting with wrapper
> elements
> >>> it's necessary to split the boundary nodes. I think there are probably
> a
> >>> number of libraries for this, but I propose we write one under our
> repo.
> >>>
> >>> We might also find that normalizing the endpoints of a Range in some
> >>> fashion is a helpful prerequisite. There is a library I found that does
> >>> this, but I found its algorithm terribly confusing. I put time into
> >>> rewriting it without dependencies. Despite some initial excitement, the
> >>> author never fully vetted and accepted my pull request:
> >>> https://github.com/webmodules/range-normalize/pull/2
> >>>
> >>> In conclusion, I think there'd be value in bringing some functional
> >>> utilities into Apache Annotator for dealing with iteration, range
> >>> splitting, and range normalization, with the goal of providing a very
> >>> succinct and simple highlighter that looks like this:
> >>>
> >>> ```
> >>> for (const node of textNodes(range)) {
> >>>   const mark = document.createElement('mark');
> >>>   node.replaceWith(mark);
> >>>   mark.appendChild(node);
> >>> }
> >>> ```
> >>>
> >>> Some care needs to be taken that whatever iteration we use is not
> >>> invalidated by the replacement of the text node with its wrapper.
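
One way to sidestep that invalidation, sketched under the assumption that textNodes(range) (still hypothetical in this thread) returns some iterable of text nodes: copy the nodes into an array before touching the DOM, then wrap each one. The name highlightNodes and the tagName parameter are illustrative only and do not come from any existing library.

```javascript
// Sketch: wrap each text node in a highlight element without mutating
// the DOM while the node iteration is still in progress.
function highlightNodes(nodes, tagName = 'mark') {
  // Exhaust the iterable first, so that a live iterator or tree walker
  // is never advanced after the tree has been changed underneath it.
  const snapshot = Array.from(nodes);
  return snapshot.map((node) => {
    const wrapper = node.ownerDocument.createElement(tagName);
    node.replaceWith(wrapper);  // put the wrapper where the text node was
    wrapper.appendChild(node);  // then move the text node inside it
    return wrapper;             // returned so the caller can undo later
  });
}
```

Returning the wrappers leaves choices such as adding a class, or unwrapping later, to the caller rather than to the highlighter.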
> >>>
> >>> The fact that a simple example like this is hard to produce is
> evidence of
> >>> the underlying complexity described in the above paragraphs. When I see
> >>> people wanting a simple highlighter what I hear is that they actually
> need
> >>> simple abstractions upon which to build a highlighter. The highlighter
> >>> itself should be easy. Often, highlighters that projects provide are
> not
> >>> shipped standalone or don't do exactly what the author needs (use spans
> >>> instead of marks, add a particular class, coalesce overlapping
> highlights
> >>> or not, etc). There is lots of room to do different things but being
> able
> >>> to simply get the nodes to be highlighted is the prerequisite task that
> >>> contains most of the complexity.
> >>>
> >>> That's all (and probably way too much) for now. Finding all the tools
> for
> >>> all these things is a pain enough that I think we should have a
> >>> comprehensive set of such utilities in Apache Annotator, even if that
> >>> risks
> >>> looking like a bit of NIH syndrome.
> >>>
> >>> Unless anyone objects, I think I'll aim to ship libraries for these:
> >>> - Node iteration (https://github.com/tilgovi/dom-node-iterator)
> >>> - Tree walking (might not need a library if support is good)
> >>> - Range splitting
> >>> - Range normalization (see my pull request reference, above)
> >>> - Range iterating
> >>> - Text distance (https://github.com/tilgovi/dom-seek)
> >>>
> >>> If anyone wants to start on any of the above, you're welcome to depend
> on
> >>> libraries that are outside Apache Annotator. In the case of libraries
> that
> >>> I've written, there is value to bringing them into Apache Annotator
> >>> because
> >>> they are all written in ES6 but not packaged to be consumed as ES6.
> >>> Bringing them inside our repo means better code deduplication by tree
> >>> shaking in tools like rollup and webpack. They could be packaged as ES6
> >>> where they are, but if I'm going to spend time improving the packaging
> I
> >>> would rather just toss out the packaging and get the benefits of the
> >>> monorepo having all that build/test boilerplate done once for all of
> them.
> >>>
> >>
>
