annotator-dev mailing list archives

From TB Dinesh <din...@servelots.com>
Subject Re: DOM Iteration (was Re: Just a simple example?)
Date Tue, 16 May 2017 17:46:02 GMT
Sasha, thanks.
FYI: the demo works on Chrome but not on Firefox.

On Tue, May 16, 2017 at 12:49 AM, Sasha Goodman <email@sashagoodman.com> wrote:
> Here is a demo of simple annotation, thanks to Benjamin:
>
> https://predict-r.github.io/annotation-model/
>
>
> On Fri, May 12, 2017 at 12:19 PM Sasha Goodman <email@sashagoodman.com>
> wrote:
>
>> I would be delighted if my efforts were useful in this project!!!
>> Regarding that code, if any parts are used it would make my week. The class
>> structure is sorta self-documented by the standard, and combined with
>> the builder classes it can accommodate a variety of motives.
>>
>> Highlighting is the most common motive now (correct me if I'm wrong). My
>> gut-feeling is that to get the support and time of hard core annotators,
>> the code needs to accommodate the idiosyncrasies of highlighting first. For
>> example, if there are thousands of highlights on a page, an annotation
>> builder might iterate/walk the document just once and fill in the thousands
>> of highlights in one pass. Also, a highlighting app would probably need to
>> modify the source document by inserting spans and such.
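A minimal sketch of the single-pass idea above, assuming highlights arrive as character offsets into the page text (in the spirit of TextPositionSelector) and that `nodeTexts` stands in for the contents of the text nodes in document order. `resolveOffsets` and its return shape are hypothetical, not part of any library mentioned in this thread.

```javascript
// Resolve many character offsets to { nodeIndex, offset } pairs in one sweep.
// An offset that falls exactly on a node boundary resolves to the end of the
// earlier node; offsets past the end of the text resolve to null.
function resolveOffsets(nodeTexts, offsets) {
  // Sort a copy by offset so one forward pass serves every request,
  // remembering each request's original position for the result array.
  const order = offsets
    .map((offset, i) => ({ offset, i }))
    .sort((a, b) => a.offset - b.offset);
  const result = new Array(offsets.length).fill(null);

  let nodeIndex = 0;
  let consumed = 0; // characters in the nodes before nodeIndex

  for (const { offset, i } of order) {
    // Advance only forward; nodes already passed are never revisited.
    while (
      nodeIndex < nodeTexts.length &&
      consumed + nodeTexts[nodeIndex].length < offset
    ) {
      consumed += nodeTexts[nodeIndex].length;
      nodeIndex += 1;
    }
    if (nodeIndex < nodeTexts.length) {
      result[i] = { nodeIndex, offset: offset - consumed };
    }
  }
  return result;
}
```

Because the offsets are sorted first, the cost is one sort plus one walk over the text nodes, regardless of how many highlights the page carries.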
>>
>> If Randall needs familiar code for node iteration, tree walking, range
>> splitting, string similarity and normalization, that's cool! Custom code,
>> *especially* Polyfill type implementations, could smooth over browser
>> idiosyncrasies. Also, I saw a Jsperf.com microbenchmark that put custom
>> walkers on par with the native browser based ones.
>>
>> On a personal note, I do archival work and did not initially see the value
>> in modifying the source document by inserting spans (however, a highlight
>> app would need that). The main reason I'm excited about annotation is its
>> value for labeling data for text analysis and machine learning. A lot of
>> the advancements in machine learning are because of large bodies of data
>> that have been tagged. The most common examples are usually of images that
>> have regions selected and then labeled, but annotation could also help turn
>> semi-structured text into more structured text data (e.g. for labeling
>> parts of government documents). For archival work on mostly static
>> documents, there does not seem to be a need to modify the source document. On
>> the other hand, for dynamically changing documents, inserting spans with
>> unique IDs seems appropriate because it's more robust to document changes.
>> Yet, it is also vulnerable to turf battles with other extensions and the
>> page's own javascript, so I hope it's not a requirement of the Apache
>> library but rather a feature.
>>
>>
>> On Thu, May 11, 2017 at 1:43 PM Benjamin Young <byoung@bigbluehat.com>
>> wrote:
>>
>>> Exciting to see this conversation happening. ^_^
>>>
>>>
>>> Randall, how feasible would it be to bring your libraries (even via
>>> copy/paste) into the Apache Annotator repo soon? I believe (according to
>>> GitHub) you're author/owner of 90%+ of the code in them, and (consequently)
>>> able to do that if you believe that's the right step.
>>>
>>>
>>> Sasha, your classes modeled around the selector and a "builder" sound
>>> very similar to the hopes I wrote up in
>>> https://cwiki.apache.org/confluence/display/ANNO/Planning
>>>
>>>
>>> I'd very much like to combine these efforts in some way.
>>>
>>>
>>> Additionally--and the thing driving me personally at the moment--I have
>>> to present on Apache Annotator next Wednesday!
>>>
>>> https://apachecon2017.sched.com/event/AbBW
>>>
>>>
>>> Consequently, I'd very much love it if we (collectively) could build a
>>> demo together! There's plenty to talk about wrt annotation, community
>>> building, Web Annotation Data Model & Protocol, as well as why we (those of
>>> us that are here at least) have chosen to start collaborating at the ASF.
>>>
>>>
>>> At any rate, I plan to be coding on all the things leading up to
>>> Wednesday, so any help, input, pointers, and code (hehe) that anyone wants
>>> to toss in ahead of my codez, I'd be most grateful to code together!
>>>
>>>
>>> Thanks, all!
>>>
>>> Benjamin
>>>
>>> --
>>>
>>> http://bigbluehat.com/
>>>
>>> http://linkedin.com/in/benjaminyoung
>>>
>>> ________________________________
>>> From: Randall Leeds <randall@apache.org>
>>> Sent: Thursday, May 11, 2017 3:34:24 PM
>>> To: dev@annotator.incubator.apache.org
>>> Subject: DOM Iteration (was Re: Just a simple example?)
>>>
>>> Great to see you here, Sasha!
>>>
>>> On Wed, May 10, 2017 at 5:39 PM Sasha Goodman <email@sashagoodman.com>
>>> wrote:
>>>
>>> >
>>> > P.S. This afternoon I streamlined the TextQuoteSelector and
>>> > TextPositionSelector to work (in principle) consistently with Randall
>>> > Leeds's implementation that used NodeIterator and textContent.
>>> >
>>> >
>>> Neat :).
>>>
>>> I think my takeaway from the simple example thread, and something of which
>>> many of us were likely already well aware, is that there's a desire for a
>>> good highlighter implementation. A way to highlight text is often the
>>> first example people want to see.
>>>
>>> While I hope to see experimentation with implementations that try to limit
>>> the impact on the DOM, I think <mark> or <span> wrapping of text nodes is
>>> still the easiest to understand. In this approach, the actual wrapping is
>>> easy. The difficult part is iteration.
>>>
>>> Now, some quick background on node iteration.
>>>
>>> I chose to use NodeIterator rather than TreeWalker for my dom-seek library
>>> because it meant that the seek function could be stateless, support seeking
>>> forward and backward, and still be able to return the number of characters
>>> consumed by a seek. The desire to know whether to include the current
>>> node's content in the seek count is fulfilled by NodeIterator's
>>> "pointerBeforeReferenceNode". Essentially, a NodeIterator stores a point
>>> before or after a node, rather than simply a current node.
>>>
>>> However, using NodeIterator to traverse a Range is not really great. Since
>>> its referenceNode is read-only, the best that can be done is to start with
>>> the commonAncestorContainer of the Range. Range has compareNode,
>>> comparePoint, and isPointInRange. I have no idea how expensive these are.
>>> Iterating all the nodes under the commonAncestorContainer doesn't feel
>>> great to begin with. TreeWalker might be more appropriate since its
>>> currentNode could be set to startContainer directly. TreeWalker also
>>> appears to have consistent platform support.
>>>
>>> All of this is complicated by the Range being able to point to offsets
>>> within text nodes. For the purposes of highlighting with wrapper elements
>>> it's necessary to split the boundary nodes. I think there are probably a
>>> number of libraries for this, but I propose we write one under our repo.
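A minimal sketch of the boundary splitting described above, built on the standard `Text.splitText()` method. The name `splitBoundaries` and the plain object it returns are hypothetical. It reads the four boundary values up front and returns new boundary points instead of relying on how a live Range adjusts itself when nodes are split; a caller could pass the result to `range.setStart()` and `range.setEnd()`.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in a browser

// Split the text nodes at a range's boundaries so that every text node in
// the result is either fully inside or fully outside the range.
// Not handled in this sketch: a start offset at the very end of a text node
// (the node is left unsplit and still reported as the start container).
function splitBoundaries(range) {
  let { startContainer, startOffset, endContainer, endOffset } = range;

  // Split the end first, so the start offset stays valid when both
  // boundaries sit in the same text node.
  if (
    endContainer.nodeType === TEXT_NODE &&
    endOffset > 0 &&
    endOffset < endContainer.data.length
  ) {
    endContainer.splitText(endOffset);
  }

  if (
    startContainer.nodeType === TEXT_NODE &&
    startOffset > 0 &&
    startOffset < startContainer.data.length
  ) {
    const tail = startContainer.splitText(startOffset);
    if (endContainer === startContainer) {
      // The end boundary now lives in the new tail node.
      endContainer = tail;
      endOffset -= startOffset;
    }
    startContainer = tail;
    startOffset = 0;
  }

  return { startContainer, startOffset, endContainer, endOffset };
}
```

For a range covering characters 2 to 7 of a single text node "Hello world", this leaves three text nodes, "He", "llo w", and "orld", with both returned boundaries on the middle one.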
>>>
>>> We might also find that normalizing the endpoints of a Range in some
>>> fashion is a helpful prerequisite. There is a library I found that does
>>> this, but I found its algorithm terribly confusing. I put time into
>>> rewriting it without dependencies. Despite some initial excitement, the
>>> author never fully vetted and accepted my pull request:
>>> https://github.com/webmodules/range-normalize/pull/2
>>>
>>> In conclusion, I think there'd be value in bringing some functional
>>> utilities into Apache Annotator for dealing with iteration, range
>>> splitting, and range normalization, with the goal of providing a very
>>> succinct and simple highlighter that looks like this:
>>>
>>> ```
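>>> // textNodes(range) is the proposed utility; it does not exist yet.
>>> for (const node of textNodes(range)) {
>>>   const mark = document.createElement('mark');
>>>   // Put the <mark> where the text node was, then move the node inside it.
>>>   node.replaceWith(mark);
>>>   mark.appendChild(node);
>>> }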
>>> ```
>>>
>>> Some care needs to be taken that whatever iteration we use is not
>>> invalidated by the replacement of the text node with its wrapper.
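One way to get an iteration that survives the wrapping is to collect the text nodes before handing any of them out. The sketch below is a hypothetical `textNodes`, not the utility the thread goes on to build. It swaps in a hand-rolled walk over `firstChild`, `nextSibling`, and `parentNode` in place of the native NodeIterator or TreeWalker, and it assumes the range boundaries have already been split so that `startContainer` and `endContainer` are text nodes lying entirely inside the range.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in a browser

// Next node in document order, staying inside `root`; null when exhausted.
function nextNode(node, root) {
  if (node.firstChild) return node.firstChild;
  while (node && node !== root) {
    if (node.nextSibling) return node.nextSibling;
    node = node.parentNode;
  }
  return null;
}

// Hypothetical helper: the text nodes from range.startContainer through
// range.endContainer, inclusive, in document order.
function* textNodes(range) {
  const root = range.commonAncestorContainer;
  const found = [];
  let node = range.startContainer;
  while (node) {
    if (node.nodeType === TEXT_NODE) found.push(node);
    if (node === range.endContainer) break;
    node = nextNode(node, root);
  }
  // Yield from the snapshot: the walk is finished before the caller sees
  // the first node, so re-parenting a yielded node cannot derail it.
  yield* found;
}
```

The loop in the message above should run against this unchanged, at the cost of holding every matching node in an array for the duration of the loop.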
>>>
>>> The fact that a simple example like this is hard to produce is evidence of
>>> the underlying complexity described in the above paragraphs. When I see
>>> people wanting a simple highlighter what I hear is that they actually need
>>> simple abstractions upon which to build a highlighter. The highlighter
>>> itself should be easy. Often, highlighters that projects provide are not
>>> shipped standalone or don't do exactly what the author needs (use spans
>>> instead of marks, add a particular class, coalesce overlapping highlights
>>> or not, etc). There is lots of room to do different things but being able
>>> to simply get the nodes to be highlighted is the prerequisite task that
>>> contains most of the complexity.
>>>
>>> That's all (and probably way too much) for now. Finding all the tools for
>>> all these things is enough of a pain that I think we should have a
>>> comprehensive set of such utilities in Apache Annotator, even if that risks
>>> looking like a bit of NIH syndrome.
>>>
>>> Unless anyone objects, I think I'll aim to ship libraries for these:
>>> - Node iteration (https://github.com/tilgovi/dom-node-iterator)
>>> - Tree walking (might not need a library if support is good)
>>> - Range splitting
>>> - Range normalization (see my pull request, referenced above)
>>> - Range iterating
>>> - Text distance (https://github.com/tilgovi/dom-seek)
>>>
>>> If anyone wants to start on any of the above, you're welcome to depend on
>>> libraries that are outside Apache Annotator. In the case of libraries that
>>> I've written, there is value to bringing them into Apache Annotator because
>>> they are all written in ES6 but not packaged to be consumed as ES6.
>>> Bringing them inside our repo means better code deduplication by tree
>>> shaking in tools like rollup and webpack. They could be packaged as ES6
>>> where they are, but if I'm going to spend time improving the packaging I
>>> would rather just toss out the packaging and get the benefits of the
>>> monorepo having all that build/test boilerplate done once for all of them.
>>>
>>
