lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)
Date Mon, 28 May 2018 19:32:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492897#comment-16492897 ]

David Smiley commented on SOLR-12376:
-------------------------------------

Updated patch to use the new ConcatenateGraphFilterFactory (CGFF), which is a WIP and not yet
committed (see LUCENE-8332). CGFF supports synonyms and other filters that produce stacked tokens
at indexing time. This is _very_ useful for the tagger!
* I added a test for this, testWDF, which checks that WordDelimiterGraphFilter works with catenation
options.
* Partial tagging (via shingling) is no longer easily supported, so I commented it out. The
difficulty is in configuring the separator char (CGFF doesn't make this configurable).
This feature is probably dubious anyway.

Added docs, an amalgamation of the SolrTextTagger's existing README and QUICK_START files, hand-edited
somewhat. I verified the tutorial instructions. I added a bin/post version of sending the CSV.
That was a bit of a pain to figure out.

At this point it's ready but pending LUCENE-8332.  

> New TaggerRequestHandler (aka SolrTextTagger)
> ---------------------------------------------
>
>                 Key: SOLR-12376
>                 URL: https://issues.apache.org/jira/browse/SOLR-12376
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the SolrTextTagger
from the OpenSextant project [https://github.com/OpenSextant/SolrTextTagger]. It's used for
named entity recognition (NER) of text passed to it. It doesn't do any NLP (outside of Lucene
text analysis) so it's said to be a "naive tagger", but it's definitely useful as-is and a
more complete NER or ERD (entity recognition and disambiguation) system can be built with
this as a key component. The SolrTextTagger has been used on queries for query understanding,
on full text, and with dictionaries that number in the tens of millions. Since it's small
and has seen a lot of use (including helping win an ERD competition
and in [Apache Stanbol|https://stanbol.apache.org/]), several people have asked me when it
will be in Solr, or why it isn't already. So here it is.
> To use it, first you need a collection of documents that have a name-like field (short
text) indexed with the ConcatenateFilter (LUCENE-8323) at the end. We call this the dictionary.
Once that's in place, you simply post text to a {{TaggerRequestHandler}} and it returns the
offset pairs into that text for matches in the dictionary along with the uniqueKey of the
matching documents. It can also return other document data desired. That's the gist; I'll
add more details on use to the Solr Reference Guide.
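
The request/response flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the handler's documented API: the `/tag` endpoint path, the `geonames` collection name, the `fl` and `wt` parameters, and the response shape (a `tags` list of flat key/value arrays holding `startOffset`, `endOffset` and `ids`) are all assumed here and are not specified in this message. The sample ID is made up.

```python
import json
import urllib.parse
import urllib.request


def build_tag_request(base_url, collection, text, fields=("id",)):
    """Build (but do not send) a POST of raw text to an assumed /tag handler."""
    params = urllib.parse.urlencode({"fl": ",".join(fields), "wt": "json"})
    url = f"{base_url}/{collection}/tag?{params}"  # "/tag" is an assumed path
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def parse_tags(response, text):
    """Convert assumed flat [key, value, ...] tag arrays into dicts.

    Each result also carries the substring of `text` covered by the
    offset pair, so the match can be checked by eye.
    """
    tags = []
    for flat in response.get("tags", []):
        tag = dict(zip(flat[0::2], flat[1::2]))
        tag["match"] = text[tag["startOffset"]:tag["endOffset"]]
        tags.append(tag)
    return tags


if __name__ == "__main__":
    text = "Hello New York City"
    # Hypothetical response body; a real call would instead be:
    #   json.load(urllib.request.urlopen(build_tag_request(...)))
    sample = json.loads(
        '{"tags": [["startOffset", 6, "endOffset", 19, "ids", ["5128581"]]]}'
    )
    for tag in parse_tags(sample, text):
        print(tag["match"], tag["ids"])
```

Keeping request construction separate from response parsing lets the parsing be exercised without a running Solr instance. Note that the offsets come from Java, so text containing characters outside the Basic Multilingual Plane would need offset conversion before slicing a Python string.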



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

