hivemall-issues mailing list archives

From "Takuya Kitazawa (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (HIVEMALL-146) Implement yet another UDF to generate n-grams from a list of words
Date Wed, 04 Oct 2017 03:18:00 GMT

     [ https://issues.apache.org/jira/browse/HIVEMALL-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Kitazawa closed HIVEMALL-146.
------------------------------------
    Resolution: Done

> Implement yet another UDF to generate n-grams from a list of words
> ------------------------------------------------------------------
>
>                 Key: HIVEMALL-146
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-146
>             Project: Hivemall
>          Issue Type: New Feature
>            Reporter: Takuya Kitazawa
>            Assignee: Takuya Kitazawa
>
> Hive has an {{ngrams()}} function to obtain n-grams from a list of words: https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-ngrams()andcontext_ngrams():N-gramfrequencyestimation
> While the existing function returns an "estimated" top-k list of frequent n-grams, NLP applications sometimes need an "exact" list of n-grams that includes all of the 1-, 2-, ..., n-grams. For example, for the input \["machine", "learning"\], we might expect the following result: \["machine", "learning", "machine learning"\].
> Hence, this ticket requests implementing yet another UDF similar to {{ngrams()}}. The implementation could be similar to {{getNgrams()}} in the Stanford CoreNLP library: https://github.com/stanfordnlp/CoreNLP/blob/d6318a0cb06dba635550477bc843952cc5a5f868/src/edu/stanford/nlp/util/StringUtils.java#L2132-L2142
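
The requested behavior can be illustrated with a minimal Python sketch. This is not the Hivemall UDF or the CoreNLP code; the function name exact_ngrams and the parameters min_n and max_n are hypothetical, chosen only for illustration. The output order (all 1-grams first, then 2-grams, and so on) is an assumption made so that the result matches the example given in the ticket.

```python
from typing import List


def exact_ngrams(words: List[str], min_n: int = 1, max_n: int = 2) -> List[str]:
    """Return every n-gram of `words` for min_n <= n <= max_n.

    Hypothetical helper: n-grams are emitted in order of increasing n,
    and each n-gram is joined with a single space.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")

    result: List[str] = []
    for n in range(min_n, max_n + 1):
        # When n exceeds len(words), the range below is empty,
        # so no n-grams of that size are produced.
        for start in range(len(words) - n + 1):
            result.append(" ".join(words[start:start + n]))
    return result


print(exact_ngrams(["machine", "learning"]))
```

Unlike a top-k estimate, this enumerates every contiguous window exactly once, so a list of m words yields m - n + 1 n-grams for each size n that fits.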



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
