hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/StatisticsAndDataMining" by MayankLahiri
Date Thu, 19 Aug 2010 20:19:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/StatisticsAndDataMining" page has been changed by MayankLahiri.


New page:
= Statistics and Data Mining in Hive =

This page is a central repository for the slightly more advanced statistical and data mining
functions that are being integrated into Hive, and especially the functions that warrant more
than one-line descriptions. 


== ngrams() and context_ngrams(): N-gram frequency estimation ==

[[http://en.wikipedia.org/wiki/N-gram|N-grams]] are subsequences of length '''N''' drawn from
a longer sequence. The purpose of the `ngrams()` UDAF is to find the `k` most frequent n-grams
from one or more sequences. It can be used in conjunction with the `sentences()` UDF to analyze
unstructured natural language text, or the `collect()` function to analyze more general string

Contextual n-grams are similar to n-grams, but allow you to specify a 'context' string around
which n-grams are to be estimated. For example, you can specify that you're only interested
in finding the most common two-word phrases in text that follow the context "I love". You
could achieve the same result by manually stripping sentences of non-contextual content and
then passing them to `ngrams()`, but `context_ngrams()` makes it much easier.
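
The relationship between the two functions can be sketched in plain Python. This is only an illustration of the idea, not the Hive implementation: it counts exactly rather than estimating, and the helper names `top_ngrams` and `strip_to_context`, as well as the sample sentences, are hypothetical.

```python
from collections import Counter

def top_ngrams(sentences, n, k):
    """Return the k most frequent n-grams as (ngram, count) pairs (exact counts)."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

def strip_to_context(sentences, context, n):
    """Keep only the n words that follow each occurrence of `context`."""
    stripped = []
    c = len(context)
    for words in sentences:
        for i in range(len(words) - c - n + 1):
            if words[i:i + c] == context:
                stripped.append(words[i + c:i + c + n])
    return stripped

# Hypothetical, already tokenized and lowercased input.
sentences = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "york", "pizza"],
    ["we", "love", "new", "york"],
    ["i", "love", "hive", "queries"],
]

# Plain bigram counts over all of the text.
plain = top_ngrams(sentences, 2, 3)

# The manual route: strip everything except the two words after "i love",
# then count bigrams in what is left.
after_context = top_ngrams(strip_to_context(sentences, ["i", "love"], 2), 2, 3)
```

Here `after_context` only counts two-word phrases that directly follow "i love", which is the manual stripping step that `context_ngrams()` is meant to save you.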

=== Use Cases ===

 1. (ngrams) Find important topics in text in conjunction with a stopword list.
 2. (ngrams) Find trending topics in text.
 3. (context_ngrams) Extract marketing intelligence around certain words (e.g., "Twitter is
 4. (ngrams) Find frequently accessed URL sequences.
 5. (context_ngrams) Find frequently accessed URL sequences that start or end at a particular
 6. (context_ngrams) Pre-compute common search lookaheads.

=== Usage ===

SELECT ngrams(sentences(lower(tweet)), 2, 100 [, 1000]) FROM twitter;

The command above returns the 100 most frequent bigrams (2-grams) from a hypothetical table called
`twitter`. The `tweet` column is assumed to contain a string of arbitrary, possibly meaningless,
text. The `lower()` UDF first converts the text to lowercase for standardization, and
`sentences()` then splits the text into arrays of words. The optional fourth argument is the
'''precision factor''', which controls the tradeoff between memory usage and accuracy in frequency
estimation. Higher values are more accurate, but could crash the JVM with an OutOfMemory
error. If omitted, sensible defaults are used.
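
As a rough picture of the preprocessing step, the Python sketch below lowercases a string and splits it into sentences, each represented as a list of words. It is an approximation for illustration only: the real `sentences()` UDF may choose sentence and word boundaries differently, and the regular expressions and sample text here are assumptions.

```python
import re

def sentences(text):
    """Rough stand-in for sentences(): split text into sentences, then into words."""
    result = []
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    for chunk in re.split(r"[.!?]+", text):
        # Assumption: a word is a run of letters, digits or apostrophes.
        words = re.findall(r"[A-Za-z0-9']+", chunk)
        if words:
            result.append(words)
    return result

tweet = "I love Hive! Do you love Hive too?"  # hypothetical sample text
tokens = sentences(tweet.lower())
```

Each inner list is one sequence from which n-grams are drawn, so in this sketch no n-gram spans a sentence boundary.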

SELECT context_ngrams(sentences(lower(tweet)), array("i","love",null), 100 [, 1000]) FROM

The command above will return a list of the top 100 words that follow the phrase "i love"
in a hypothetical database of Twitter tweets. Each `null` specifies the position of an n-gram
component to estimate; therefore, every query must contain at least one `null` in the context

Note that the following two queries are equivalent (they return the same result), but `ngrams()`
will be slightly faster in practice.

SELECT ngrams(sentences(lower(tweet)), 2, 100 [, 1000]) FROM twitter;
SELECT context_ngrams(sentences(lower(tweet)), array(null,null), 100 [, 1000]) FROM twitter;
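
The role of `null` and the equivalence of the two queries can be illustrated with a small Python sketch. It assumes that a context array is matched against every window of the same length, that non-null entries must match exactly, and that only the words at `null` positions (`None` here) are counted. The function names and sample data are hypothetical, and the counts are exact rather than estimated.

```python
from collections import Counter

def top_ngrams(sentences, n, k):
    """Exact top-k n-gram counts."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

def top_context_ngrams(sentences, pattern, k):
    """Slide `pattern` over each sentence; None slots are the positions counted."""
    counts = Counter()
    size = len(pattern)
    for words in sentences:
        for i in range(len(words) - size + 1):
            window = words[i:i + size]
            # Every non-None entry of the pattern must match the window exactly.
            if all(p is None or p == w for p, w in zip(pattern, window)):
                key = tuple(w for p, w in zip(pattern, window) if p is None)
                counts[key] += 1
    return counts.most_common(k)

# Hypothetical, already tokenized and lowercased input.
sentences = [
    ["i", "love", "hive"],
    ["i", "love", "hive", "queries"],
    ["i", "love", "hadoop"],
]

# Words that follow "i love".
followers = top_context_ngrams(sentences, ["i", "love", None], 10)

# A context made only of None slots behaves like plain bigram counting.
same = dict(top_context_ngrams(sentences, [None, None], 10)) == dict(top_ngrams(sentences, 2, 10))
```

In this sketch the all-`None` pattern does the same counting as the plain version but also runs a match test on every window, which is one plausible reason for `ngrams()` being slightly faster.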

=== Example ===

SELECT explode(ngrams(sentences(lower(val)), 2, 10)) AS x FROM kafka;

SELECT explode(context_ngrams(sentences(lower(val)), array("he", null), 10)) AS x FROM kafka;

== histogram_numeric(): Estimating frequency distributions ==
