hive-dev mailing list archives

From "Carl Steinbach (JIRA)" <>
Subject [jira] Updated: (HIVE-1438) sentences() UDF for natural language tokenization
Date Mon, 03 Jan 2011 12:44:46 GMT


Carl Steinbach updated HIVE-1438:

    Component/s:     (was: Query Processor)

> sentences() UDF for natural language tokenization
> -------------------------------------------------
>                 Key: HIVE-1438
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: UDF
>    Affects Versions: 0.7.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.7.0
>         Attachments: HIVE-1438.1.patch, HIVE-1438.2.patch
> Create a generic UDF that tokenizes free-form natural language text into sentences and
words for more advanced processing, while stripping unnecessary punctuation and being fully
internationalization-aware. Fortunately, most of this functionality is already built into Java in
the form of the i18n BreakIterator class, so this UDF will just connect it to Hive. For example:
> > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
> [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]
> or
> > SELECT sentences("Je m'appelle hive!!!", "fr") FROM somedata LIMIT 1;
> [["Je","m'appelle","hive"]]
> Notice how punctuation is kept only where it belongs to a word, such as the apostrophe in the
French example, and is dropped elsewhere. Breaking at sentences (and thus the nested array return
type) is important for tasks like counting the frequency of n-grams in text, which should not
cross sentence boundaries.
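
The approach described above can be sketched in plain Java, outside Hive. This is only an illustration of nesting a word-level BreakIterator inside a sentence-level one, not the code in the attached patches. The class name SentenceTokenizer, the method signature, and the rule of keeping only tokens that start with a letter or digit are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hive: splits text into sentences,
// then splits each sentence into words, using java.text.BreakIterator.
final class SentenceTokenizer {

    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();

        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);

        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next();
             sEnd != BreakIterator.DONE;
             sStart = sEnd, sEnd = sentenceIt.next()) {

            String sentence = text.substring(sStart, sEnd);
            List<String> words = new ArrayList<>();

            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);

            int wStart = wordIt.first();
            for (int wEnd = wordIt.next();
                 wEnd != BreakIterator.DONE;
                 wStart = wEnd, wEnd = wordIt.next()) {

                String token = sentence.substring(wStart, wEnd);
                // Assumption: a token is a word if it starts with a letter or digit.
                // This drops whitespace and stand-alone punctuation tokens.
                if (Character.isLetterOrDigit(token.codePointAt(0))) {
                    words.add(token);
                }
            }

            // Skip sentences that contained no word tokens at all.
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }
}
```

Calling SentenceTokenizer.sentences("Hello there! This is a UDF.", Locale.ENGLISH) returns one inner list per sentence, which mirrors the nested array return type proposed for the UDF. Passing a different Locale, for example Locale.FRENCH, selects locale-specific break rules.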

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
