hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Loren Siebert <>
Subject Re: Hive question, summing second-level domain names
Date Mon, 23 May 2011 20:11:50 GMT
Use regexp_extract() and group on that.

For example, to get class C IP block from an IP address in a column called ‘host':

    regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc


On May 23, 2011, at 1:03 PM, Adam Phelps wrote:

> (As an FYI I'm relatively new to Hive and have no previous SQL experience, so have been
struggling a bit with the Language manual which seems to assume previous SQL experience)
> Suppose I have a table, within which there is a column which contains domain names (ie
such as  I want to perform a count of all second-level domains, ie
and would count in the same bucket.
> Now I could count things for a particular second-level domain like this:
>  year, month, day, hour, COUNT(1) as count
>  domainlog
>  year = 2011 AND
>  month = 05 AND
>  day = 15 AND
>  (
>    domain RLIKE ".*[.]apache[.]org"
>  )
>  year, month, day, hour
> however I'm not seeing a way to sum up all second-level domains rather than a particular
one.  I basically want to group everything using a regular expression along the lines of ".*[.][^.]*[.][^.]*"
and then output lines with a count for the common portion.  Any pointers in the correct direction
would be welcome.
> Thanks
> - Adam

View raw message