GlareDB

GROUP BY

A GROUP BY can be used to specify which columns to use for grouping values during aggregate.

When an aggregate is executed with no GROUP BY clause, all rows are used to produce the final aggregate value. When a GROUP BY is provided, aggregate values are produced for each group.

For example, we can get the count, minimum, maximum, and average city population for each state:

-- Get count, min, max, and average city population for each state
SELECT
    state_abbr,
    count(*),
    min(population),
    max(population),
    avg(population)
FROM cities
GROUP BY state_abbr;
state_abbrcountminmaxavg
TX497988223141571523050.5
OH3311097913175528976

A GROUP BY clause can also be used without any aggregate functions in the SELECT list. In this case, it collapses all rows for each group into a single output row, effectively producing the distinct values for the grouped columns (similar to the DISTINCT keyword).

ROLLUP and CUBE

ROLLUP and CUBE are extensions to the GROUP BY clause that allow generating subtotals and grand totals across multiple dimensions in a single query. They produce multiple grouping sets based on the columns specified.

For the following examples, we'll be working some city population data:

CREATE TEMP TABLE cities (name TEXT, population INT, state_abbr TEXT);
INSERT INTO cities VALUES
  ('Houston', 2314157, 'TX'),
  ('Austin', 979882, 'TX'),
  ('Dallas', 1302868, 'TX'),
  ('San Antonio', 1495295, 'TX'),
  ('Columbus', 913175, 'OH'),
  ('Cincinnati', 311097, 'OH'),
  ('Cleveland', 362656, 'OH');

If we wanted to find the average city population per state, as well as the overall city population average, we can use grouping sets to specify those dimensions.

We can use a ROLLUP for this which will produce the grouping sets (state_abbr, ()), with empty paranthesis representing the "empty" group (how we'll get the overall average):

SELECT state_abbr, avg(population)
FROM cities
GROUP BY ROLLUP (state_abbr);

CUBE

CUBE generates grouping sets for all possible combinations of the columns listed. For GROUP BY CUBE (a, b), it produces grouping sets for (a, b), (a), (b), and ().

Example:

Calculate the average population grouped by all combinations of state_abbr and name. We use GROUPING(state_abbr, name) to identify the subtotal and grand total rows. The GROUPING function returns a bitmask: 0 for base rows, 1 when name is aggregated (state subtotal), 2 when state_abbr is aggregated (city subtotal across states), and 3 when both are aggregated (grand total).

SELECT
    state_abbr,
    name,
    avg(population),
    GROUPING(state_abbr, name) AS grp_level
FROM cities
GROUP BY CUBE (state_abbr, name)
ORDER BY state_abbr NULLS LAST, name NULLS LAST;
state_abbrnameavggrp_levelDescription (not in output)
OHCincinnati3110970Base row
OHCleveland3626560Base row
OHColumbus9131750Base row
OHNULL5289761OH subtotal
TXAustin9798820Base row
TXDallas13028680Base row
TXHouston23141570Base row
TXSan Antonio14952950Base row
TXNULL1523050.51TX subtotal
NULLAustin9798822Austin subtotal
NULLCincinnati3110972Cincinnati subtotal
NULLCleveland3626562Cleveland subtotal
NULLColumbus9131752Columbus subtotal
NULLDallas13028682Dallas subtotal
NULLHouston23141572Houston subtotal
NULLSan Antonio14952952San Antonio subtotal
NULLNULL1097018.57142857143Grand total

GROUPING Function

When using ROLLUP or CUBE, the result set may contain NULL values. These NULLs can either represent actual NULL values present in the grouped columns of the input data, or they can be generated by the ROLLUP/CUBE operation to indicate a subtotal or grand total row.

The GROUPING() function helps distinguish between these cases. It takes one or more column names specified in the GROUP BY clause as arguments.

  • If a column is included in the grouping for a specific result row, GROUPING(column) returns 0.
  • If a column is not included in the grouping (i.e., the row represents an aggregation across that dimension, often indicated by NULL), GROUPING(column) returns 1.

When multiple arguments are provided, GROUPING(col1, col2, ...) returns an integer representing a bitmask, where the least significant bit corresponds to the rightmost argument. For example, GROUPING(a, b) returns:

  • 0 if both a and b are part of the grouping.
  • 1 if a is part of the grouping, but b is not (aggregated over).
  • 2 if b is part of the grouping, but a is not (aggregated over).
  • 3 if neither a nor b are part of the grouping (aggregated over).

Example:

Using the ROLLUP example from before, we can use GROUPING(state_abbr) to identify the grand total row:

SELECT
    state_abbr,
    avg(population),
    GROUPING(state_abbr) AS grp_state
FROM cities
GROUP BY ROLLUP (state_abbr);

Result:

state_abbravggrp_state
NULL1097018.57142857141
TX1523050.50
OH5289760

Column References

When using GROUP BY, any non-aggregated column present in the SELECT list must also be included in the GROUP BY clause. GlareDB enforces this standard SQL rule.

Example:

This query is invalid because name is in the SELECT list but not aggregated and not in the GROUP BY clause:

-- INVALID Query
SELECT state_abbr, name, max(population)
FROM cities
GROUP BY state_abbr;
-- Error: Column 'name' must appear in the GROUP BY clause or be used in an aggregate function

To fix this, either include name in the GROUP BY clause (if you want to group by state and name) or remove it from the SELECT list.

-- Valid Query (grouping by state only)
SELECT state_abbr, max(population)
FROM cities
GROUP BY state_abbr;

-- Valid Query (grouping by state and name)
SELECT state_abbr, name, max(population)
FROM cities
GROUP BY state_abbr, name;