GROUP BY

A GROUP BY can be used to specify which columns to use for grouping values during aggregate.

When an aggregate is executed with no GROUP BY clause, all rows are used to produce the final aggregate value. When a GROUP BY is provided, aggregate values are produced for each group.

For example, we can get the count, minimum, maximum, and average city population for each state:

-- Get count, min, max, and average city population for each state
SELECT
    state_abbr,
    count(*),
    min(population),
    max(population),
    avg(population)
FROM cities
GROUP BY state_abbr;

state_abbr	count	min	max	avg
TX	4	979882	2314157	1523050.5
OH	3	311097	913175	528976

A GROUP BY clause can also be used without any aggregate functions in the SELECT list. In this case, it collapses all rows for each group into a single output row, effectively producing the distinct values for the grouped columns (similar to the DISTINCT keyword).

ROLLUP and CUBE

ROLLUP and CUBE are extensions to the GROUP BY clause that allow generating subtotals and grand totals across multiple dimensions in a single query. They produce multiple grouping sets based on the columns specified.

For the following examples, we'll be working some city population data:

CREATE TEMP TABLE cities (name TEXT, population INT, state_abbr TEXT);
INSERT INTO cities VALUES
  ('Houston', 2314157, 'TX'),
  ('Austin', 979882, 'TX'),
  ('Dallas', 1302868, 'TX'),
  ('San Antonio', 1495295, 'TX'),
  ('Columbus', 913175, 'OH'),
  ('Cincinnati', 311097, 'OH'),
  ('Cleveland', 362656, 'OH');

If we wanted to find the average city population per state, as well as the overall city population average, we can use grouping sets to specify those dimensions.

We can use a ROLLUP for this which will produce the grouping sets (state_abbr, ()), with empty paranthesis representing the "empty" group (how we'll get the overall average):

SELECT state_abbr, avg(population)
FROM cities
GROUP BY ROLLUP (state_abbr);

CUBE

CUBE generates grouping sets for all possible combinations of the columns listed. For GROUP BY CUBE (a, b), it produces grouping sets for (a, b), (a), (b), and ().

Example:

Calculate the average population grouped by all combinations of state_abbr and name. We use GROUPING(state_abbr, name) to identify the subtotal and grand total rows. The GROUPING function returns a bitmask: 0 for base rows, 1 when name is aggregated (state subtotal), 2 when state_abbr is aggregated (city subtotal across states), and 3 when both are aggregated (grand total).

SELECT
    state_abbr,
    name,
    avg(population),
    GROUPING(state_abbr, name) AS grp_level
FROM cities
GROUP BY CUBE (state_abbr, name)
ORDER BY state_abbr NULLS LAST, name NULLS LAST;

state_abbr	name	avg	grp_level	Description (not in output)
OH	Cincinnati	311097	0	Base row
OH	Cleveland	362656	0	Base row
OH	Columbus	913175	0	Base row
OH	NULL	528976	1	OH subtotal
TX	Austin	979882	0	Base row
TX	Dallas	1302868	0	Base row
TX	Houston	2314157	0	Base row
TX	San Antonio	1495295	0	Base row
TX	NULL	1523050.5	1	TX subtotal
NULL	Austin	979882	2	Austin subtotal
NULL	Cincinnati	311097	2	Cincinnati subtotal
NULL	Cleveland	362656	2	Cleveland subtotal
NULL	Columbus	913175	2	Columbus subtotal
NULL	Dallas	1302868	2	Dallas subtotal
NULL	Houston	2314157	2	Houston subtotal
NULL	San Antonio	1495295	2	San Antonio subtotal
NULL	NULL	1097018.5714285714	3	Grand total

`GROUPING` Function

When using ROLLUP or CUBE, the result set may contain NULL values. These NULLs can either represent actual NULL values present in the grouped columns of the input data, or they can be generated by the ROLLUP/CUBE operation to indicate a subtotal or grand total row.

The GROUPING() function helps distinguish between these cases. It takes one or more column names specified in the GROUP BY clause as arguments.

If a column is included in the grouping for a specific result row, GROUPING(column) returns 0.
If a column is not included in the grouping (i.e., the row represents an aggregation across that dimension, often indicated by NULL), GROUPING(column) returns 1.

When multiple arguments are provided, GROUPING(col1, col2, ...) returns an integer representing a bitmask, where the least significant bit corresponds to the rightmost argument. For example, GROUPING(a, b) returns:

0 if both a and b are part of the grouping.
1 if a is part of the grouping, but b is not (aggregated over).
2 if b is part of the grouping, but a is not (aggregated over).
3 if neither a nor b are part of the grouping (aggregated over).

Example:

Using the ROLLUP example from before, we can use GROUPING(state_abbr) to identify the grand total row:

SELECT
    state_abbr,
    avg(population),
    GROUPING(state_abbr) AS grp_state
FROM cities
GROUP BY ROLLUP (state_abbr);

Result:

state_abbr	avg	grp_state
NULL	1097018.5714285714	1
TX	1523050.5	0
OH	528976	0

Column References

When using GROUP BY, any non-aggregated column present in the SELECT list must also be included in the GROUP BY clause. GlareDB enforces this standard SQL rule.

Example:

This query is invalid because name is in the SELECT list but not aggregated and not in the GROUP BY clause:

-- INVALID Query
SELECT state_abbr, name, max(population)
FROM cities
GROUP BY state_abbr;
-- Error: Column 'name' must appear in the GROUP BY clause or be used in an aggregate function

To fix this, either include name in the GROUP BY clause (if you want to group by state and name) or remove it from the SELECT list.

-- Valid Query (grouping by state only)
SELECT state_abbr, max(population)
FROM cities
GROUP BY state_abbr;

-- Valid Query (grouping by state and name)
SELECT state_abbr, name, max(population)
FROM cities
GROUP BY state_abbr, name;

GROUP BY

ROLLUP and CUBE

CUBE

GROUPING Function

Column References

`GROUPING` Function