A GROUP BY
can be used to specify which columns to use for grouping values
during aggregate.
When an aggregate is executed with no GROUP BY
clause, all rows are used to
produce the final aggregate value. When a GROUP BY
is provided, aggregate
values are produced for each group.
For example, we can get the count, minimum, maximum, and average city population for each state:
-- Get count, min, max, and average city population for each state
SELECT
state_abbr,
count(*),
min(population),
max(population),
avg(population)
FROM cities
GROUP BY state_abbr;
state_abbr | count | min | max | avg |
---|---|---|---|---|
TX | 4 | 979882 | 2314157 | 1523050.5 |
OH | 3 | 311097 | 913175 | 528976 |
A GROUP BY
clause can also be used without any aggregate functions in the
SELECT
list. In this case, it collapses all rows for each group into a single
output row, effectively producing the distinct values for the grouped columns
(similar to the DISTINCT
keyword).
ROLLUP
and CUBE
are extensions to the GROUP BY
clause that allow
generating subtotals and grand totals across multiple dimensions in a single
query. They produce multiple grouping sets based on the columns specified.
For the following examples, we'll be working some city population data:
CREATE TEMP TABLE cities (name TEXT, population INT, state_abbr TEXT);
INSERT INTO cities VALUES
('Houston', 2314157, 'TX'),
('Austin', 979882, 'TX'),
('Dallas', 1302868, 'TX'),
('San Antonio', 1495295, 'TX'),
('Columbus', 913175, 'OH'),
('Cincinnati', 311097, 'OH'),
('Cleveland', 362656, 'OH');
If we wanted to find the average city population per state, as well as the overall city population average, we can use grouping sets to specify those dimensions.
We can use a ROLLUP
for this which will produce the grouping sets
(state_abbr, ())
, with empty paranthesis representing the "empty" group (how
we'll get the overall average):
SELECT state_abbr, avg(population)
FROM cities
GROUP BY ROLLUP (state_abbr);
CUBE
generates grouping sets for all possible combinations of the columns
listed. For GROUP BY CUBE (a, b)
, it produces grouping sets for (a, b)
,
(a)
, (b)
, and ()
.
Example:
Calculate the average population grouped by all combinations of state_abbr
and
name
. We use GROUPING(state_abbr, name)
to identify the subtotal and grand
total rows. The GROUPING
function returns a bitmask: 0 for base rows, 1 when
name
is aggregated (state subtotal), 2 when state_abbr
is aggregated (city
subtotal across states), and 3 when both are aggregated (grand total).
SELECT
state_abbr,
name,
avg(population),
GROUPING(state_abbr, name) AS grp_level
FROM cities
GROUP BY CUBE (state_abbr, name)
ORDER BY state_abbr NULLS LAST, name NULLS LAST;
state_abbr | name | avg | grp_level | Description (not in output) |
---|---|---|---|---|
OH | Cincinnati | 311097 | 0 | Base row |
OH | Cleveland | 362656 | 0 | Base row |
OH | Columbus | 913175 | 0 | Base row |
OH | NULL | 528976 | 1 | OH subtotal |
TX | Austin | 979882 | 0 | Base row |
TX | Dallas | 1302868 | 0 | Base row |
TX | Houston | 2314157 | 0 | Base row |
TX | San Antonio | 1495295 | 0 | Base row |
TX | NULL | 1523050.5 | 1 | TX subtotal |
NULL | Austin | 979882 | 2 | Austin subtotal |
NULL | Cincinnati | 311097 | 2 | Cincinnati subtotal |
NULL | Cleveland | 362656 | 2 | Cleveland subtotal |
NULL | Columbus | 913175 | 2 | Columbus subtotal |
NULL | Dallas | 1302868 | 2 | Dallas subtotal |
NULL | Houston | 2314157 | 2 | Houston subtotal |
NULL | San Antonio | 1495295 | 2 | San Antonio subtotal |
NULL | NULL | 1097018.5714285714 | 3 | Grand total |
GROUPING
FunctionWhen using ROLLUP
or CUBE
, the result set may contain NULL
values. These
NULL
s can either represent actual NULL
values present in the grouped columns
of the input data, or they can be generated by the ROLLUP
/CUBE
operation to
indicate a subtotal or grand total row.
The GROUPING()
function helps distinguish between these cases. It takes one or
more column names specified in the GROUP BY
clause as arguments.
GROUPING(column)
returns 0.NULL
),
GROUPING(column)
returns 1.When multiple arguments are provided, GROUPING(col1, col2, ...)
returns an
integer representing a bitmask, where the least significant bit corresponds to
the rightmost argument. For example, GROUPING(a, b)
returns:
a
and b
are part of the grouping.a
is part of the grouping, but b
is not (aggregated over).b
is part of the grouping, but a
is not (aggregated over).a
nor b
are part of the grouping (aggregated over).Example:
Using the ROLLUP
example from before, we can use GROUPING(state_abbr)
to identify the grand total row:
SELECT
state_abbr,
avg(population),
GROUPING(state_abbr) AS grp_state
FROM cities
GROUP BY ROLLUP (state_abbr);
Result:
state_abbr | avg | grp_state |
---|---|---|
NULL | 1097018.5714285714 | 1 |
TX | 1523050.5 | 0 |
OH | 528976 | 0 |
When using GROUP BY
, any non-aggregated column present in the SELECT
list
must also be included in the GROUP BY
clause. GlareDB enforces this standard
SQL rule.
Example:
This query is invalid because name
is in the SELECT
list but not
aggregated and not in the GROUP BY
clause:
-- INVALID Query
SELECT state_abbr, name, max(population)
FROM cities
GROUP BY state_abbr;
-- Error: Column 'name' must appear in the GROUP BY clause or be used in an aggregate function
To fix this, either include name
in the GROUP BY
clause (if you want to
group by state and name) or remove it from the SELECT
list.
-- Valid Query (grouping by state only)
SELECT state_abbr, max(population)
FROM cities
GROUP BY state_abbr;
-- Valid Query (grouping by state and name)
SELECT state_abbr, name, max(population)
FROM cities
GROUP BY state_abbr, name;