GlareDB

Parquet Extension

The parquet extension enables direct querying of Parquet files. It is included by default in the CLI, Python, and WebAssembly (Wasm) bindings.

Functions

read_parquet

Alias: parquet_scan

The read_parquet function takes a path to a Parquet file and returns a table containing the data.

SELECT * FROM read_parquet('cities.parquet');
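Since parquet_scan is listed as an alias, the same query can presumably be written with either name. A minimal sketch, reusing the cities.parquet file from the example above:

```sql
-- parquet_scan is documented as an alias of read_parquet,
-- so this is expected to return the same rows as the query above.
SELECT * FROM parquet_scan('cities.parquet');
```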

By default, read_parquet takes column names and data types from the schema stored in the Parquet file.

To inspect the resulting column names and types, use the DESCRIBE statement:

DESCRIBE read_parquet('cities.parquet');

This returns a table with the name and data type of each column.
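Once the column names are known, the function can be used like any other table in a query. The sketch below assumes cities.parquet contains columns called name and population; these names are hypothetical and chosen only for illustration, so substitute the columns that DESCRIBE reports for your file.

```sql
-- Hypothetical columns: name and population are assumptions,
-- not part of any documented schema for cities.parquet.
SELECT name, population
FROM read_parquet('cities.parquet')
WHERE population > 1000000
ORDER BY population DESC;
```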

For S3 sources, pass the region and credentials as additional named parameters:

SELECT * FROM read_parquet('s3://bucket-name/path/to/file.parquet',
                           region='us-east-1',
                           access_key_id='YOUR_ACCESS_KEY',
                           secret_access_key='YOUR_SECRET_KEY');

Direct URI Querying

Parquet files can also be queried directly by using the file path or URI in the FROM clause:

SELECT * FROM 'cities.parquet';
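A quoted path in the FROM clause should behave like any other table reference, so it can be combined with ordinary SQL such as aggregates. A small sketch, again using the cities.parquet file from the examples above (the row_count alias is only illustrative):

```sql
-- Count the rows in the file without calling read_parquet explicitly.
SELECT count(*) AS row_count
FROM 'cities.parquet';
```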

parquet_file_metadata

Returns high-level metadata about a Parquet file.

Column          Description
file_name       Name of the file being queried.
version         Parquet format version used in the file.
num_rows        Total number of rows in the file.
create_by       Application or library that wrote the file.
num_row_groups  Number of row groups contained within the file.
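This page does not show a call for parquet_file_metadata. The sketch below assumes it is a table function that takes a file path in the same way as read_parquet; treat that signature as an assumption and check it against your installed version.

```sql
-- Assumption: parquet_file_metadata accepts a path argument like read_parquet.
-- Selects a subset of the columns listed in the table above.
SELECT file_name, version, num_rows, num_row_groups
FROM parquet_file_metadata('cities.parquet');
```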

parquet_rowgroup_metadata

Returns metadata for each row group within a Parquet file.

Column             Description
file_name          Name of the file being queried.
num_rows           Number of rows in the row group.
num_columns        Number of columns in the row group.
uncompressed_size  Uncompressed size of the row group in bytes.
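Because the function returns one row per row group, its output can be aggregated to summarize a file. The sketch below assumes parquet_rowgroup_metadata takes a file path like read_parquet does; the output aliases (row_groups, total_rows, total_uncompressed_bytes) are illustrative names only.

```sql
-- Assumption: parquet_rowgroup_metadata accepts a path argument like read_parquet.
-- Summarizes the row groups of a single file.
SELECT file_name,
       count(*)               AS row_groups,
       sum(num_rows)          AS total_rows,
       sum(uncompressed_size) AS total_uncompressed_bytes
FROM parquet_rowgroup_metadata('cities.parquet')
GROUP BY file_name;
```

If the assumption holds, total_rows should match the num_rows value reported by parquet_file_metadata for the same file.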