Merged
33 commits
83cf3f2
adding denmark data
balit-raibot Dec 11, 2025
9fc9408
Merge branch 'master' into denmark_census
balit-raibot Jan 26, 2026
ba12706
standardizing file names
balit-raibot Jan 26, 2026
0515e4a
removing old files
balit-raibot Jan 26, 2026
af14a8d
added two imports
balit-raibot Jan 26, 2026
ced166d
resolving comments
balit-raibot Jan 26, 2026
3e62f91
ignored age above 105
niveditasing Jan 29, 2026
e68d166
ignored age above 105
niveditasing Jan 29, 2026
c2835a1
removing stat_vars.mcf file
smarthg-gi Feb 16, 2026
3cb8ed6
Updating directory structure
smarthg-gi Feb 16, 2026
5fc020e
Updating directory structure
smarthg-gi Feb 16, 2026
7bac0f6
Adding Readme file
smarthg-gi Feb 16, 2026
076957b
reorg folder structure
balit-raibot Feb 16, 2026
0467aba
added future dates
balit-raibot Feb 16, 2026
2f333cf
updated README
balit-raibot Feb 16, 2026
4c9b6fc
Merge branch 'master' into denmark_census
balit-raibot Feb 25, 2026
ed1713b
Merge branch 'master' into denmark_census
balit-raibot Mar 12, 2026
16ec59f
Merge branch 'master' into denmark_census
balit-raibot Mar 15, 2026
83c79ef
converted manual downloading to automatic
balit-raibot Mar 16, 2026
b6bd1a6
Merge branch 'denmark_census' of https://github.com/balit-raibot/data…
balit-raibot Mar 16, 2026
7780a0a
adding manifest.json
balit-raibot Mar 16, 2026
fe2e2d3
adding manifest.json
balit-raibot Mar 16, 2026
f905144
added a flag to download all data instead of current and previous year
balit-raibot Mar 16, 2026
4eb4232
increased timeout from 1 hour to 10 hours
balit-raibot Mar 16, 2026
2d4e5e3
refactored download script to download country level stats
balit-raibot Mar 17, 2026
b90c413
Merge branch 'master' into denmark_census
balit-raibot Mar 17, 2026
bb91162
modified manifest resource limits
balit-raibot Mar 17, 2026
1a276dc
Merge branch 'denmark_census' of https://github.com/balit-raibot/data…
balit-raibot Mar 17, 2026
b4afc92
Merge branch 'master' into denmark_census
balit-raibot Mar 17, 2026
44dab64
Merge branch 'master' into denmark_census
balit-raibot Mar 18, 2026
9c0b100
addressing comments to refactor download script
balit-raibot Mar 23, 2026
6e4e437
Merge branch 'denmark_census' of https://github.com/balit-raibot/data…
balit-raibot Mar 23, 2026
b89e793
modularising the code
balit-raibot Mar 24, 2026
68 changes: 68 additions & 0 deletions statvar_imports/denmark_demographics/README.md
@@ -0,0 +1,68 @@
# Statistics Denmark Demographics Dataset
## Overview
This dataset contains demographic statistics for the population of Denmark, sourced from Statistics Denmark. It includes two datasets, one quarterly and one annual, that break the population down by sex, age, and marital status. The source tables also cover regions and municipalities, but this import uses national totals only.

The import covers:
- **Population (Quarterly):** Population count by region, marital status, age, and sex at the first day of each quarter (Table FOLK1A).
- **Population (Annual):** Population count by sex and age groups (Table BEFOLK2).

Type of place: Country

## Data Source
**Source URL:**
- Main Portal: https://www.statbank.dk/statbank5a/default.asp?w=1396
- Specific Table (FOLK1A): https://www.statbank.dk/FOLK1A

**Provenance Description:**
The data is provided by Statistics Denmark, the central authority for Danish statistics. The population figures are derived from the Central Person Register (CPR) and reflect the population residing in Denmark on the first day of the period.

## How To Download Input Data
To download the data manually:
1. Go to the [StatBank Denmark Portal](https://www.statbank.dk/statbank5a/default.asp?w=1396).
2. Browse or search for the desired population tables. For quarterly demographics, search for table **FOLK1A** (Population at the first day of the quarter).
3. Select the desired variables:
   - **Region:** All Denmark.
   - **Marital Status:** Total, Never married, Married/separated, Widowed, Divorced.
   - **Age:** Individual ages or age groups.
   - **Sex:** Men, Women.
   - **Time:** Quarters.
4. Click "Show table" and then "Download" to save as CSV.
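
The same selection can also be expressed as a request to the StatBank API, which is the approach taken by the download script in this import. The following is a minimal sketch, assuming the endpoint `https://api.statbank.dk/v1/data` and the variable codes used by that script. `build_folk1a_payload` and `build_request` are hypothetical helper names, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Endpoint as used by the download script in this import.
STATBANK_URL = "https://api.statbank.dk/v1/data"


def build_folk1a_payload(region_codes=("000",)):
    """Builds a FOLK1A query; "*" requests every category of a variable."""
    return {
        "table": "FOLK1A",
        "format": "JSONSTAT",
        "lang": "en",
        "variables": [
            # "000" is assumed to be the code for "All Denmark".
            {"code": "OMRÅDE", "values": list(region_codes)},
            {"code": "KØN", "values": ["*"]},
            {"code": "ALDER", "values": ["*"]},
            {"code": "CIVILSTAND", "values": ["*"]},
            {"code": "Tid", "values": ["*"]},
        ],
    }


def build_request(payload, url=STATBANK_URL):
    """Wraps the payload in a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(build_folk1a_payload())
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the download. The sketch stops short of that so it can run without network access.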

## Processing Instructions
To process the Denmark Demographics data and generate statistical variables, use the following commands:

**For Data Run (Quarterly Run)**
```bash
python ../../tools/statvar_importer/stat_var_processor.py \
--input_data='gs://unresolved_mcf/country/denmark/input_files/population_quarterly_region_time_marital_status_input.csv' \
--pv_map='population_quarterly_region_time_marital_status_pvmap.csv' \
--output_path='population_quarterly_region_time_marital_status_output' \
--config_file='denmark_demographics_metadata.csv' \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
```
**For Data Run (Annual Run)**
```bash
python ../../tools/statvar_importer/stat_var_processor.py \
--input_data='gs://unresolved_mcf/country/denmark/input_files/population_sex_age_time_input.csv' \
--pv_map='population_sex_age_time_pvmap.csv' \
--output_path='population_sex_age_time_output' \
--config_file='denmark_demographics_metadata.csv' \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
```

On a first run, each command generates the following output files, where `output` stands for the value passed to `--output_path`:
- output.csv
- output_stat_vars_schema.mcf
- output_stat_vars.mcf
- output.tmcf
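
Before moving on to validation, it can help to confirm that a run wrote everything it should. The following is a minimal sketch, assuming the processor appends the four suffixes listed above to the `--output_path` value. `expected_outputs` and `missing_outputs` are hypothetical helper names, not part of the import tooling.

```python
from pathlib import Path

# Suffixes taken from the file list above; treating them as appended to
# the --output_path value is an assumption of this sketch.
OUTPUT_SUFFIXES = [".csv", "_stat_vars_schema.mcf", "_stat_vars.mcf", ".tmcf"]


def expected_outputs(output_path):
    """Returns the files a run is expected to write for one --output_path."""
    return [Path(output_path + suffix) for suffix in OUTPUT_SUFFIXES]


def missing_outputs(output_path):
    """Returns the expected files that do not exist on disk."""
    return [path for path in expected_outputs(output_path) if not path.exists()]


for path in expected_outputs("population_sex_age_time_output"):
    print(path)
```

Calling `missing_outputs` with the same prefix after a run returns an empty list when all four files are present.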

## Data Quality Checks and Validation
Validation is performed using the Data Commons import tool:

```bash
java -jar datacommons-import-tool-jar-with-dependencies.jar lint \
output_stat_vars_schema.mcf \
output.csv \
output.tmcf \
output_stat_vars.mcf
```

The tool generates `report.json`, `summary_report.csv`, and `summary_report.html`, which can be used to identify errors and warnings in the generated data.
@@ -0,0 +1,3 @@
parameter,value
output_columns,"observationDate,value,observationAbout,variableMeasured"
dc_api_root,https://api.datacommons.org
@@ -0,0 +1,111 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script downloads quarterly population data from the Statistics Denmark API (Statbank).
It fetches demographic data including region, sex, age, and marital status,
processes the JSON-STAT response into a flat structure, and saves it as a CSV.
"""

import pandas as pd
import itertools
import os
from absl import logging
from statbank_utils import find_key_recursive, fetch_statbank_api

logging.set_verbosity(logging.INFO)

# --- CONFIGURATION ---
# url: API endpoint for Statistics Denmark.
# output_dir: Local path for the generated CSV.
# table_id: FOLK1A (Population at the first day of the quarter).
url = "https://api.statbank.dk/v1/data"
output_dir = "./input_files/"
table_id = "FOLK1A"

def main():
    """
    Orchestrates the data extraction, transformation, and loading (ETL) process.

    1. Fetches data via a POST request using fetch_statbank_api.
    2. Parses the nested JSON-STAT structure into dimensions and values.
    3. Reconstructs the dataset using a Cartesian product of labels.
    4. Cleanses and standardizes labels for downstream use.
    """
    logging.info("--- Starting Statistics Denmark data extraction ---")

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        logging.info(f"Created output directory: {output_dir}")

    # Define the API query payload
    # '*' is a wildcard requesting all sub-categories for the given variable.
    payload = {
        "table": table_id,
        "format": "JSONSTAT",
        "lang": "en",
        "variables": [
            {"code": "OMRÅDE", "values": ["000"]},
            {"code": "KØN", "values": ["*"]},
            {"code": "ALDER", "values": ["*"]},
            {"code": "CIVILSTAND", "values": ["*"]},
            {"code": "Tid", "values": ["*"]}
        ]
    }

    try:
        response = fetch_statbank_api(url, table_id, payload)
        full_data = response.json()

        dims = find_key_recursive(full_data, 'dimension')
        vals = find_key_recursive(full_data, 'value')

        if dims and vals:
            logging.info("Successfully retrieved dimensions and values. Processing...")
            ids = find_key_recursive(full_data, 'id') or list(dims.keys())
            role = find_key_recursive(full_data, 'role') or {}
            metric_ids = role.get('metric', [])

            # Extract readable labels for each dimension to prepare for product calculation
            dim_list, col_names = [], []
            for d_id in ids:
                if d_id in metric_ids or d_id.lower() in ['indhold', 'contents']:
                    continue
                labels = dims[d_id]['category']['label']
                dim_list.append(list(labels.values()))
                col_names.append(d_id)

            # JSON-STAT values are flat; we align them by creating a Cartesian product
            # of all dimension labels.
            df = pd.DataFrame(list(itertools.product(*dim_list)), columns=col_names)
            df['Value'] = vals

            # Standardize Danish column names to English
            df = df.rename(columns={
                'OMRÅDE': 'Region', 'ALDER': 'Age',
                'CIVILSTAND': 'Marital_Status', 'Tid': 'Quarter', 'KØN': 'Sex'
            })

            df.loc[df['Sex'] == 'Total', 'Sex'] = 'Gender_Total'
            df.loc[df['Marital_Status'] == 'Total', 'Marital_Status'] = 'Marital_Total'

            filename = 'population_quarterly_region_time_marital_status_input.csv'
            save_path = os.path.join(output_dir, filename)
            df.to_csv(save_path, index=False)
            logging.info(f"Execution successful! Saved {len(df)} rows to {save_path}")
    except Exception as e:
        logging.fatal(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    main()
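
The alignment step in the script above relies on JSON-stat storing observations as one flat list in row-major order over the dimensions, so the last dimension varies fastest. The following is a minimal sketch on made-up data, using only the standard library. The dimension names, labels, and values are invented for illustration and are not taken from FOLK1A.

```python
import itertools

# Hypothetical labels per dimension, in the order the response lists them.
dimensions = {
    "Sex": ["Men", "Women"],
    "Quarter": ["2024Q1", "2024Q2", "2024Q3"],
}
# Flat values in row-major order: all quarters for Men, then all for Women.
values = [10, 11, 12, 20, 21, 22]

# itertools.product varies the last dimension fastest, matching the layout.
combinations = list(itertools.product(*dimensions.values()))

# Guard against misalignment before pairing labels with values.
if len(combinations) != len(values):
    raise ValueError(
        f"{len(combinations)} label combinations but {len(values)} values")

rows = [
    dict(zip(dimensions, combo), Value=value)
    for combo, value in zip(combinations, values)
]
for row in rows:
    print(row)
```

The explicit length check is a design choice of this sketch: it turns a mismatch between labels and values into a clear error instead of leaving it to the later column assignment.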
@@ -0,0 +1,113 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script extracts population data by sex, age, and year from the Statistics
Denmark API using the BULK format. It performs dynamic sorting on demographic
categories (ensuring 'Total' values appear first) and pivots the data to a
wide-format time series before saving to CSV.
"""

import pandas as pd
import os
import re
from io import StringIO
from absl import logging
from statbank_utils import fetch_statbank_api

# Set logging verbosity
logging.set_verbosity(logging.INFO)

# --- CONFIGURATION ---
url = "https://api.statbank.dk/v1/data"
output_dir = "./input_files/"
table_id = "BEFOLK2"

def get_age_rank(age_str):
    """Assigns a numerical rank to age strings to facilitate correct sorting."""
    age_str = str(age_str).lower()
    if 'total' in age_str:
        return -1
    nums = re.findall(r'\d+', age_str)
    return int(nums[0]) if nums else 999

def main():
    """Main orchestration function to fetch, sort, pivot, and save the population data."""
    logging.info("--- Starting Statistics Denmark BEFOLK2 data extraction ---")

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        logging.info(f"Created output directory: {output_dir}")

    payload = {
        "table": table_id,
        "format": "BULK",
        "lang": "en",
        "variables": [
            {"code": "KØN", "values": ["*"]},
            {"code": "ALDER", "values": ["*"]},
            {"code": "Tid", "values": ["*"]}
        ]
    }

    try:
        # Use modularized download logic
        response = fetch_statbank_api(url, table_id, payload)

        if response.status_code == 200:
            # Process the semicolon-separated bulk response
            df = pd.read_csv(StringIO(response.text), sep=';')
            sex_col, age_col, time_col, val_col = df.columns

            # Log the total number of rows received
            logging.info(f"Data received. Processing {len(df)} rows.")

            # 1. DYNAMIC SEX SORTING
            sex_order = sorted(df[sex_col].unique(), key=lambda x: 0 if 'total' in str(x).lower() else 1)
            df[sex_col] = pd.Categorical(df[sex_col], categories=sex_order, ordered=True)

            # 2. DYNAMIC AGE SORTING
            df['age_sort'] = df[age_col].apply(get_age_rank)

            # 3. DYNAMIC YEAR SORTING
            df[time_col] = df[time_col].apply(lambda x: int(re.search(r'\d+', str(x)).group()))

            # Sort and Pivot
            df = df.sort_values([sex_col, 'age_sort', time_col])
            df_pivot = df.pivot_table(
                index=[sex_col, age_col],
                columns=time_col,
                values=val_col,
                aggfunc='first',
                sort=False
            ).reset_index()

            df_pivot = df_pivot.rename(columns={'ALDER': 'Age', 'KØN': 'Sex'})

            # --- SAVE ---
            filename = "population_sex_age_time_input.csv"
            save_path = os.path.join(output_dir, filename)
            df_pivot.to_csv(save_path, index=False, encoding='utf-8-sig')
            logging.info(f"File saved successfully: {save_path}")

        else:
            logging.error(f"Request failed with status code: {response.status_code}")

    except Exception as e:
        logging.fatal(f"An unexpected error occurred during processing: {e}")

    logging.info("--- Script execution finished ---")

if __name__ == "__main__":
    main()
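
The sort-then-pivot logic in the script above can be illustrated without pandas. The following is a minimal sketch on invented rows: `age_rank` mirrors the script's `get_age_rank`, while `pivot_wide` is a hypothetical helper standing in for `pivot_table`. The labels and counts are made up, and unlike the script, this sketch orders the non-total sexes alphabetically.

```python
import re


def age_rank(age_label):
    """Ranks 'Total' first, then ages by their first number, unknowns last."""
    text = str(age_label).lower()
    if "total" in text:
        return -1
    numbers = re.findall(r"\d+", text)
    return int(numbers[0]) if numbers else 999


def pivot_wide(rows):
    """Turns (sex, age, year, value) rows into one dict per (sex, age)."""
    ordered = sorted(
        rows,
        key=lambda r: (
            0 if "total" in r[0].lower() else 1,  # 'Total' sex first
            r[0],                                  # then sex alphabetically
            age_rank(r[1]),                        # then numeric age order
            r[2],                                  # then year
        ),
    )
    wide = {}
    for sex, age, year, value in ordered:
        # One record per (sex, age); each year becomes its own column.
        wide.setdefault((sex, age), {"Sex": sex, "Age": age})[year] = value
    return list(wide.values())


rows = [
    ("Women", "10 years", 2024, 7),
    ("Total", "Total", 2023, 100),
    ("Women", "9 years", 2024, 5),
    ("Total", "Total", 2024, 101),
    ("Women", "9 years", 2023, 4),
]
for record in pivot_wide(rows):
    print(record)
```

Ranking by the first number in the label is what keeps "9 years" ahead of "10 years"; a plain string sort would reverse them.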
29 changes: 29 additions & 0 deletions statvar_imports/denmark_demographics/manifest.json
@@ -0,0 +1,29 @@
{
  "import_specifications": [
    {
      "import_name": "Denmark_Demographics",
      "curator_emails": [
        "support@datacommons.org"
      ],
      "provenance_url": "https://www.statbank.dk/statbank5a/default.asp?w=1280",
      "provenance_description": "Population data for Denmark from Statbank",
      "scripts": [
        "download_population_quarterly_region_time_marital_status.py",
        "download_population_sex_age_time.py",
        "../../tools/statvar_importer/stat_var_processor.py --input_data=./input_files/population_quarterly_region_time_marital_status_input.csv --pv_map=./population_quarterly_region_time_marital_status_pvmap.csv --config_file=./denmark_demographics_metadata.csv --output_path=./output/population_quarterly_region_time_marital_status_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf",
        "../../tools/statvar_importer/stat_var_processor.py --input_data=./input_files/population_sex_age_time_input.csv --pv_map=./population_sex_age_time_pvmap.csv --config_file=./denmark_demographics_metadata.csv --output_path=./output/population_sex_age_time_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf"
      ],
      "import_inputs": [
        {
          "template_mcf": "output/population_sex_age_time_output.tmcf",
          "cleaned_csv": "output/*_output.csv"
        }
      ],
      "source_files": [
        "./input_files/*.csv"
      ],
      "user_script_timeout": 36000,
      "cron_schedule": "0 10 20 2,5,8,11 *"
    }
  ]
}