datacommonsorg · balit-raibot · Mar 24, 2026 · Dec 11, 2025 · Jan 26, 2026 · Jan 26, 2026
diff --git a/statvar_imports/denmark_demographics/README.md b/statvar_imports/denmark_demographics/README.md
@@ -0,0 +1,68 @@
+# Statistics Denmark Demographics Dataset
+## Overview
+This dataset contains demographic statistics for the population of Denmark, sourced from Statistics Denmark. It covers two tables, one quarterly and one annual. Depending on the table, the population is broken down by geography (regions and municipalities), sex, age, and marital status. This import retrieves national-level figures only.
+
+The import covers:
+- **Population (Quarterly):** Population count by region, marital status, age, and sex at the first day of each quarter (Table FOLK1A).
+- **Population (Annual):** Population count by sex and age (Table BEFOLK2).
+
+Type of place: Country
+
+## Data Source
+**Source URL:**
+- Main Portal: https://www.statbank.dk/statbank5a/default.asp?w=1396
+- Specific Table (FOLK1A): https://www.statbank.dk/FOLK1A
+
+**Provenance Description:**
+The data is provided by Statistics Denmark, the central authority for Danish statistics. The population figures are derived from the Central Person Register (CPR) and reflect the population residing in Denmark on the first day of the period.
+
+## How To Download Input Data
+To download the data manually:
+1. Go to the [StatBank Denmark Portal](https://www.statbank.dk/statbank5a/default.asp?w=1396).
+2. Browse or search for the desired population tables. For quarterly demographics, search for table **FOLK1A** (Population at the first day of the quarter).
+3. Select the desired variables:
+   - **Region:** All Denmark.
+   - **Marital Status:** Total, Never married, Married/separated, Widowed, Divorced.
+   - **Age:** Individual ages or age groups.
+   - **Sex:** Men, Women.
+   - **Time:** Quarters.
+4. Click "Show table" and then "Download" to save as CSV.
+
+## Processing Instructions
+To process the Denmark Demographics data and generate statistical variables, use the following commands:
+
+**For Data Run (Quarterly Run)**
+```bash
+python ../../tools/statvar_importer/stat_var_processor.py \
+    --input_data='gs://unresolved_mcf/country/denmark/input_files/population_quarterly_region_time_marital_status_input.csv' \
+    --pv_map='population_quarterly_region_time_marital_status_pvmap.csv' \
+    --output_path='population_quarterly_region_time_marital_status_output' \
+    --config_file='denmark_demographics_metadata.csv' \
+    --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
+```
+**For Data Run (Annual Run)**
+```bash
+python ../../tools/statvar_importer/stat_var_processor.py \
+    --input_data='gs://unresolved_mcf/country/denmark/input_files/population_sex_age_time_input.csv' \
+    --pv_map='population_sex_age_time_pvmap.csv' \
+    --output_path='population_sex_age_time_output' \
+    --config_file='denmark_demographics_metadata.csv' \
+    --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
+```
+
+On the first run, each command generates the following output files, where `output` stands for the `--output_path` value:
+- output.csv
+- output_stat_vars_schema.mcf
+- output_stat_vars.mcf
+- output.tmcf
+
+## Data Quality Checks and Validation
+Validation is performed using the Data Commons import tool:
+
+```bash
+java -jar datacommons-import-tool-jar-with-dependencies.jar lint \
+    output_stat_vars_schema.mcf \
+    output.csv \
+    output.tmcf \
+    output_stat_vars.mcf
+```
+
+The tool generates `report.json`, `summary_report.csv`, and `summary_report.html`, which can be used to identify errors or warnings in the generated data.
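
Review note (a sketch, not part of the diff): the two processing commands in the README differ only in the dataset prefix, so they could be generated from one list. The helper name `build_command` and the default directories are assumptions made for illustration; the flags and file names are the ones used in the README and `manifest.json`.

```python
import shlex

# Dataset prefixes taken from the file names in this PR.
DATASETS = [
    "population_quarterly_region_time_marital_status",
    "population_sex_age_time",
]


def build_command(prefix, input_dir="./input_files", output_dir="./output"):
    """Return the stat_var_processor.py argument list for one dataset prefix."""
    return [
        "python",
        "../../tools/statvar_importer/stat_var_processor.py",
        f"--input_data={input_dir}/{prefix}_input.csv",
        f"--pv_map={prefix}_pvmap.csv",
        "--config_file=denmark_demographics_metadata.csv",
        f"--output_path={output_dir}/{prefix}_output",
        "--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf",
    ]


# Print one shell-ready command per dataset.
for prefix in DATASETS:
    print(shlex.join(build_command(prefix)))
```

Deriving every path from a single prefix would also make a misspelled file name fail in one obvious place rather than in one flag of one command.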
diff --git a/statvar_imports/denmark_demographics/denmark_demographics_metadata.csv b/statvar_imports/denmark_demographics/denmark_demographics_metadata.csv
@@ -0,0 +1,3 @@
+parameter,value
+output_columns,"observationDate,value,observationAbout,variableMeasured"
+dc_api_root,https://api.datacommons.org
diff --git a/..._imports/denmark_demographics/download_population_quarterly_region_time_marital_status.py b/..._imports/denmark_demographics/download_population_quarterly_region_time_marital_status.py
@@ -0,0 +1,111 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the 'License');
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#         https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an 'AS IS' BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This script downloads quarterly population data from the Statistics Denmark API (Statbank).
+It fetches demographic data including region, sex, age, and marital status, 
+processes the JSON-STAT response into a flat structure, and saves it as a CSV.
+"""
+
+import pandas as pd
+import itertools
+import os
+from absl import logging
+from statbank_utils import find_key_recursive, fetch_statbank_api
+
+logging.set_verbosity(logging.INFO)
+
+# --- CONFIGURATION ---
+# url: API endpoint for Statistics Denmark.
+# output_dir: Local path for the generated CSV.
+# table_id: FOLK1A (Population at the first day of the quarter).
+url = "https://api.statbank.dk/v1/data"
+output_dir = "./input_files/"
+table_id = "FOLK1A"
+
+def main():
+    """
+    Orchestrates the data extraction, transformation, and loading (ETL) process.
+
+    1. Fetches data via a POST request using fetch_statbank_api.
+    2. Parses the nested JSON-STAT structure into dimensions and values.
+    3. Reconstructs the dataset using a Cartesian product of labels.
+    4. Cleanses and standardizes labels for downstream use.
+    """
+    logging.info("--- Starting Statistics Denmark data extraction ---")
+
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+        logging.info(f"Created output directory: {output_dir}")
+
+    # Define the API query payload
+    # '*' is a wildcard requesting all sub-categories for the given variable.
+    payload = {
+        "table": table_id,
+        "format": "JSONSTAT",
+        "lang": "en",
+        "variables": [
+            {"code": "OMRÅDE", "values": ["000"]},
+            {"code": "KØN", "values": ["*"]},
+            {"code": "ALDER", "values": ["*"]},
+            {"code": "CIVILSTAND", "values": ["*"]},
+            {"code": "Tid", "values": ["*"]}  
+        ]
+    }
+
+    try:
+        response = fetch_statbank_api(url, table_id, payload)
+        full_data = response.json()
+
+        dims = find_key_recursive(full_data, 'dimension')
+        vals = find_key_recursive(full_data, 'value')
+
+        if dims and vals:
+            logging.info("Successfully retrieved dimensions and values. Processing...")
+            ids = find_key_recursive(full_data, 'id') or list(dims.keys())
+            role = find_key_recursive(full_data, 'role') or {}
+            metric_ids = role.get('metric', [])
+
+            # Extract readable labels for each dimension to prepare for product calculation
+            dim_list, col_names = [], []
+            for d_id in ids:
+                if d_id in metric_ids or d_id.lower() in ['indhold', 'contents']: 
+                    continue
+                labels = dims[d_id]['category']['label']
+                dim_list.append(list(labels.values()))
+                col_names.append(d_id)
+
+            # JSON-STAT values are flat; we align them by creating a Cartesian product 
+            # of all dimension labels.
+            df = pd.DataFrame(list(itertools.product(*dim_list)), columns=col_names)
+            df['Value'] = vals
+
+            # Standardize Danish column names to English
+            df = df.rename(columns={
+                'OMRÅDE': 'Region', 'ALDER': 'Age', 
+                'CIVILSTAND': 'Marital_Status', 'Tid': 'Quarter', 'KØN': 'Sex'
+            })
+
+            df.loc[df['Sex'] == 'Total', 'Sex'] = 'Gender_Total'
+            df.loc[df['Marital_Status'] == 'Total', 'Marital_Status'] = 'Marital_Total'
+
+            filename = 'population_quarterly_region_time_marital_status_input.csv'
+            save_path = os.path.join(output_dir, filename)
+            df.to_csv(save_path, index=False)
+            logging.info(f"Execution successful! Saved {len(df)} rows to {save_path}")
+        else:
+            logging.error("Response is missing 'dimension' or 'value'; no file was written.")
+    except Exception as e:
+        logging.fatal(f"An unexpected error occurred: {e}")
+
+if __name__ == "__main__":
+    main()
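
Review note (a sketch, not part of the diff): the script pairs the flat `value` array with a Cartesian product of `label.values()`, which relies on the label objects being listed in category order. JSON-stat defines category order through `category.index` (either an array of ids or an object mapping id to position), and values are stored row-major with the last dimension varying fastest. The helpers `ordered_labels` and `jsonstat_to_frame` below are hypothetical names, and the toy numbers are made up; this only illustrates ordering by `index` and checking the value count before assignment.

```python
import itertools

import pandas as pd


def ordered_labels(dimension):
    """Return category labels in the order given by category.index."""
    category = dimension["category"]
    labels = category.get("label", {})
    index = category.get("index")
    if index is None:
        # No explicit index: fall back to the label insertion order.
        ids = list(labels)
    elif isinstance(index, dict):
        # Object form: {category_id: position}.
        ids = sorted(index, key=index.get)
    else:
        # Array form: category ids already listed in position order.
        ids = list(index)
    return [labels.get(cat_id, cat_id) for cat_id in ids]


def jsonstat_to_frame(dims, ids, values):
    """Pair row-major JSON-stat values with the product of dimension labels."""
    label_lists = [ordered_labels(dims[d]) for d in ids]
    rows = list(itertools.product(*label_lists))
    if len(rows) != len(values):
        raise ValueError(f"expected {len(rows)} values, got {len(values)}")
    frame = pd.DataFrame(rows, columns=ids)
    frame["Value"] = values
    return frame


# Toy dataset (made-up numbers): 2 sexes x 2 quarters.
dims = {
    "KØN": {"category": {"index": {"2": 1, "1": 0},
                         "label": {"2": "Women", "1": "Men"}}},
    "Tid": {"category": {"index": ["2024Q1", "2024Q2"],
                         "label": {"2024Q1": "2024Q1", "2024Q2": "2024Q2"}}},
}
frame = jsonstat_to_frame(dims, ["KØN", "Tid"], [10, 11, 20, 21])
print(frame)
```

In the toy data the `label` object lists "Women" first, but `index` places "Men" at position 0, so the first two values are assigned to "Men". The explicit length check turns a shape mismatch into a clear error instead of a pandas assignment failure.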
diff --git a/statvar_imports/denmark_demographics/download_population_sex_age_time.py b/statvar_imports/denmark_demographics/download_population_sex_age_time.py
@@ -0,0 +1,113 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the 'License');
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#          https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an 'AS IS' BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This script extracts population data by sex, age, and year from the Statistics 
+Denmark API using the BULK format. It performs dynamic sorting on demographic 
+categories (ensuring 'Total' values appear first) and pivots the data to a 
+wide-format time series before saving to CSV.
+"""
+
+import pandas as pd
+import os
+import re
+from io import StringIO
+from absl import logging
+from statbank_utils import fetch_statbank_api
+
+# Set logging verbosity
+logging.set_verbosity(logging.INFO)
+
+# --- CONFIGURATION ---
+url = "https://api.statbank.dk/v1/data"
+output_dir = "./input_files/"
+table_id = "BEFOLK2"
+
+def get_age_rank(age_str):
+    """Assigns a numerical rank to age strings to facilitate correct sorting."""
+    age_str = str(age_str).lower()
+    if 'total' in age_str:
+        return -1
+    nums = re.findall(r'\d+', age_str)
+    return int(nums[0]) if nums else 999
+
+def main():
+    """Main orchestration function to fetch, sort, pivot, and save the population data."""
+    logging.info("--- Starting Statistics Denmark BEFOLK2 data extraction ---")
+
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+        logging.info(f"Created output directory: {output_dir}")
+
+    payload = {
+       "table": table_id,
+       "format": "BULK",
+       "lang": "en",
+       "variables": [
+          {"code": "KØN", "values": ["*"]},
+          {"code": "ALDER", "values": ["*"]},
+          {"code": "Tid", "values": ["*"]}
+       ]
+    }
+
+    try:
+        # Use modularized download logic
+        response = fetch_statbank_api(url, table_id, payload)
+
+        if response.status_code == 200:
+            # Process the semicolon-separated bulk response
+            df = pd.read_csv(StringIO(response.text), sep=';')
+            sex_col, age_col, time_col, val_col = df.columns
+
+            # Log the total number of rows received
+            logging.info(f"Data received. Processing {len(df)} rows.")
+
+            # 1. DYNAMIC SEX SORTING
+            sex_order = sorted(df[sex_col].unique(), key=lambda x: 0 if 'total' in str(x).lower() else 1)
+            df[sex_col] = pd.Categorical(df[sex_col], categories=sex_order, ordered=True)
+
+            # 2. DYNAMIC AGE SORTING
+            df['age_sort'] = df[age_col].apply(get_age_rank)
+
+            # 3. DYNAMIC YEAR SORTING
+            df[time_col] = df[time_col].apply(lambda x: int(re.search(r'\d+', str(x)).group()))
+
+            # Sort and Pivot
+            df = df.sort_values([sex_col, 'age_sort', time_col])
+            df_pivot = df.pivot_table(
+                index=[sex_col, age_col],
+                columns=time_col,
+                values=val_col,
+                aggfunc='first',
+                sort=False 
+            ).reset_index()
+
+            df_pivot = df_pivot.rename(columns={'ALDER': 'Age', 'KØN': 'Sex'})
+
+            # --- SAVE ---
+            filename = "population_sex_age_time_input.csv"
+            save_path = os.path.join(output_dir, filename)
+            df_pivot.to_csv(save_path, index=False, encoding='utf-8-sig')
+            logging.info(f"File saved successfully: {save_path}")
+
+        else:
+            logging.error(f"Request failed with status code: {response.status_code}")
+
+    except Exception as e:
+        logging.fatal(f"An unexpected error occurred during processing: {e}")
+
+    logging.info("--- Script execution finished ---")
+
+if __name__ == "__main__":
+    main()
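
Review note (a sketch, not part of the diff): a quick way to see what `get_age_rank` buys over a plain string sort. The function body is copied from the script above; the age labels are hypothetical and may not match the actual BEFOLK2 labels.

```python
import re


def get_age_rank(age_str):
    """Copy of the helper from the script above."""
    age_str = str(age_str).lower()
    if 'total' in age_str:
        return -1
    nums = re.findall(r'\d+', age_str)
    return int(nums[0]) if nums else 999


# Hypothetical labels; the real BEFOLK2 labels may differ.
labels = [
    "10-14 years", "Age, total", "5-9 years",
    "100 years and over", "Unknown", "0-4 years",
]

ranked = sorted(labels, key=get_age_rank)  # total first, then numeric order
plain = sorted(labels)                     # lexicographic order, for comparison

print(ranked)
print(plain)
```

With these labels the ranked order puts the total first, the age bands in numeric order, and the label without digits last, whereas the plain sort places "10-14 years" before "5-9 years".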
diff --git a/statvar_imports/denmark_demographics/manifest.json b/statvar_imports/denmark_demographics/manifest.json
@@ -0,0 +1,29 @@
+{
+    "import_specifications": [
+        {
+            "import_name": "Denmark_Demographics",
+            "curator_emails": [
+                "support@datacommons.org"
+            ],
+            "provenance_url": "https://www.statbank.dk/statbank5a/default.asp?w=1280",
+            "provenance_description": "Population data for Denmark from Statbank",
+            "scripts": [
+                "download_population_quarterly_region_time_marital_status.py",
+                "download_population_sex_age_time.py",
+                "../../tools/statvar_importer/stat_var_processor.py --input_data=./input_files/population_quarterly_region_time_marital_status_input.csv --pv_map=./population_quarterly_region_time_marital_status_pvmap.csv --config_file=./denmark_demographics_metadata.csv --output_path=./output/population_quarterly_region_time_marital_status_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf",
+                "../../tools/statvar_importer/stat_var_processor.py --input_data=./input_files/population_sex_age_time_input.csv --pv_map=./population_sex_age_time_pvmap.csv --config_file=./denmark_demographics_metadata.csv --output_path=./output/population_sex_age_time_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf"
+            ],
+            "import_inputs": [
+                {
+                    "template_mcf": "output/population_sex_age_time_output.tmcf",
+                    "cleaned_csv": "output/*_output.csv"
+                }
+            ],
+            "source_files": [
+                "./input_files/*.csv"
+            ],
+            "user_script_timeout": 36000,
+            "cron_schedule": "0 10 20 2,5,8,11 *"
+        }
+    ]
+}
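
Review note (a sketch, not part of the diff): `import_inputs` declares a template only for the annual output, while `cleaned_csv` matches both outputs. That may be intentional, since both runs share the same `output_columns`, but it is worth confirming. The check below uses an abbreviated copy of the manifest and assumes the processor writes `<output_path>.tmcf`, which matches the one declared entry; `missing_templates` is a hypothetical helper name.

```python
import json
import re

# Abbreviated copy of the manifest: only the fields the check needs.
manifest_excerpt = json.loads("""
{
  "scripts": [
    "stat_var_processor.py --output_path=./output/population_quarterly_region_time_marital_status_output",
    "stat_var_processor.py --output_path=./output/population_sex_age_time_output"
  ],
  "import_inputs": [
    {"template_mcf": "output/population_sex_age_time_output.tmcf",
     "cleaned_csv": "output/*_output.csv"}
  ]
}
""")


def missing_templates(spec):
    """List expected TMCF paths that no import_inputs entry declares."""
    expected = set()
    for script in spec["scripts"]:
        match = re.search(r"--output_path=(\S+)", script)
        if not match:
            continue
        path = match.group(1)
        if path.startswith("./"):
            path = path[2:]
        # Assumption: the processor writes <output_path>.tmcf.
        expected.add(path + ".tmcf")
    declared = {entry["template_mcf"] for entry in spec["import_inputs"]}
    return sorted(expected - declared)


print(missing_templates(manifest_excerpt))
```

For the excerpt above, the check reports the quarterly template as undeclared. If a single template is meant to cover both CSVs, a short comment in the README would make that explicit.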