461839767_Statistics_Denmark_Demographics #1769
Merged: balit-raibot merged 33 commits into datacommonsorg:master from balit-raibot:denmark_census (Mar 24, 2026)
Commits
83cf3f2 adding denmark data (balit-raibot)
9fc9408 Merge branch 'master' into denmark_census (balit-raibot)
ba12706 standardizing file names (balit-raibot)
0515e4a removing old files (balit-raibot)
af14a8d added two imports (balit-raibot)
ced166d resolving comments (balit-raibot)
3e62f91 ignored age above 105 (niveditasing)
e68d166 ignored age above 105 (niveditasing)
c2835a1 removing stat_vars.mcf file (smarthg-gi)
3cb8ed6 Updating directory structure (smarthg-gi)
5fc020e Updating directory structure (smarthg-gi)
7bac0f6 Adding Readme file (smarthg-gi)
076957b reorg folder structure (balit-raibot)
0467aba added future dates (balit-raibot)
2f333cf updated README (balit-raibot)
4c9b6fc Merge branch 'master' into denmark_census (balit-raibot)
ed1713b Merge branch 'master' into denmark_census (balit-raibot)
16ec59f Merge branch 'master' into denmark_census (balit-raibot)
83c79ef converted manual downloading to automatic (balit-raibot)
b6bd1a6 Merge branch 'denmark_census' of https://github.com/balit-raibot/data… (balit-raibot)
7780a0a adding manifest.json (balit-raibot)
fe2e2d3 adding manifest.json (balit-raibot)
f905144 added a flag to download all data instead of current and previous year (balit-raibot)
4eb4232 increased timeout from 1 hour to 10 hours (balit-raibot)
2d4e5e3 refactored download script to download country level stats (balit-raibot)
b90c413 Merge branch 'master' into denmark_census (balit-raibot)
bb91162 modified manifest resource limits (balit-raibot)
1a276dc Merge branch 'denmark_census' of https://github.com/balit-raibot/data… (balit-raibot)
b4afc92 Merge branch 'master' into denmark_census (balit-raibot)
44dab64 Merge branch 'master' into denmark_census (balit-raibot)
9c0b100 addressing comments to refactor download script (balit-raibot)
6e4e437 Merge branch 'denmark_census' of https://github.com/balit-raibot/data… (balit-raibot)
b89e793 modularising the code (balit-raibot)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| # Statistics Denmark Demographics Dataset | ||
| ## Overview | ||
| This dataset contains demographic statistics for the population of Denmark, sourced from Statistics Denmark. It includes two primary datasets covering quarterly and annual population breakdowns across various dimensions like geography (regions and municipalities), sex, age, and marital status. | ||
|
|
||
| The import covers: | ||
| - **Population (Quarterly):** Population count by region, marital status, age, and sex at the first day of each quarter (Table FOLK1A). | ||
| - **Population (Annual):** Population count by sex and age groups (Table BEFOLK2). | ||
|
|
||
| Type of place: Country | ||
|
|
||
| ## Data Source | ||
| **Source URL:** | ||
| - Main Portal: https://www.statbank.dk/statbank5a/default.asp?w=1396 | ||
| - Specific Table (FOLK1A): https://www.statbank.dk/FOLK1A | ||
|
|
||
| **Provenance Description:** | ||
| The data is provided by Statistics Denmark, the central authority for Danish statistics. The population figures are derived from the Central Person Register (CPR) and reflect the population residing in Denmark on the first day of the period. | ||
|
|
||
| ## How To Download Input Data | ||
| To download the data manually: | ||
| 1. Go to the [StatBank Denmark Portal](https://www.statbank.dk/statbank5a/default.asp?w=1396). | ||
| 2. Browse or search for the desired population tables. For quarterly demographics, search for table **FOLK1A** (Population at the first day of the quarter). | ||
| 3. Select the desired variables: | ||
| - **Region:** All Denmark. | ||
| - **Marital Status:** Total, Never married, Married/separated, Widowed, Divorced. | ||
| - **Age:** Individual ages or age groups. | ||
| - **Sex:** Men, Women. | ||
| - **Time:** Quarters. | ||
| 4. Click "Show table" and then "Download" to save as CSV. | ||
|
|
||
| ## Processing Instructions | ||
| To process the Denmark Demographics data and generate statistical variables, use the following command: | ||
|
|
||
| **For Data Run (Quarterly Run)** | ||
| ```bash | ||
| python ../../tools/statvar_importer/stat_var_processor.py \ | ||
| --input_data='gs://unresolved_mcf/country/denmark/input_files/population_quarterly_region_time_marital_status_input.csv' \ | ||
| --pv_map='population_quarterly_region_time_marital_status_pvmap.csv' \ | ||
| --output_path='population_quarterly_region_time_marital_status_output' \ | ||
| --config_file='denmark_demographics_metadata.csv' \ | ||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | ||
| ``` | ||
| **For Data Run (Annual Run)** | ||
| ```bash | ||
| python ../../tools/statvar_importer/stat_var_processor.py \ | ||
| --input_data='gs://unresolved_mcf/country/denmark/input_files/population_sex_age_time_input.csv' \ | ||
| --pv_map='population_sex_age_time_pvmap.csv' \ | ||
| --output_path='population_sex_age_time_output' \ | ||
| --config_file='denmark_demographics_metadata.csv' \ | ||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | ||
| ``` | ||
|
|
||
| On the first run, this generates the following output files (named after the `--output_path` prefix, shown here as `output`): | ||
| - output.csv | ||
| - output_stat_vars_schema.mcf | ||
| - output_stat_vars.mcf | ||
| - output.tmcf | ||
|
|
||
| ## Data Quality Checks and Validation | ||
| Validation is performed using the Data Commons import tool: | ||
|
|
||
| ```bash | ||
| java -jar datacommons-import-tool-jar-with-dependencies.jar lint \ | ||
| output_stat_vars_schema.mcf \ | ||
| output.csv \ | ||
| output.tmcf \ | ||
| output_stat_vars.mcf | ||
| ``` | ||
|
|
||
| The tool generates a `report.json`, `summary_report.csv`, and `summary_report.html` which can be used to identify errors or warnings in the generated data. |
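Before running the lint step, it can help to confirm that the processor produced the four files the README lists. The following is a minimal sketch and not part of the import: `expected_outputs` and `missing_outputs` are hypothetical helper names, and the suffixes are copied from the README's file list on the assumption that each file name is the `--output_path` prefix plus a fixed suffix.

```python
import os
import tempfile

# Suffixes copied from the README's list of generated files.
# Assumption: every output name is the --output_path prefix plus one of these.
OUTPUT_SUFFIXES = [".csv", "_stat_vars_schema.mcf", "_stat_vars.mcf", ".tmcf"]


def expected_outputs(output_prefix):
    """Return the file paths expected for a given --output_path prefix."""
    return [output_prefix + suffix for suffix in OUTPUT_SUFFIXES]


def missing_outputs(output_prefix):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_outputs(output_prefix) if not os.path.isfile(p)]


if __name__ == "__main__":
    # Demo in a scratch directory: only the CSV exists, so three files are reported.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "output")
        open(prefix + ".csv", "w").close()
        for path in missing_outputs(prefix):
            print("missing:", os.path.basename(path))
```

For the quarterly run, the prefix would be the value passed as `--output_path` in the command above.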
3 changes: 3 additions & 0 deletions
statvar_imports/denmark_demographics/denmark_demographics_metadata.csv
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| parameter,value | ||
| output_columns,"observationDate,value,observationAbout,variableMeasured" | ||
| dc_api_root,https://api.datacommons.org |
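The `output_columns` value is a single quoted CSV field that itself contains commas, so it has to be split a second time after the file is parsed. The sketch below illustrates that with the standard `csv` module; it is an illustration of the file's shape only, not how `stat_var_processor.py` reads its config, and `load_config` is a hypothetical helper.

```python
import csv
from io import StringIO

# Contents of denmark_demographics_metadata.csv as shown in the diff above.
METADATA_CSV = '''parameter,value
output_columns,"observationDate,value,observationAbout,variableMeasured"
dc_api_root,https://api.datacommons.org
'''


def load_config(text):
    """Parse a two-column parameter,value CSV into a dict."""
    reader = csv.DictReader(StringIO(text))
    return {row["parameter"]: row["value"] for row in reader}


config = load_config(METADATA_CSV)
# The quoted field arrives as one string; split it to get the column names.
output_columns = config["output_columns"].split(",")
print(output_columns)
print(config["dc_api_root"])
```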
111 changes: 111 additions & 0 deletions
..._imports/denmark_demographics/download_population_quarterly_region_time_marital_status.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| # Copyright 2026 Google LLC | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the 'License'); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # https://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an 'AS IS' BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """ | ||
| This script downloads quarterly population data from the Statistics Denmark API (Statbank). | ||
| It fetches demographic data including region, sex, age, and marital status, | ||
| processes the JSON-STAT response into a flat structure, and saves it as a CSV. | ||
| """ | ||
|
|
||
| import pandas as pd | ||
| import itertools | ||
| import os | ||
| from absl import logging | ||
| from statbank_utils import find_key_recursive, fetch_statbank_api | ||
|
|
||
| logging.set_verbosity(logging.INFO) | ||
|
|
||
| # --- CONFIGURATION --- | ||
| # url: API endpoint for Statistics Denmark. | ||
| # output_dir: Local path for the generated CSV. | ||
| # table_id: FOLK1A (Population at the first day of the quarter). | ||
| url = "https://api.statbank.dk/v1/data" | ||
| output_dir = "./input_files/" | ||
| table_id = "FOLK1A" | ||
|
|
||
| def main(): | ||
| """ | ||
| Orchestrates the data extraction, transformation, and loading (ETL) process. | ||
|
|
||
| 1. Fetches data via a POST request using fetch_statbank_api. | ||
| 2. Parses the nested JSON-STAT structure into dimensions and values. | ||
| 3. Reconstructs the dataset using a Cartesian product of labels. | ||
| 4. Cleanses and standardizes labels for downstream use. | ||
| """ | ||
| logging.info("--- Starting Statistics Denmark data extraction ---") | ||
|
|
||
| if not os.path.exists(output_dir): | ||
| os.makedirs(output_dir) | ||
| logging.info(f"Created output directory: {output_dir}") | ||
|
|
||
| # Define the API query payload | ||
| # '*' is a wildcard requesting all sub-categories for the given variable. | ||
| payload = { | ||
| "table": table_id, | ||
| "format": "JSONSTAT", | ||
| "lang": "en", | ||
| "variables": [ | ||
| {"code": "OMRÅDE", "values": ["000"]}, | ||
| {"code": "KØN", "values": ["*"]}, | ||
| {"code": "ALDER", "values": ["*"]}, | ||
| {"code": "CIVILSTAND", "values": ["*"]}, | ||
| {"code": "Tid", "values": ["*"]} | ||
| ] | ||
| } | ||
|
|
||
| try: | ||
| response = fetch_statbank_api(url, table_id, payload) | ||
| full_data = response.json() | ||
|
|
||
| dims = find_key_recursive(full_data, 'dimension') | ||
| vals = find_key_recursive(full_data, 'value') | ||
|
|
||
| if dims and vals: | ||
| logging.info("Successfully retrieved dimensions and values. Processing...") | ||
| ids = find_key_recursive(full_data, 'id') or list(dims.keys()) | ||
| role = find_key_recursive(full_data, 'role') or {} | ||
| metric_ids = role.get('metric', []) | ||
|
|
||
| # Extract readable labels for each dimension to prepare for product calculation | ||
| dim_list, col_names = [], [] | ||
| for d_id in ids: | ||
| if d_id in metric_ids or d_id.lower() in ['indhold', 'contents']: | ||
| continue | ||
| labels = dims[d_id]['category']['label'] | ||
| dim_list.append(list(labels.values())) | ||
| col_names.append(d_id) | ||
|
|
||
| # JSON-STAT values are flat; we align them by creating a Cartesian product | ||
| # of all dimension labels. | ||
| df = pd.DataFrame(list(itertools.product(*dim_list)), columns=col_names) | ||
| df['Value'] = vals | ||
|
|
||
| # Standardize Danish column names to English | ||
| df = df.rename(columns={ | ||
| 'OMRÅDE': 'Region', 'ALDER': 'Age', | ||
| 'CIVILSTAND': 'Marital_Status', 'Tid': 'Quarter', 'KØN': 'Sex' | ||
| }) | ||
|
|
||
| df.loc[df['Sex'] == 'Total', 'Sex'] = 'Gender_Total' | ||
| df.loc[df['Marital_Status'] == 'Total', 'Marital_Status'] = 'Marital_Total' | ||
|
|
||
| filename = 'population_quarterly_region_time_marital_status_input.csv' | ||
| save_path = os.path.join(output_dir, filename) | ||
| df.to_csv(save_path, index=False) | ||
| logging.info(f"Execution successful! Saved {len(df)} rows to {save_path}") | ||
| except Exception as e: | ||
| logging.fatal(f"An unexpected error occurred: {e}") | ||
|
|
||
| if __name__ == "__main__": | ||
| main() |
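The script pairs the flat JSON-stat `value` array with a Cartesian product of the dimension labels. A minimal sketch of that alignment on hand-made data follows. The `sample` dict is hypothetical and much smaller than a real API response, `flatten` is an illustrative name, and the length check is an addition that the script itself does not perform.

```python
import itertools
import math

# Hand-made, JSON-stat-like fragment (hypothetical data, not an API response).
sample = {
    "id": ["KØN", "Tid"],
    "dimension": {
        "KØN": {"category": {"label": {"1": "Men", "2": "Women"}}},
        "Tid": {"category": {"label": {"2024K1": "2024Q1",
                                       "2024K2": "2024Q2",
                                       "2024K3": "2024Q3"}}},
    },
    "value": [10, 11, 12, 20, 21, 22],
}


def flatten(dataset):
    """Turn a JSON-stat-like dataset into a list of row dicts."""
    ids = dataset["id"]
    # Assumption: the label mapping is already in category order.
    label_lists = [
        list(dataset["dimension"][d]["category"]["label"].values()) for d in ids
    ]
    values = dataset["value"]
    # The value array is row-major, so its length must equal the product
    # of the dimension sizes; otherwise labels and values would misalign.
    expected = math.prod(len(labels) for labels in label_lists)
    if expected != len(values):
        raise ValueError(f"expected {expected} values, got {len(values)}")
    # itertools.product varies the last dimension fastest, matching row-major order.
    return [
        dict(zip(ids, combo), Value=value)
        for combo, value in zip(itertools.product(*label_lists), values)
    ]


rows = flatten(sample)
for row in rows:
    print(row)
```

The explicit length check is worth considering because assigning a list of the wrong length to a DataFrame column fails with a less specific error, and a response whose metric dimension has more than one category would otherwise be hard to diagnose.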
113 changes: 113 additions & 0 deletions
statvar_imports/denmark_demographics/download_population_sex_age_time.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| # Copyright 2026 Google LLC | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the 'License'); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # https://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an 'AS IS' BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """ | ||
| This script extracts population data by sex, age, and year from the Statistics | ||
| Denmark API using the BULK format. It performs dynamic sorting on demographic | ||
| categories (ensuring 'Total' values appear first) and pivots the data to a | ||
| wide-format time series before saving to CSV. | ||
| """ | ||
|
|
||
| import pandas as pd | ||
| import os | ||
| import re | ||
| from io import StringIO | ||
| from absl import logging | ||
| from statbank_utils import fetch_statbank_api | ||
|
|
||
| # Set logging verbosity | ||
| logging.set_verbosity(logging.INFO) | ||
|
|
||
| # --- CONFIGURATION --- | ||
| url = "https://api.statbank.dk/v1/data" | ||
| output_dir = "./input_files/" | ||
| table_id = "BEFOLK2" | ||
|
|
||
| def get_age_rank(age_str): | ||
| """Assigns a numerical rank to age strings to facilitate correct sorting.""" | ||
| age_str = str(age_str).lower() | ||
| if 'total' in age_str: | ||
| return -1 | ||
| nums = re.findall(r'\d+', age_str) | ||
| return int(nums[0]) if nums else 999 | ||
|
|
||
| def main(): | ||
| """Main orchestration function to fetch, sort, pivot, and save the population data.""" | ||
| logging.info("--- Starting Statistics Denmark BEFOLK2 data extraction ---") | ||
|
|
||
| if not os.path.exists(output_dir): | ||
| os.makedirs(output_dir) | ||
| logging.info(f"Created output directory: {output_dir}") | ||
|
|
||
| payload = { | ||
| "table": table_id, | ||
| "format": "BULK", | ||
| "lang": "en", | ||
| "variables": [ | ||
| {"code": "KØN", "values": ["*"]}, | ||
| {"code": "ALDER", "values": ["*"]}, | ||
| {"code": "Tid", "values": ["*"]} | ||
| ] | ||
| } | ||
|
|
||
| try: | ||
| # Use modularized download logic | ||
| response = fetch_statbank_api(url, table_id, payload) | ||
|
|
||
| if response.status_code == 200: | ||
| # Process the semicolon-separated bulk response | ||
| df = pd.read_csv(StringIO(response.text), sep=';') | ||
| sex_col, age_col, time_col, val_col = df.columns | ||
|
|
||
| # Log the total number of rows received | ||
| logging.info(f"Data received. Processing {len(df)} rows.") | ||
|
|
||
| # 1. DYNAMIC SEX SORTING | ||
| sex_order = sorted(df[sex_col].unique(), key=lambda x: 0 if 'total' in str(x).lower() else 1) | ||
| df[sex_col] = pd.Categorical(df[sex_col], categories=sex_order, ordered=True) | ||
|
|
||
| # 2. DYNAMIC AGE SORTING | ||
| df['age_sort'] = df[age_col].apply(get_age_rank) | ||
|
|
||
| # 3. DYNAMIC YEAR SORTING | ||
| df[time_col] = df[time_col].apply(lambda x: int(re.search(r'\d+', str(x)).group())) | ||
|
|
||
| # Sort and Pivot | ||
| df = df.sort_values([sex_col, 'age_sort', time_col]) | ||
| df_pivot = df.pivot_table( | ||
| index=[sex_col, age_col], | ||
| columns=time_col, | ||
| values=val_col, | ||
| aggfunc='first', | ||
| sort=False | ||
| ).reset_index() | ||
|
|
||
| df_pivot = df_pivot.rename(columns={'ALDER': 'Age', 'KØN': 'Sex'}) | ||
|
|
||
| # --- SAVE --- | ||
| filename = "population_sex_age_time_input.csv" | ||
| save_path = os.path.join(output_dir, filename) | ||
| df_pivot.to_csv(save_path, index=False, encoding='utf-8-sig') | ||
| logging.info(f"File saved successfully: {save_path}") | ||
|
|
||
| else: | ||
| logging.error(f"Request failed with status code: {response.status_code}") | ||
|
|
||
| except Exception as e: | ||
| logging.fatal(f"An unexpected error occurred during processing: {e}") | ||
|
|
||
| logging.info("--- Script execution finished ---") | ||
|
|
||
| if __name__ == "__main__": | ||
| main() |
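To see why `get_age_rank` is needed, compare it with a plain string sort on a few age labels. The function body below is copied from the script; the labels are hypothetical examples chosen for illustration and may not match the exact strings BEFOLK2 returns.

```python
import re


def get_age_rank(age_str):
    """Assigns a numerical rank to age strings to facilitate correct sorting."""
    age_str = str(age_str).lower()
    if 'total' in age_str:
        return -1
    nums = re.findall(r'\d+', age_str)
    return int(nums[0]) if nums else 999


# Hypothetical labels, deliberately out of order.
labels = ["10-14 years", "Unknown", "5-9 years", "Total",
          "100 years and over", "0-4 years"]

# Lexical sort compares character by character, so "10-14" lands before "5-9".
lexical = sorted(labels)
# Ranked sort puts "Total" first, numeric ages in order, unparseable labels last.
ranked = sorted(labels, key=get_age_rank)

print(lexical)
print(ranked)
```

Labels without any digits fall back to rank 999, so they sort after every numeric age group rather than raising an error.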
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| { | ||
| "import_specifications": [ | ||
| { | ||
| "import_name": "Denmark_Demographics", | ||
| "curator_emails": [ | ||
| "support@datacommons.org" | ||
| ], | ||
| "provenance_url": "https://www.statbank.dk/statbank5a/default.asp?w=1280", | ||
| "provenance_description": "Population data for Denmark from Statbank", | ||
| "scripts": [ | ||
| "download_population_quarterly_region_time_marital_status.py", | ||
| "download_population_sex_age_time.py", | ||
| "../../tools/statvar_importer/stat_var_processor.py --input_data=./input_files/population_quarterly_region_time_marital_status_input.csv --pv_map=./population_quarterly_region_time_marital_status_pvmap.csv --config_file=./denmark_demographics_metadata.csv --output_path=./output/population_quarterly_region_time_marital_status_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf", | ||
| "../../tools/statvar_importer/stat_var_processor.py --input_data=./input_files/population_sex_age_time_input.csv --pv_map=./population_sex_age_time_pvmap.csv --config_file=./denmark_demographics_metadata.csv --output_path=./output/population_sex_age_time_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf" | ||
| ], | ||
| "import_inputs": [ | ||
| { | ||
| "template_mcf": "output/population_sex_age_time_output.tmcf", | ||
| "cleaned_csv": "output/*_output.csv" | ||
| } | ||
| ], | ||
| "source_files": [ | ||
| "./input_files/*.csv" | ||
| ], | ||
| "user_script_timeout": 36000, | ||
| "cron_schedule": "0 10 20 2,5,8,11 *" | ||
| } | ||
| ] | ||
| } |
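The two scheduling values in the manifest are easier to review when decoded. The sketch below assumes the scheduler reads `cron_schedule` as a standard five-field cron expression and `user_script_timeout` as seconds; the latter is consistent with the commit "increased timeout from 1 hour to 10 hours", but neither is confirmed by the diff. `split_cron` is a hypothetical helper.

```python
# Values copied from the manifest above.
MANIFEST_FRAGMENT = {
    "user_script_timeout": 36000,
    "cron_schedule": "0 10 20 2,5,8,11 *",
}

# Field order of a standard five-field cron expression.
FIELD_NAMES = ["minute", "hour", "day_of_month", "month", "day_of_week"]


def split_cron(expr):
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}")
    return dict(zip(FIELD_NAMES, parts))


fields = split_cron(MANIFEST_FRAGMENT["cron_schedule"])
months = [int(m) for m in fields["month"].split(",")]
# Assumption: the timeout is expressed in seconds.
timeout_hours = MANIFEST_FRAGMENT["user_script_timeout"] / 3600

print(fields)
print(months)
print(timeout_hours)
```

Read this way, the import would run at 10:00 on the 20th of February, May, August and November, in whatever time zone the scheduler uses.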