Conversation
Signed-off-by: Samk <sampurnapyne1710@gmail.com>
I have made the changes as per the suggestions. If any other changes are required, do let me know. Looking forward to your feedback!
@TG1999 @keshav-space |
keshav-space left a comment:
Thanks @Samk1710, see comments below.
Hi @keshav-space
keshav-space left a comment:
Thanks @Samk1710, see the comments below. Also, please explain why you need to break the request into individual months in backfill_from_year.
Thanks @keshav-space for the review and clarifications. I have implemented all the suggestions.
@Samk1710 here is the response to #1 (comment)
This does not explain the need for monthly iteration. In every EUVD advisory we get
Again, this is a lot of words that answers absolutely nothing. No, this does not make the logic simpler: you are breaking the response into months, but you are still using the API
@keshav-space Regarding datePublished: I went with fetching by dateUpdated because advisories can be updated even after publication. For example, the advisory above was published in 2009 and updated in 2017, so if we fetched by datePublished we would miss the 2017 update. Also, if we keep fetching by dateUpdated but organize by datePublished, we have to keep touching backfill data during daily syncs, and it adds an extra layer of parsing. Also, for fetching we can't do
Hence we can either do a yearly backfill or a monthly backfill, as both would be restartable and need a similar number of API calls. I chose monthly because, as I already mentioned, that is how most data sources I looked at were structured. If you have further questions or want changes to the code implementation, kindly let me know.
Exactly, the API response is already paginated, so there is no need to do double pagination like chunking the response by months, years, or anything else.
Use
If the concern is that the pipeline may fail due to a network call, then backfill and daily modes are not solving that problem: it can fail even during a daily run if we get a huge number of updates. Moreover, there is no built-in mechanism to automatically start from the last failure in the next run. Get rid of the backfill and daily modes. Instead, we should have something simpler, like this:

```python
class EUVDAdvisoryMirror(BasePipeline):
    url = "https://euvdservices.enisa.europa.eu/api/search"

    @classmethod
    def steps(cls):
        return (
            cls.load_checkpoint,
            cls.create_session,
            cls.collect_new_advisory,
            cls.save_checkpoint,
        )
```
This way, on the first run the pipeline will automatically collect all advisories, and subsequent runs will collect only incremental updates. In case of a pipeline failure, the next run will always collect data since the last successful run.
Thanks @keshav-space for the suggestions. I will update the implementation shortly.
Hey @keshav-space |
keshav-space left a comment:
Thanks @Samk1710, see some comments below.
sync_catalog.py
```python
euvd_id = advisory.get("id")
if not euvd_id:
    self.log(f"Advisory missing id, skipping: {advisory}")
    return

date_published = advisory.get("datePublished", "")
dir_path = self.advisory_dir(date_published)

if dir_path is None:
    dir_path = ADVISORY_PATH / "unpublished"
    self.has_unpublished = True

dir_path.mkdir(parents=True, exist_ok=True)

# If an existing unpublished advisory is published now, remove the stale advisory from unpublished directory.
if self.has_unpublished and dir_path != ADVISORY_PATH / "unpublished":
    stale_advisory = ADVISORY_PATH / "unpublished" / f"{euvd_id}.json"
    if stale_advisory.exists():
        stale_advisory.unlink()

# If old advisory is updated, the new data overwrites the existing file.
with (dir_path / f"{euvd_id}.json").open("w", encoding="utf-8") as f:
    json.dump(advisory, f, indent=2)
```
Let's keep this simple, and use python-dateutil to parse the date.
Suggested change (replacing the block above):

```python
destination = "unpublished"
euvd_id = advisory["id"]
if published := advisory.get("datePublished"):
    date = parse(published)
    destination = f"{date.year}/{date.month:02d}"

path = ADVISORIES_PATH / f"{destination}/{euvd_id}.json"
path.parent.mkdir(parents=True, exist_ok=True)
with open(path, "w", encoding="utf-8") as f:
    json.dump(advisory, f, indent=2)
```
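For reference, dateutil's `parse` handles ISO-8601 timestamps without an explicit format string, which is what makes the suggestion above so compact. A small illustration (the sample timestamp is made up):

```python
from dateutil.parser import parse

# Parse an ISO-8601 timestamp such as one found in a datePublished field.
date = parse("2017-03-14T09:30:00Z")

# Build the year/month destination directory, zero-padding the month.
destination = f"{date.year}/{date.month:02d}"
print(destination)  # → 2017/03
```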
Co-authored-by: Keshav Priyadarshi <git@keshav.space> Signed-off-by: Samk <sampurnapyne1710@gmail.com>
Hey @keshav-space, here is a pipeline log excerpt:

```
python sync_catalog.py
2026-03-11 19:09:26.673 Pipeline [EUVDAdvisoryMirror] starting
2026-03-11 19:09:26.673 Step [load_checkpoint] starting
2026-03-11 19:09:26.673 Step [load_checkpoint] completed in 0 seconds
2026-03-11 19:09:26.673 Step [create_session] starting
2026-03-11 19:09:26.673 Step [create_session] completed in 0 seconds
2026-03-11 19:09:26.674 Step [collect_new_advisory] starting
2026-03-11 19:09:27.570 Collecting 337107 advisories across 3372 pages
2026-03-11 19:18:46.685 Progress: 10% (338/3372) ETA: 5032 seconds (1.4 hours)
2026-03-11 19:32:45.135 Progress: 20% (675/3372) ETA: 5590 seconds (1.6 hours)
2026-03-11 19:49:03.171 Progress: 30% (1012/3372) ETA: 5543 seconds (1.5 hours)
2026-03-11 20:07:26.941 Progress: 40% (1349/3372) ETA: 5219 seconds (1.4 hours)
2026-03-11 20:29:15.445 Progress: 50% (1686/3372) ETA: 4788 seconds (1.3 hours)
```
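ETA figures like those in the log can be derived from elapsed time and the completed-page count. A minimal sketch; the function name and the assumption of a roughly constant per-page rate are mine, not necessarily how the pipeline computes it:

```python
def eta_seconds(elapsed: float, done: int, total: int) -> int:
    """Estimate remaining seconds, assuming pages take roughly constant time."""
    return round(elapsed * (total - done) / done)


# If the first 338 of 3372 pages took about 560 seconds,
# roughly 5000 seconds remain -- close to the 5032s in the log above.
print(eta_seconds(560, 338, 3372))
```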
EUVD Mirror Pipeline
The script has two modes:
As of now the PR focuses on the code; once it is reviewed and approved, I can add the locally collected backfill data.