From 3701812cb3ce041e93913a6863329c15ebf8844a Mon Sep 17 00:00:00 2001
From: Luca Foppiano
Date: Mon, 16 Feb 2026 17:24:39 +0100
Subject: [PATCH 1/3] fix: remove awk scripts and aws instructions and replace
 them with the cc-downloader

---
 README.md | 49 +++++++++++++++++++------------------------------
 1 file changed, 19 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index 84a8395..a546e91 100644
--- a/README.md
+++ b/README.md
@@ -795,46 +795,35 @@ In case you want to run many of these queries, and you have a lot of disk space,
 > [!IMPORTANT]
 > If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
 
-To download the crawl index, there are two options: if you have access to the CCF AWS buckets, run:
+To download the crawl index, please use cc-downloader, which is a polite downloader for Common Crawl data:
 
 ```shell
-mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
-aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
+cargo install cc-downloader
 ```
 
-If, by any other chance, you don't have access through the AWS CLI:
+cc-downloader will not be set up on your path by default, but you can run it by prepending the right path.
 
 ```shell
-mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
-cd 'crawl=CC-MAIN-2024-22/subset=warc'
-
-wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
-gunzip cc-index-table.paths.gz
-
-grep 'subset=warc' cc-index-table.paths | \
-  awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
-  xargs -n 2 -P 10 sh -c '
-    echo "Downloading: $2"
-    mkdir -p "$(dirname "$2")" &&
-    wget -O "$2" "$1"
-  ' _
-
-rm cc-index-table.paths
-cd -
+mkdir crawl
+~/.cargo/bin/cc-downloader download-paths CC-MAIN-2024-22 cc-index-table crawl
+~/.cargo/bin/cc-downloader download crawl/cc-index-table.paths.gz --progress crawl
 ```
 
-In both ways, the file structure should be something like this:
+The resulting file structure should look something like this:
 
 ```shell
-tree my_data
-my_data
-└── crawl=CC-MAIN-2024-22
-    └── subset=warc
-        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-```
-
-Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
+tree crawl/
+crawl/
+├── cc-index
+│   └── table
+│       └── cc-main
+│           └── warc
+│               └── crawl=CC-MAIN-2024-22
+│                   └── subset=warc
+│                       ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+│                       ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+```
+
+Then, you can run `make duck_local_files LOCAL_DIR=crawl` to run the same query as above, but this time using your local copy of the index files.
 
 Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).
From ba2bd13a0a489e4cc0559c56ade12e9db09506cd Mon Sep 17 00:00:00 2001
From: Luca Foppiano
Date: Mon, 16 Feb 2026 17:47:39 +0100
Subject: [PATCH 2/3] doc: add the link to the github repo of the cc-downloader

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a546e91..f4e4eb5 100644
--- a/README.md
+++ b/README.md
@@ -795,7 +795,7 @@ In case you want to run many of these queries, and you have a lot of disk space,
 > [!IMPORTANT]
 > If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
 
-To download the crawl index, please use cc-downloader, which is a polite downloader for Common Crawl data:
+To download the crawl index, please use [cc-downloader](https://github.com/commoncrawl/cc-downloader), which is a polite downloader for Common Crawl data:
 
 ```shell
 cargo install cc-downloader

From 718c2f82ff42bfc8638cbccc53c3ae76c236c12c Mon Sep 17 00:00:00 2001
From: Luca Foppiano
Date: Tue, 17 Feb 2026 19:23:29 +0100
Subject: [PATCH 3/3] fix: add info in case cargo is not available

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f4e4eb5..bd8498e 100644
--- a/README.md
+++ b/README.md
@@ -801,7 +801,8 @@ To download the crawl index, please use [cc-downloader](https://github.com/commo
 cargo install cc-downloader
 ```
 
-cc-downloader will not be set up on your path by default, but you can run it by prepending the right path.
+`cc-downloader` will not be set up on your path by default, but you can run it by prepending the right path.
+If cargo is not available or the installation fails, please check [the official cc-downloader repository](https://github.com/commoncrawl/cc-downloader).
 
 ```shell
 mkdir crawl