Add git-annex support for upload with auto-detection#1797
Add git-annex support for upload with auto-detection#1797
Conversation
- Add DataladMode enum with yes/no/auto options - Add --datalad CLI option to upload command - Add utility functions to detect git-annex repos and register URLs - Add backend checking to warn if not SHA256E - Integrate datalad URL registration after successful upload Co-authored-by: yarikoptic <39889+yarikoptic@users.noreply.github.com>
- Add test_datalad_mode_enum to verify enum values - Add test_is_git_annex_repo for repository detection - Add test_get_file_annex_backend for backend checking - Register ai_generated marker in tox.ini - Tests skip gracefully when git-annex is not available Co-authored-by: yarikoptic <39889+yarikoptic@users.noreply.github.com>
- Fix dandiset.path type error (use path_obj instead) - Fix line length issue in debug message - All mypy and flake8 checks pass Co-authored-by: yarikoptic <39889+yarikoptic@users.noreply.github.com>
- Improve _get_file_annex_backend to verify annex directory structure - Add detailed comment about contentUrl array usage - Use tuple for DataladMode choices in CLI - Keep __str__ method as it's needed for proper string conversion Co-authored-by: yarikoptic <39889+yarikoptic@users.noreply.github.com>
- Change from list to tuple comprehension to avoid creating new list on each import - More efficient and follows best practices Co-authored-by: yarikoptic <39889+yarikoptic@users.noreply.github.com>
dandi/upload.py
Outdated
| # Find git repository root | ||
| result = subprocess.run( | ||
| ["git", "rev-parse", "--show-toplevel"], | ||
| cwd=path if path.is_dir() else path.parent, | ||
| capture_output=True, | ||
| text=True, | ||
| check=False, | ||
| ) | ||
| if result.returncode == 0: | ||
| repo_root = Path(result.stdout.strip()) | ||
| annex_dir = repo_root / ".git" / "annex" | ||
| return annex_dir.exists() |
There was a problem hiding this comment.
better check that there is git-annex branch and git config annex.uuid is defined. A folder could be left artifact
There was a problem hiding this comment.
Done in fbb8785. Now checking for git-annex branch and annex.uuid config to properly validate git-annex initialization.
dandi/cli/cmd_upload.py
Outdated
| show_default=True, | ||
| ) | ||
| @click.option( | ||
| "--datalad", |
There was a problem hiding this comment.
seems we are not doing anything "datalad" specific and operate directly with git-annex, let's call it --git-annex (and GitAnnexMode) and do 'auto' by default.
There was a problem hiding this comment.
Done in fbb8785. Renamed to --git-annex with GitAnnexMode enum and changed default to auto.
dandi/tests/test_upload.py
Outdated
| """Test detection of git-annex repositories.""" | ||
| import subprocess | ||
|
|
||
| from ..upload import _is_git_annex_repo |
There was a problem hiding this comment.
imports should be at the top of the module sorted (pre-commit)
There was a problem hiding this comment.
Done in fbb8785. Moved subprocess import to top of test module.
dandi/tests/test_upload.py
Outdated
| ["git", "annex", "init"], cwd=tmp_path, check=True, capture_output=True | ||
| ) | ||
|
|
||
| # Create a test file |
There was a problem hiding this comment.
fold the tests following after this line into test_is_git_annex_repo and rename it like test_git_annex_repo_ops
There was a problem hiding this comment.
Done in fbb8785. Consolidated into single test_git_annex_repo_ops that tests enum, detection, and backend operations.
| return None | ||
|
|
||
| # Find the part that looks like an annex key (contains backend and size) | ||
| for part in target_path.parts: |
There was a problem hiding this comment.
Done in fbb8785. Now using Path(target).name to extract just the filename.
dandi/upload.py
Outdated
| ) | ||
|
|
||
| if result.returncode == 0: | ||
| lgr.info("Successfully registered %s with git-annex", filepath) |
There was a problem hiding this comment.
| lgr.info("Successfully registered %s with git-annex", filepath) | |
| lgr.info("Successfully registered url %s to %s with git-annex", remote_url, filepath) |
There was a problem hiding this comment.
Done in fbb8785. Updated log message to include both URL and filepath.
dandi/upload.py
Outdated
| return True | ||
| else: | ||
| lgr.warning( | ||
| "Failed to register %s with git-annex: %s", |
There was a problem hiding this comment.
| "Failed to register %s with git-annex: %s", | |
| "Failed to register url %s for %s with git-annex: %s", url, |
There was a problem hiding this comment.
Done in fbb8785. Updated log message to include URL in the failure message.
| except Exception as e: | ||
| lgr.warning( | ||
| "%s: Error during datalad integration: %s", strpath, e | ||
| ) |
There was a problem hiding this comment.
remove swallow + warn -- let it raise. If doesn't work, user should disable use of git-annex. We might want to hint to use --git-annex no or whatever that option would be.
There was a problem hiding this comment.
Done in fbb8785. Removed try/except wrapper - errors now raise with helpful hint message to use --git-annex no.
| "%s: No contentUrl found in asset metadata; " | ||
| "skipping datalad registration", | ||
| strpath, | ||
| ) |
There was a problem hiding this comment.
it might take time to get URL registered in backend. So we would need to loop for e.g. up to 20 seconds, reasking each 2 and then failing if none is provided after 20 seconds, and exiting the loop if we successfully registered the URL.
note that there could be multiple URLs, review code in https://github.com/dandi/backups2datalad on dealing with public vs private / embargoed dandisets. For embargoed we need to ensure enabling datalad special remote and associate api based URLs to facilitate authentication. If public -- then public S3 urls.
There was a problem hiding this comment.
Done in fbb8785. Added polling loop (20s timeout, 2s intervals) for contentUrl. Implemented S3 URL preference for public dandisets. Note: embargoed dandisets with datalad special remote authentication would need additional work - for now this handles the public case which is the primary use case.
tox.ini
Outdated
| [pytest] | ||
| addopts = --tb=short --durations=10 --timeout=300 | ||
| markers = | ||
| ai_generated: marks tests as AI-generated (deselect with '-m "not ai_generated"') |
There was a problem hiding this comment.
drop this change -- we have it defined already in
dandi/pytest_plugin.py: "ai_generated",
There was a problem hiding this comment.
Done in fbb8785. Removed from tox.ini as it's already defined in dandi/pytest_plugin.py.
|
also needs a unittest, may be in an extended run of one of CI runs where git-annex does get installed and we do try to upload a file from within git-annex repo (embargoed and not) and verify that URL gets assigned. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1797 +/- ##
===========================================
- Coverage 75.10% 50.18% -24.92%
===========================================
Files 84 84
Lines 11921 12030 +109
===========================================
- Hits 8953 6037 -2916
- Misses 2968 5993 +3025
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Rename DataladMode to GitAnnexMode and --datalad to --git-annex - Change default from 'no' to 'auto' - Improve git-annex repo detection (check for git-annex branch and annex.uuid) - Use filename for backend extraction - Improve log messages with URL details - Add polling loop for contentUrl (20s timeout with 2s intervals) - Select appropriate URL (prefer S3 for public access) - Remove exception swallowing - let errors raise with helpful hints - Consolidate tests into test_git_annex_repo_ops - Move subprocess import to top of test module - Remove ai_generated marker from tox.ini (already in pytest_plugin.py) Co-authored-by: yarikoptic <39889+yarikoptic@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Enables linking local uploaded files with their remote URLs via git-annex, and validates that annexed files use the SHA256E backend standard.
Changes
CLI
--git-annex [yes|no|auto]option (default:auto)auto: detects git-annex repository (checks for git-annex branch and annex.uuid config) and enables if found (default)yes: always registers URLs with git-annex post-uploadno: disables git-annex integrationCore Implementation
contentUrl(up to 20s with 2s intervals) from asset metadata and callsgit annex addurl --relaxed --file <path> <url>--git-annex no_is_git_annex_repo(): checks for git-annex branch and annex.uuid config_get_file_annex_backend(): parses symlink target filename for backend key (validatesannex/objectspath structure)_register_url_with_annex(): spawns git-annex subprocessExample
Tests
test_git_annex_repo_opscovers enum, detection, and backend parsingOriginal prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.