Fix: ArXiv API Migration - OAI-PMH Implementation #243
Open
Opsmithe wants to merge 41 commits into creativecommons:main from Opsmithe:arxiv-minimal-fix
+345
−302
- Updated base URL from v3 to v4 API endpoint
- Added publisher information extraction (name and country)
- Added article sampling functionality for license analysis
- Enhanced CSV output with new publisher and article count files
- Improved error handling and logging for v4 API structure
- Updated provenance tracking to include API version
- Maintained backward compatibility with existing data structure
Benefits of v4 migration:
- Access to richer metadata including publisher details
- Better structured response format with pagination info
- Enhanced license information extraction capabilities
- Improved data quality for commons quantification analysis

…nformation
- Generated doaj_6_count_by_publisher.csv with publisher name and country data
- Added doaj_5_article_count.csv for article sampling statistics
- Updated provenance.yaml to track API v4 usage and enhanced data collection
- Publisher data includes institutions from IR, PL, CL, GB, RU, BR, ID countries
- Article sampling demonstrates new capability to analyze article-level data
- All existing data files (count, subject, language, year) maintained compatibility
Test run processed 10 journals and 1 article sample successfully.

- Extract detailed license flags (BY, NC, ND, SA) from DOAJ v4 API response
- Add doaj_7_license_details.csv to capture license component breakdown
- Enhanced extract_license_type() to return both license type and detailed components
- Updated data processing pipeline to handle granular license information
- Added license URL tracking for verification and compliance analysis
New capabilities:
- Identify specific Creative Commons license components used by journals
- Track license URLs for direct reference to legal terms
- Enable analysis of license component combinations and trends
- Support more precise commons quantification based on usage restrictions
Test data shows successful extraction of BY, NC, SA flags and license URLs.

- Document complete migration process from v3 to v4 API
- Detail all enhanced data collection capabilities
- Provide technical implementation overview
- Include validation results and test data analysis
- Document new CSV file schemas and data structures
- Outline future enhancement opportunities
- Reference all related commits for audit trail
Key documentation sections:
- API endpoint changes and migration rationale
- Enhanced license component analysis capabilities
- Publisher and geographic data collection
- Article processing implementation
- Data quality improvements and validation
- Performance optimizations and error handling
- Impact on commons quantification research

… integration
- Remove boolean license component extraction (BY, NC, ND, SA flags)
- Remove doaj_7_license_details.csv file generation
- Simplify extract_license_type() to return only license type string
- Remove license_details_counts processing from data pipeline
- Maintain focus on meaningful license type classification
Rationale: License type string (e.g., 'CC BY-NC') already contains all necessary information. Boolean flags add complexity without providing additional analytical value for commons quantification purposes.

- Remove doaj_fetch.py script (moved to feature/doaj branch)
- Remove all DOAJ data files (moved to feature/doaj branch)
- Remove DOAJ_V4_MIGRATION.md documentation (moved to feature/doaj branch)
This branch now focuses exclusively on ArXiv-related improvements. All DOAJ v4 migration work has been moved to dedicated feature/doaj branch.
Author
@TimidRobot I hope these fixes help address the issues with #236. I'd be happy to get your review of the script and make changes where necessary.
…iptive identifiers
…iptive identifiers
5920576 to
5739ad3
TimidRobot
reviewed
Nov 14, 2025
Fixes
Problem
The current arxiv_fetch.py script relies on ArXiv's Atom API, which provides unreliable license information through two problematic fields:
- <rights> field: Does not exist in the API response schema according to the ArXiv API documentation
- <summary> field: Uses text pattern matching, which incorrectly identifies papers that discuss CC licenses rather than papers that are actually CC-licensed
Impact
Example Case
Query [ALL:]("CC BY") returns paper [2008.00774v3] "Elsevier OA CC-By Corpus", which discusses CC BY works but is actually licensed under "arXiv.org - Non-exclusive license to distribute", not CC BY.
Proposed Solution
Migrate arxiv_fetch.py to use ArXiv's OAI-PMH API (https://oaipmh.arxiv.org/oai), which provides:
- <license> elements in the arXiv namespace
Query Strategy Implemented
API Endpoint Migration
License Extraction Method
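A minimal sketch of reading the license from OAI-PMH records, assuming the response uses the arXiv metadata format, where each record carries a <license> element in the http://arxiv.org/OAI/arXiv/ namespace. The function name extract_licenses and the sample XML are hypothetical illustrations, not code from arxiv_fetch.py. The second record ID is a placeholder.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespaces used by OAI-PMH responses in the arXiv metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}

# Hand-written sample for illustration only (not a captured API response).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>2008.00774</id>
          <license>http://arxiv.org/licenses/nonexclusive-distrib/1.0/</license>
        </arXiv>
      </metadata>
    </record>
    <record>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.00001</id>
          <license>http://creativecommons.org/licenses/by/4.0/</license>
        </arXiv>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_licenses(xml_text):
    """Return (paper_id, license_url) pairs; license_url is None if absent."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        paper_id = record.findtext(".//arxiv:id", default=None, namespaces=NS)
        license_url = record.findtext(
            ".//arxiv:license", default=None, namespaces=NS
        )
        results.append((paper_id, license_url.strip() if license_url else None))
    return results


records = extract_licenses(SAMPLE_RESPONSE)
# Tally papers per license URL instead of pattern-matching abstract text.
license_counts = Counter(url for _, url in records)
```

Because the license comes from a structured field rather than from the abstract, a paper that only discusses CC BY (such as 2008.00774 in the example case) is counted under its actual license.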
Request Parameters
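A minimal sketch of how a ListRecords request to this endpoint might be parameterized, assuming only the standard OAI-PMH arguments (verb, metadataPrefix, from, resumptionToken). The helper name build_list_records_url and the example date are hypothetical. The parameters the script actually sends may differ.

```python
from urllib.parse import urlencode

OAI_BASE_URL = "https://oaipmh.arxiv.org/oai"


def build_list_records_url(from_date=None, resumption_token=None):
    """Build a ListRecords URL for the arXiv OAI-PMH endpoint."""
    if resumption_token:
        # Per the OAI-PMH protocol, resumptionToken is an exclusive
        # argument: it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # The arXiv metadata format is assumed here because it is the
        # one that carries the <license> element.
        params = {"verb": "ListRecords", "metadataPrefix": "arXiv"}
        if from_date:
            params["from"] = from_date  # selective harvesting, YYYY-MM-DD
    return f"{OAI_BASE_URL}?{urlencode(params)}"


first_page_url = build_list_records_url(from_date="2025-01-01")
next_page_url = build_list_records_url(resumption_token="abc")
```

A harvesting loop would request the first URL, then keep requesting with the resumptionToken returned in each response until the server returns an empty token.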
Implementation Details
New Features
Data Structure Changes
Performance Improvements
API Requirements
https://oaipmh.arxiv.org/oai
Checklist
- … Update index.md).
- … main or master).
- … visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."