# Apache Spark

This file provides context and guidelines for AI coding assistants working with the Apache Spark codebase.

## Before Making Changes

Before the first edit in a session, ensure a clean working environment. DO NOT skip these checks:

1. Run `git remote -v` to identify the personal fork and upstream (`apache/spark`). If unclear, ask the user to configure their remotes following the standard convention (`origin` for the fork, `upstream` for `apache/spark`).
2. If the latest commit on `<upstream>/master` is more than a day old (check with `git log -1 --format="%ci" <upstream>/master`), run `git fetch <upstream> master`.
3. If the current branch has uncommitted changes (check with `git status`) or commits not in upstream `master` (check with `git log <upstream>/master..HEAD`), ask the user to pick one:
- Create a new git worktree from `<upstream>/master` (recommended) and work from there.
- For uncommitted changes: stash them. For unmerged commits: create and switch to a new branch from `<upstream>/master`.
4. Otherwise, proceed on the current branch.
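The staleness check in step 2 can be sketched as a small helper. Using `%ct` (unix epoch seconds) instead of `%ci` makes the date arithmetic trivial; this is an illustrative sketch, not a script in the repo, and the `upstream` remote name is a placeholder:

```shell
# Illustrative helper: decide whether a branch tip is stale (older than one day).
# Pass it the output of: git log -1 --format=%ct <upstream>/master
is_stale() {
  commit_epoch="$1"
  now=$(date +%s)
  # 86400 seconds = 1 day
  if [ $(( now - commit_epoch )) -gt 86400 ]; then
    echo stale
  else
    echo fresh
  fi
}

# Falls back to epoch 0 (reported as stale) when the ref cannot be resolved.
is_stale "$(git log -1 --format=%ct upstream/master 2>/dev/null || echo 0)"
```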

## Development Notes

SQL golden file tests are managed by `SQLQueryTestSuite` and its variants. Read the class documentation before running or updating these tests. DO NOT edit the generated golden files (`.sql.out`) directly. Always regenerate them when needed, and carefully review the diff to make sure it's expected.
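Golden files are typically regenerated by re-running the suite with an environment variable set; the invocation usually looks like the following, but verify the exact variable and suite name against the `SQLQueryTestSuite` class documentation first:

```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```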

Spark Connect protocol is defined in proto files under `sql/connect/common/src/main/protobuf/`. Read the README there before modifying proto definitions.
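If the Python client stubs need regenerating after a proto change, recent Spark versions ship a generator script for this (confirm the path and usage against the README before running it):

```
dev/connect-gen-protos.sh
```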

## Build and Test

Builds and tests can take a long time. Before running tests, ask the user whether they have more changes to make.

Prefer SBT over Maven for faster incremental compilation. Module names are defined in `project/SparkBuild.scala`.

Compile a single module:

```
build/sbt <module>/compile
```

Compile test code for a single module:

```
build/sbt <module>/Test/compile
```

Run test suites by wildcard or full class name:

```
build/sbt '<module>/testOnly *MySuite'
build/sbt '<module>/testOnly org.apache.spark.sql.MySuite'
```

Run test cases matching a substring:

```
build/sbt '<module>/testOnly *MySuite -- -z "test name"'
```

For faster iteration, keep SBT open in interactive mode:

```
build/sbt
> project <module>
> testOnly *MySuite
```

### PySpark Tests
PySpark tests require building Spark with Hive support first:

```
build/sbt -Phive package
```

Activate the virtual environment specified by the user, or default to `.venv`:

```
source <venv>/bin/activate
```

If the default venv does not exist, create it:

```
python3 -m venv .venv
source .venv/bin/activate
pip install -r dev/requirements.txt
```

Run a single test suite:

```
python/run-tests --testnames pyspark.sql.tests.arrow.test_arrow
```

Run a single test case:

```
python/run-tests --testnames "pyspark.sql.tests.test_catalog CatalogTests.test_current_database"
```
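New test cases follow the standard `unittest` layout. The sketch below is a standalone illustration only: real PySpark suites usually extend base classes from `pyspark.testing` and use a live SparkSession, and the class and method names here are placeholders, not real Spark tests.

```python
import unittest


class CatalogTests(unittest.TestCase):
    def test_current_database(self):
        # Placeholder assertion; a real test would query spark.catalog.
        self.assertEqual("default", "default")


if __name__ == "__main__":
    unittest.main()
```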

## Pull Request Workflow

The PR title requires a JIRA ticket ID (e.g., `[SPARK-xxxx][SQL] Title`). If no ticket is given, ask the user to provide an existing one or create a new ticket. Follow the template in `.github/PULL_REQUEST_TEMPLATE` for the PR description.
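A quick way to sanity-check a title against this convention (an illustrative helper, not part of the repo; the pattern only covers the common single-ticket form):

```shell
# Illustrative check that a PR title starts with a JIRA ticket ID.
check_title() {
  if printf '%s' "$1" | grep -qE '^\[SPARK-[0-9]+\](\[[A-Z]+\])* '; then
    echo ok
  else
    echo "missing JIRA ticket ID"
  fi
}

check_title "[SPARK-12345][SQL] Fix predicate pushdown"  # prints "ok"
```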

DO NOT push to the upstream repo. Always push to the personal fork. Open PRs against `master` on the upstream repo.

DO NOT force push or use `--amend` on pushed commits unless the user explicitly asks. If the remote branch has new commits, fetch and rebase before pushing.

Always get user approval before external operations such as pushing commits, creating PRs, or posting comments. Use `gh pr create` to open PRs. If `gh` is not installed, generate the GitHub PR URL for the user and recommend installing the GitHub CLI.
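When falling back to the manual URL, the compare link can be constructed like this (the fork user and branch names are placeholders):

```shell
# Illustrative helper: print the GitHub compare URL for opening a PR against
# apache/spark master from a personal fork.
pr_url() {
  fork_user="$1"
  branch="$2"
  echo "https://github.com/apache/spark/compare/master...${fork_user}:${branch}?expand=1"
}

pr_url alice my-feature
# prints "https://github.com/apache/spark/compare/master...alice:my-feature?expand=1"
```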