diff --git a/.agent/skills/beam-concepts/SKILL.md b/.agent/skills/beam-concepts/SKILL.md index da3dd9fbf319..ef459611f9ab 100644 --- a/.agent/skills/beam-concepts/SKILL.md +++ b/.agent/skills/beam-concepts/SKILL.md @@ -17,18 +17,14 @@ # under the License. name: beam-concepts -description: Explains core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, and Runners. Use when learning Beam fundamentals or explaining pipeline concepts. +description: Explains, demonstrates, and troubleshoots core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, Runners, windowing, and triggers. Use when learning Beam fundamentals, writing transforms, debugging pipeline errors, or comparing runner options. --- # Apache Beam Core Concepts -## The Beam Model -Evolved from Google's MapReduce, FlumeJava, and Millwheel projects. Originally called the "Dataflow Model." - ## Key Abstractions ### Pipeline -A Pipeline encapsulates the entire data processing task, including reading, transforming, and writing data. ```java // Java @@ -48,17 +44,9 @@ with beam.Pipeline(options=options) as p: ``` ### PCollection -A distributed dataset that can be bounded (batch) or unbounded (streaming). - -#### Properties -- **Immutable** - Once created, cannot be modified -- **Distributed** - Elements processed in parallel -- **May be bounded or unbounded** -- **Timestamped** - Each element has an event timestamp -- **Windowed** - Elements assigned to windows +Distributed dataset — bounded (batch) or unbounded (streaming). Immutable, timestamped, windowed. ### PTransform -A data processing operation that transforms PCollections. ```java // Java @@ -73,7 +61,6 @@ output = input | 'Name' >> beam.ParDo(MyDoFn()) ## Core Transforms ### ParDo -General-purpose parallel processing. ```java // Java @@ -97,18 +84,13 @@ input | beam.Map(len) ``` ### GroupByKey -Groups elements by key. 
```java PCollection<KV<String, Integer>> input = ...; PCollection<KV<String, Iterable<Integer>>> grouped = input.apply(GroupByKey.create()); ``` -### CoGroupByKey -Joins multiple PCollections by key. - -### Combine -Combines elements (sum, mean, etc.). +### CoGroupByKey / Combine ```java // Global combine @@ -118,24 +100,16 @@ input.apply(Combine.globally(Sum.ofIntegers())); input.apply(Combine.perKey(Sum.ofIntegers())); ``` -### Flatten -Merges multiple PCollections. +### Flatten / Partition ```java PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3); PCollection<String> merged = collections.apply(Flatten.pCollections()); ``` -### Partition -Splits a PCollection into multiple PCollections. - ## Windowing -### Types -- **Fixed Windows** - Regular, non-overlapping intervals -- **Sliding Windows** - Overlapping intervals -- **Session Windows** - Gaps of inactivity define boundaries -- **Global Window** - All elements in one window (default) +Types: Fixed (non-overlapping), Sliding (overlapping), Session (gap-based), Global (default). ```java input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5)))); @@ -146,7 +120,6 @@ input | beam.WindowInto(beam.window.FixedWindows(300)) ``` ## Triggers -Control when results are emitted. ```java input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))) @@ -158,7 +131,6 @@ input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))) ``` ## Side Inputs -Additional inputs to ParDo. ```java PCollectionView<Map<String, Integer>> sideInput = @@ -174,7 +146,6 @@ mainInput.apply(ParDo.of(new DoFn<String, String>() { ``` ## Pipeline Options -Configure pipeline execution. ```java public interface MyOptions extends PipelineOptions { @@ -188,7 +159,6 @@ MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class); ``` ## Schema -Strongly-typed access to structured data. 
```java @DefaultSchema(AutoValueSchema.class) @@ -222,10 +192,14 @@ PCollectionTuple results = input.apply(ParDo.of(new DoFn<String, String>() { results.get(successTag).apply(WriteToSuccess()); results.get(failureTag).apply(WriteToDeadLetter()); + +// Verify: check dead letter output is non-empty in tests +PAssert.that(results.get(failureTag)).satisfies(dlq -> { + if (!dlq.iterator().hasNext()) { throw new AssertionError("dead letter output is empty"); } return null; +}); ``` ## Cross-Language Pipelines -Use transforms from other SDKs. ```python # Use Java Kafka connector from Python diff --git a/.agent/skills/ci-cd/SKILL.md b/.agent/skills/ci-cd/SKILL.md index 6b5bc3b0134d..6380ea94cb2b 100644 --- a/.agent/skills/ci-cd/SKILL.md +++ b/.agent/skills/ci-cd/SKILL.md @@ -17,7 +17,7 @@ # under the License. name: ci-cd -description: Guides understanding and working with Apache Beam's CI/CD system using GitHub Actions. Use when debugging CI failures, understanding test workflows, or modifying CI configuration. +description: Debugs CI failures, analyzes workflow logs, configures test matrices, and troubleshoots flaky tests in Apache Beam's GitHub Actions CI/CD system. Use when debugging CI failures, understanding test workflows, re-running failed checks, or modifying CI configuration. --- # CI/CD in Apache Beam @@ -44,34 +44,11 @@ Apache Beam uses GitHub Actions for CI/CD. 
Workflows are located in `.github/wor ## Key Workflows -### PreCommit -| Workflow | Description | -|----------|-------------| -| `beam_PreCommit_Java.yml` | Java build and tests | -| `beam_PreCommit_Python.yml` | Python tests | -| `beam_PreCommit_Go.yml` | Go tests | -| `beam_PreCommit_RAT.yml` | License header checks | -| `beam_PreCommit_Spotless.yml` | Code formatting | - -### PostCommit - Java -| Workflow | Description | -|----------|-------------| -| `beam_PostCommit_Java.yml` | Full Java test suite | -| `beam_PostCommit_Java_ValidatesRunner_*.yml` | Runner validation tests | -| `beam_PostCommit_Java_Examples_*.yml` | Example pipeline tests | - -### PostCommit - Python -| Workflow | Description | -|----------|-------------| -| `beam_PostCommit_Python.yml` | Full Python test suite | -| `beam_PostCommit_Python_ValidatesRunner_*.yml` | Runner validation | -| `beam_PostCommit_Python_Examples_*.yml` | Examples | - -### Load & Performance Tests -| Workflow | Description | -|----------|-------------| -| `beam_LoadTests_*.yml` | Load testing | -| `beam_PerformanceTests_*.yml` | I/O performance | +Naming convention: `beam_{PreCommit,PostCommit}_{Language}[_Variant].yml` + +- **PreCommit**: `Java`, `Python`, `Go`, `RAT` (license), `Spotless` (formatting) +- **PostCommit**: full test suites, `ValidatesRunner_*`, `Examples_*` +- **Performance**: `LoadTests_*`, `PerformanceTests_*` ## Triggering Tests @@ -79,11 +56,14 @@ Apache Beam uses GitHub Actions for CI/CD. Workflows are located in `.github/wor - PRs trigger PreCommit tests - Merges trigger PostCommit tests -### Triggering Specific Workflows -Use [trigger files](https://github.com/apache/beam/blob/master/.github/workflows/README.md#running-workflows-manually) to run specific workflows. +### Re-running Specific Workflows +```bash +# Via GitHub CLI +gh workflow run beam_PreCommit_Java.yml --ref your-branch -### Workflow Dispatch -Most workflows support manual triggering via GitHub UI. 
+# Or use trigger files per the workflows README +``` +See [trigger files](https://github.com/apache/beam/blob/master/.github/workflows/README.md#running-workflows-manually) for the full trigger phrase catalog. ## Understanding Test Results @@ -95,17 +75,25 @@ Most workflows support manual triggering via GitHub UI. ### Common Failure Patterns -#### Flaky Tests -- Random failures unrelated to change -- Solution: Use [trigger files](https://github.com/apache/beam/blob/master/.github/workflows/README.md#running-workflows-manually) to re-run the specific workflow. - -#### Timeout -- Increase timeout in workflow if justified -- Or optimize test - -#### Resource Exhaustion -- GCP quota issues -- Check project settings +#### Debugging Workflow +1. **Check if flaky**: review the workflow's recent runs in the Actions tab for the same test + ```bash + gh run list --workflow=beam_PreCommit_Java.yml --limit=10 + ``` +2. **If flaky**: re-run the workflow + ```bash + gh run rerun --failed + ``` +3. **If consistent**: reproduce locally using the same command from the workflow's `run` step + ```bash + # Example: find the failing command in the workflow file + grep -A5 'run:' .github/workflows/beam_PreCommit_Java.yml + # Then run it locally + ./gradlew :sdks:java:core:test --info 2>&1 | tail -50 + ``` +4. **If timeout**: check test runtime with `--info` flag; increase timeout only if justified +5. **If resource exhaustion**: check GCP quotas in project settings +6. 
**Verify fix**: push and confirm the workflow passes in the PR checks tab ## GCP Credentials @@ -159,17 +147,20 @@ jobs: ## Local Debugging -### Run Same Commands as CI -Check workflow file's `run` commands: +Reproduce CI commands locally by reading the workflow's `run` step: ```bash -./gradlew :sdks:java:core:test +# Java PreCommit equivalent +./gradlew :sdks:java:core:test --info + +# Python PreCommit equivalent ./gradlew :sdks:python:test -``` -### Common Issues -- Clean gradle cache: `rm -rf ~/.gradle .gradle` -- Remove build directory: `rm -rf build` -- Check Java version matches CI +# If build state is stale +rm -rf ~/.gradle/caches .gradle build && ./gradlew clean + +# Verify Java version matches CI +java -version # CI typically uses JDK 11 +``` ## Snapshot Builds diff --git a/.agent/skills/contributing/SKILL.md b/.agent/skills/contributing/SKILL.md index bac50c5d0cd5..d280b1f9c098 100644 --- a/.agent/skills/contributing/SKILL.md +++ b/.agent/skills/contributing/SKILL.md @@ -17,7 +17,7 @@ # under the License. name: contributing -description: Guides the contribution workflow for Apache Beam, including creating PRs, issue management, code review process, and release cycles. Use when contributing code, creating PRs, or understanding the contribution process. +description: Guides the Apache Beam contribution workflow including creating pull requests, signing the CLA, running precommit checks, labeling issues, and following commit conventions. Use when contributing code, submitting a pull request, understanding how to contribute, or navigating the code review process. --- # Contributing to Apache Beam @@ -69,8 +69,10 @@ description: Guides the contribution workflow for Apache Beam, including creatin - Use descriptive commit messages ### 5. 
Create Pull Request +- Run pre-commit tests locally before pushing: `./gradlew javaPreCommit` (Java), `./gradlew :sdks:python:test` (Python) +- Verify tests pass, then push and open PR - Link to the issue in PR description -- Pre-commit tests run automatically +- Pre-commit tests run automatically on the PR - If tests fail unrelated to your change, comment: `retest this please` ### 6. Code Review diff --git a/.agent/skills/gradle-build/SKILL.md b/.agent/skills/gradle-build/SKILL.md index 8de98f4fb95b..083ba031df40 100644 --- a/.agent/skills/gradle-build/SKILL.md +++ b/.agent/skills/gradle-build/SKILL.md @@ -17,7 +17,7 @@ # under the License. name: gradle-build -description: Guides understanding and using the Gradle build system in Apache Beam. Use when building projects, understanding dependencies, or troubleshooting build issues. +description: Configures build.gradle files, runs Gradle tasks, resolves dependency conflicts, and debugs compilation errors in Apache Beam's mono-repo build system. Use when running ./gradlew commands, troubleshooting build failures, managing dependencies, or understanding the BeamModulePlugin. --- # Gradle Build System in Apache Beam @@ -207,18 +207,12 @@ rm -rf .gradle rm -rf build ``` -### Common Errors - -#### NoClassDefFoundError -- Run `./gradlew clean` -- Delete gradle cache - -#### Proto-related Errors -- Regenerate protos: `./gradlew generateProtos` - -#### Dependency Conflicts -- Check dependencies: `./gradlew dependencies` -- Use `--scan` for detailed analysis +### Troubleshooting Workflow +1. If build fails, check error type in output +2. **NoClassDefFoundError**: run `./gradlew clean` then retry; if persists, delete `~/.gradle/caches` +3. **Proto-related errors**: run `./gradlew generateProtos` then retry build +4. **Dependency conflicts**: run `./gradlew :module:dependencies --configuration runtimeClasspath` to inspect, use `--scan` for detailed analysis +5. 
Verify fix: re-run the original build command to confirm success ### Useful Tasks diff --git a/.agent/skills/io-connectors/SKILL.md b/.agent/skills/io-connectors/SKILL.md index 596b602add6c..adccc20d6877 100644 --- a/.agent/skills/io-connectors/SKILL.md +++ b/.agent/skills/io-connectors/SKILL.md @@ -17,7 +17,7 @@ # under the License. name: io-connectors -description: Guides development and usage of I/O connectors in Apache Beam. Use when working with I/O connectors, creating new connectors, or debugging data source/sink issues. +description: Implements read/write transforms, configures connection parameters, and tests I/O connectors (BigQuery, Kafka, JDBC, Pub/Sub, GCS) in Apache Beam. Use when reading from or writing to external data sources, creating new connectors, or debugging data source/sink issues. --- # I/O Connectors in Apache Beam @@ -191,7 +191,11 @@ Beam supports using I/O connectors from one SDK in another via the expansion ser ## Creating New Connectors See [Developing I/O connectors](https://beam.apache.org/documentation/io/developing-io-overview) -Key components: -1. **Source** - Reads data (bounded or unbounded) -2. **Sink** - Writes data -3. **Read/Write transforms** - User-facing API +### Workflow +1. Implement Source (bounded or unbounded read) +2. Test Source with DirectRunner: `./gradlew :sdks:java:io:myio:test` +3. Implement Sink (write) +4. Create user-facing Read/Write transforms +5. Write integration tests using `TestPipeline` +6. Run integration tests: `./gradlew :sdks:java:io:myio:integrationTest` +7. Verify both read and write paths produce expected results diff --git a/.agent/skills/java-development/SKILL.md b/.agent/skills/java-development/SKILL.md index f7e89beb895d..7544989a69b5 100644 --- a/.agent/skills/java-development/SKILL.md +++ b/.agent/skills/java-development/SKILL.md @@ -17,7 +17,7 @@ # under the License. 
name: java-development -description: Guides Java SDK development in Apache Beam, including building, testing, running examples, and understanding the project structure. Use when working with Java code in sdks/java/, runners/, or examples/java/. +description: Guides Java SDK development in Apache Beam including compiling, running unit and integration tests, building SDK containers, and publishing artifacts. Use when working with Java code in sdks/java/, runners/, or examples/java/, or running ./gradlew Java tasks. --- # Java Development in Apache Beam @@ -119,6 +119,9 @@ Set pipeline options via `-DbeamTestPipelineOptions='[...]'`: # Publish a specific module ./gradlew -Ppublishing -p sdks/java/io/kafka publishToMavenLocal +# Verify: check artifact exists +ls ~/.m2/repository/org/apache/beam/beam-sdks-java-io-kafka/ + # Publish all modules ./gradlew -Ppublishing publishToMavenLocal ``` diff --git a/.agent/skills/license-compliance/SKILL.md b/.agent/skills/license-compliance/SKILL.md index d2fc50e541f3..e38b5fa23371 100644 --- a/.agent/skills/license-compliance/SKILL.md +++ b/.agent/skills/license-compliance/SKILL.md @@ -17,13 +17,17 @@ # under the License. name: license-compliance -description: Ensures all new files include proper Apache 2.0 license headers. Use when creating any new file in the Apache Beam repository. +description: Adds, validates, and formats Apache 2.0 license headers for all file types in Apache Beam. Use when creating new files, fixing RAT check failures, adding copyright headers, or checking license compliance. --- # License Compliance in Apache Beam -## Overview -Every source file in Apache Beam **MUST** include the Apache 2.0 license header. The RAT (Release Audit Tool) check will fail if any file is missing the required license. +## Workflow +1. Create new file +2. Add the appropriate license header from templates below (must be the very first content) +3. Run `./gradlew rat` to validate +4. 
If failures, check `build/reports/rat/index.html` for details +5. Fix any missing headers and re-run until passing ## License Headers by File Type @@ -68,106 +72,17 @@ Every source file in Apache Beam **MUST** include the Apache 2.0 license header. # ``` -### Go -```go -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. -``` - -### Markdown (.md) -```markdown - -``` +### Go (`//` comments) +Same license text as above using `//` comment prefix. -### YAML (.yml, .yaml) and YAML Frontmatter -```yaml -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. 
See the License for the -# specific language governing permissions and -# limitations under the License. -``` - -### Shell Scripts (.sh, .bash) -```bash -#!/bin/bash -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -``` +### Python, YAML, Shell (`#` comments) +Same license text using `#` comment prefix. For shell scripts, place after `#!/bin/bash` shebang. -### XML, HTML +### Markdown, XML, HTML (`<!-- -->` comments) ```xml <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> ``` diff --git a/.agent/skills/python-development/SKILL.md b/.agent/skills/python-development/SKILL.md index ed71bf9acc2b..b5b339f19cfa 100644 --- a/.agent/skills/python-development/SKILL.md +++ b/.agent/skills/python-development/SKILL.md @@ -17,7 +17,7 @@ # under the License. name: python-development -description: Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/. +description: Guides Python SDK development in Apache Beam including configuring virtual environments, running pytest suites, building SDK tarballs, writing transforms and DoFns, and debugging pipeline execution. Use when working with Python code in sdks/python/, writing PCollections or ParDo transforms, or running Beam pipelines. 
--- # Python Development in Apache Beam @@ -115,6 +115,7 @@ python -m pytest -o log_cli=True -o log_level=Info \ ```bash cd sdks/python pip install build && python -m build --sdist +# Verify: ls -la dist/*.tar.gz # Output: sdks/python/dist/apache-beam-X.XX.0.dev0.tar.gz ``` diff --git a/.agent/skills/runners/SKILL.md b/.agent/skills/runners/SKILL.md index f92943ab097c..cf7a31f1926d 100644 --- a/.agent/skills/runners/SKILL.md +++ b/.agent/skills/runners/SKILL.md @@ -17,7 +17,7 @@ # under the License. name: runners -description: Guides understanding and working with Apache Beam runners (Direct, Dataflow, Flink, Spark, etc.). Use when configuring pipelines for different execution environments or debugging runner-specific issues. +description: Configures, optimizes, and troubleshoots Apache Beam runners (Direct, Dataflow, Flink, Spark) for pipeline execution. Use when configuring pipelines for different execution environments, debugging runner-specific issues, or running ValidatesRunner tests. --- # Apache Beam Runners @@ -230,9 +230,14 @@ Start Spark job server: ## Debugging +### Debugging Workflow +1. Run pipeline with DirectRunner first (`--runner=DirectRunner --targetParallelism=1`) to isolate logic errors +2. Enable debug logging: `-Dorg.slf4j.simpleLogger.defaultLogLevel=debug` +3. If DirectRunner passes but target runner fails, check runner-specific constraints below +4. 
Verify fix by re-running on the target runner + ### Direct Runner -- Enable logging: `-Dorg.slf4j.simpleLogger.defaultLogLevel=debug` -- Use `--targetParallelism=1` for deterministic execution +- Use `--targetParallelism=1` for deterministic, reproducible execution ### Dataflow - Check Dataflow UI: console.cloud.google.com/dataflow diff --git a/pr_description.md b/pr_description.md new file mode 100644 index 000000000000..2a0d412844f8 --- /dev/null +++ b/pr_description.md @@ -0,0 +1,59 @@ +Hullo @apache 👋 + +I ran your skills through `tessl skill review` at work and found some targeted improvements. Here's the full before/after: + +| Skill | Before | After | Change | +|-------|--------|-------|--------| +| license-compliance | 70% | 96% | +26% | +| gradle-build | 77% | 96% | +19% | +| io-connectors | 77% | 94% | +17% | +| beam-concepts | 77% | 90% | +13% | +| contributing | 81% | 94% | +13% | +| python-development | 77% | 90% | +13% | +| ci-cd | 81% | 89% | +8% | +| java-development | 83% | 90% | +7% | +| runners | 85% | 90% | +5% | + +![Score Card](score_card.png) + +
+Changes summary + +**Descriptions (all 9 skills)** +- Expanded action verbs beyond generic "Guides understanding" to specific actions like "Configures, debugs, implements" +- Added natural trigger terms users would actually type (e.g., "build.gradle", "gradlew", "pull request", "CLA", "RAT check") +- Ensured every description has an explicit "Use when..." clause with multiple trigger scenarios + +**beam-concepts**: Removed explanatory prose Claude already knows (historical context, property definitions), tightened PCollection/PTransform descriptions, added verification step to Dead Letter Queue pattern + +**ci-cd**: Replaced verbose workflow tables with compact naming convention reference, added concrete `gh` CLI commands for listing/rerunning workflows, added executable debugging workflow with copy-paste ready commands + +**contributing**: Added validation checkpoint to run pre-commit tests locally before pushing, expanded trigger terms to include "pull request", "CLA", "how to contribute" + +**gradle-build**: Replaced flat error list with structured troubleshooting workflow including explicit verification steps for each error type + +**io-connectors**: Replaced bare component list for creating new connectors with step-by-step workflow including test and verification checkpoints + +**java-development**: Added artifact verification step after `publishToMavenLocal` + +**license-compliance**: Added explicit 5-step compliance workflow with RAT check validation loop, consolidated repetitive license headers (8 near-identical blocks) into grouped format by comment style + +**python-development**: Added tarball verification step after building source distribution + +**runners**: Added structured debugging workflow: start with DirectRunner to isolate logic errors, then escalate to target runner + +
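As a concrete illustration of the "add verification steps" theme above: the dead-letter addition to beam-concepts boils down to "route failures to a second output, then assert that output is non-empty in tests." Here is a dependency-free Python sketch of that routing logic (hypothetical helper names, not the Beam API; the real pattern uses `ParDo` with output tags and `PAssert`):

```python
# Plain-Python sketch of the success/failure (dead letter) routing pattern.
# Hypothetical `route` helper; in Beam this is a ParDo emitting to
# successTag / failureTag, collected into a PCollectionTuple.

def route(records, parse):
    """Partition records into parsed successes and raw failures."""
    successes, failures = [], []
    for rec in records:
        try:
            successes.append(parse(rec))
        except Exception:
            # Dead letter: keep the raw record for later inspection.
            failures.append(rec)
    return successes, failures

successes, failures = route(["1", "2", "oops", "3"], int)
print(successes)  # [1, 2, 3]
print(failures)   # ['oops']
```

The verification step added to the skill is the analogue of asserting `failures` is non-empty when the test input contains known-bad records.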
+ +--- + + - [x] No issue referenced (skill improvements only, no functional code changes) + - [ ] Update `CHANGES.md` with noteworthy changes. *(N/A — skill files only)* + - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). *(Small contribution — skill metadata and content improvements only)* + +--- + +Honest disclosure — I work at @tesslio where we build tooling around skills like these. Not a pitch - just saw room for improvement and wanted to contribute. + +Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me - [@popey](https://github.com/popey) - if you hit any snags. + +Thanks in advance 🙏