Skip to content

[DRAFT] experiment: add test sharding#17438

Draft
daniel-sanche wants to merge 41 commits into
mainfrom
ci_sharding
Draft

[DRAFT] experiment: add test sharding#17438
daniel-sanche wants to merge 41 commits into
mainfrom
ci_sharding

Conversation

@daniel-sanche

@daniel-sanche daniel-sanche commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Experimenting with test sharding

Unit Tests

  • if less than 10 packages are changed, execute one test per python version (existing behaviour)
  • if more than 10 packages would be in a test, add an additional shard for each python runtime, and spread out packages between them
  • use an upper limit of 10 shards (per runtime), with packages evenly distributed
  • adds an end unit test complete step, which is only green if all shards pass. This can be our new required check for unit tests

System Tests

  • Previously system tests run sequentially as a single script
  • With this change, they run in parallel, reported as separate Targets in TestGrid
  • test logs are still all combined into the primary build.log for viewing in one place

Fail Fast

  • if any single unit test shard or system test job fails, the remaining end early to provide quick feedback

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces parallel execution for system tests and implements a sharding mechanism for CI jobs, including a new Python script to group packages and updates to the test runner script. Feedback on these changes focuses on improving reliability and safety: first, by using an EXIT trap in .kokoro/system.sh to guarantee cleanup of isolated gcloud configuration directories in case of test failures; and second, by avoiding global toggles of set -e in ci/run_conditional_tests.sh and instead capturing test exit codes using the || operator.

Comment thread .kokoro/system.sh Outdated
Comment on lines +42 to +43
# Isolate gcloud config for parallel execution
export CLOUDSDK_CONFIG=$(mktemp -d)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since run_package_test runs with set -e enabled, any test failure or error will cause the subshell to exit immediately. This prevents the manual cleanup at the end of the function from running, leading to leaked temporary directories in /tmp.

Using an EXIT trap ensures that the temporary directory is reliably cleaned up when the subshell exits, regardless of whether the tests succeeded or failed.

Suggested change
# Isolate gcloud config for parallel execution
export CLOUDSDK_CONFIG=$(mktemp -d)
# Isolate gcloud config for parallel execution
export CLOUDSDK_CONFIG=$(mktemp -d)
trap 'rm -rf "${CLOUDSDK_CONFIG}"' EXIT

Comment thread .kokoro/system.sh Outdated
Comment on lines +93 to +94
# Clean up isolated gcloud config
rm -rf "${CLOUDSDK_CONFIG}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

With the EXIT trap handling the cleanup of CLOUDSDK_CONFIG automatically, this manual cleanup is redundant and can be safely removed.

Comment on lines +92 to +95
set +e
${test_script}
ret=$?
set -e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Toggling set -e globally can lead to unexpected side effects or mask errors in other parts of the script if not carefully managed. A more idiomatic and safer way to capture the exit status of a command under set -e is to use the || operator, which prevents the shell from exiting on failure without needing to disable set -e globally.

Suggested change
set +e
${test_script}
ret=$?
set -e
ret=0
${test_script} || ret=$?

@daniel-sanche daniel-sanche changed the title [DRAFT] chore: add test sharding [DRAFT] experiment: add test sharding Jun 12, 2026
@daniel-sanche daniel-sanche added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 15, 2026
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 15, 2026
@daniel-sanche

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a package sharding mechanism for CI testing, adding a new Python script ci/get_package_shards.py and updating ci/run_conditional_tests.sh to support package lists, unique coverage files, and fallback sharding. It also configures test coverage reporting for sqlalchemy-spanner and updates coverage requirements for google-cloud-firestore. The review feedback highlights two key improvements in ci/get_package_shards.py: resolving potential path separator mismatches on Windows by avoiding mixed usage of os.path.join and forward slashes, and refining the sharding distribution logic to prevent highly unbalanced shards when the shard count is capped.

Comment thread ci/get_package_shards.py
if not os.path.exists(subdir):
continue
# Use the same sorting as the shell script
pkg_dirs = [os.path.join(subdir, d) + '/' for d in os.listdir(subdir) if os.path.isdir(os.path.join(subdir, d))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using os.path.join and then appending a hardcoded forward slash (/) can result in mixed path separators on Windows (e.g., packages\\google-cloud-bigtable/). Since git diff outputs always use forward slashes, this mismatch will cause the startswith check in get_packages_to_test to fail on Windows environments.

Using a consistent forward slash format for package paths avoids this issue.

Suggested change
pkg_dirs = [os.path.join(subdir, d) + '/' for d in os.listdir(subdir) if os.path.isdir(os.path.join(subdir, d))]
pkg_dirs = [f"{subdir}/{d}/" for d in os.listdir(subdir) if os.path.isdir(os.path.join(subdir, d))]

Comment thread ci/get_package_shards.py
Comment on lines +64 to +72
shard_size = math.ceil(num_packages / num_shards)

shards = []
for i in range(num_shards):
start = i * shard_size
end = min((i + 1) * shard_size, num_packages)
if start >= num_packages:
break
shard_packages = packages[start:end]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using math.ceil(num_packages / num_shards) as a fixed chunk size can lead to highly unbalanced shards when the number of shards is capped at 10. For example, if there are 101 packages, num_shards is capped at 10, and shard_size becomes ceil(101/10) = 11. This results in 9 shards of size 11 and 1 shard of size 2, which is highly unbalanced and defeats the purpose of efficient parallel sharding.

Using a contiguous even distribution formula ensures that the difference in package count between any two shards is at most 1.

Suggested change
shard_size = math.ceil(num_packages / num_shards)
shards = []
for i in range(num_shards):
start = i * shard_size
end = min((i + 1) * shard_size, num_packages)
if start >= num_packages:
break
shard_packages = packages[start:end]
shards = []
for i in range(num_shards):
start = (i * num_packages) // num_shards
end = ((i + 1) * num_packages) // num_shards
shard_packages = packages[start:end]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants