Skip to content

Fix cross-test pollution from engineered homeservers re-using the same server names (fix TestPartialStateJoin flakes)#880

Draft
MadLittleMods wants to merge 7 commits into
mainfrom
madlittlemods/fix-test-partial-state-join-cleanup
Draft

Fix cross-test pollution from engineered homeservers re-using the same server names (fix TestPartialStateJoin flakes)#880
MadLittleMods wants to merge 7 commits into
mainfrom
madlittlemods/fix-test-partial-state-join-cleanup

Conversation

@MadLittleMods

@MadLittleMods MadLittleMods commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

What problem are we trying to solve?

This PR is tackling this problem:

Many of the faster joins test flakes are due to the homeserver under test failing to contact Complement homeservers after they have been torn down. When this happens, subsequent tests can fail if they use a Complement homeserver that happens to have the same hostname:port as one which the homeserver under test has previously marked as offline.

-- #626

This problem occurs because as an optimization we try to share the deployment across many tests. To be clear, as a dumb-simple solution, if we created a new deployment for each test, we wouldn't run into this issue (but then the tests would be "slow" again).

What does this PR do?

Spawning from #878 (comment) where I originally thought we could prevent stray requests to a server just from the federation request authentication but turns out, it's already doing the right thing. The only thing we can do is compare the destination to our server name which VerifyHTTPRequest is already doing. And since the crux of this whole thing is that two Complement engineered servers can use the same recycled port and have identical server names (host.docker.internal:port), it doesn't help us here. The rest of the public key stuff that is part of federation requests is just for the origin server to sign itself with, for authenticity.

destination: [Added in v1.3] the server name of the receiving server. This is the same as the destination field from the JSON described in step 1. For compatibility with older servers, recipients should accept requests without this parameter, but MUST always send it. If this property is included, but the value does not match the receiving server’s name, the receiving server must deny the request with an HTTP status code 401 Unauthorized.

-- https://spec.matrix.org/v1.18/server-server-api/#authentication

Instead, this PR takes an alternative route where we instead never re-use the same server_name for a Complement engineered homeserver. This means that even if the real homeserver reaches out to the torn-down engineered homeservers, it just hits 'connection refused' instead of a live, unrelated server. No cross-test pollution.

In the context of tests/msc3902/federation_room_join_partial_state_test.go, it means the engineered homeserver will never see a t.Errorf("Received unexpected PDU", ...) problem.

How do we currently deal with this?

Currently, we try to workaround the stray request problem by making sure the engineered homeserver leaves any rooms it shares with the real homeserver. This means the real homeserver has no logical reason to contact the other server but it's still technically possible that it might reach out because of some queued federation traffic, etc. I find this approach/pattern pretty bad (especially in the way it's currently implemented). See the "Fix" explanation on #878 for more info.

See all of the logic around partialStateJoinResult.Destroy and WithWaitForLeave to see how messy this gets and all of the logic that we get to remove because of this new approach.

Dev notes

:0 behavior with https://pkg.go.dev/net#Listen

Todo

  • Remove cleanup from tests/msc3902/federation_room_join_partial_state_test.go which is no longer necessary

Pull Request Checklist

Comment thread federation/server.go
Comment on lines +565 to +575
// We use this sequential port scan strategy over guess and check with an OS-assigned
// port (by using `:0`) as it's more efficient. The OS may recycle and re-use freed
// ports meaning we could regress to O(n^2) behavior trying to search for each new
// port we want to find.
//
// Using `:0` means an unused port is automatically picked for us (could be random,
// could be the next sequential unused port, we don't know). Ideally, we could ask for
// the next unused port after X to avoid a bunch of work. When using using `:0`, the
// pathological case that is O(n^2) is if OS hands back next lowest unused port
// sequentially which would mean we would have to probe and hold each listener until
// we finally got something new.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to compare to the :0 solution, see the implementation in 820b428

I think this is the better way to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant