Fix cross-test pollution from engineered homeservers re-using the same server names (fix TestPartialStateJoin flakes)#880
Draft
MadLittleMods wants to merge 7 commits into
Conversation
MadLittleMods
commented
Jun 12, 2026
Comment on lines
+565
to
+575
| // We use this sequential port scan strategy over guess and check with an OS-assigned | ||
| // port (by using `:0`) as it's more efficient. The OS may recycle and re-use freed | ||
| // ports meaning we could regress to O(n^2) behavior trying to search for each new | ||
| // port we want to find. | ||
| // | ||
| // Using `:0` means an unused port is automatically picked for us (could be random, | ||
| // could be the next sequential unused port, we don't know). Ideally, we could ask for | ||
| // the next unused port after X to avoid a bunch of work. When using using `:0`, the | ||
| // pathological case that is O(n^2) is if OS hands back next lowest unused port | ||
| // sequentially which would mean we would have to probe and hold each listener until | ||
| // we finally got something new. |
Collaborator
Author
There was a problem hiding this comment.
If you want to compare to the :0 solution, see the implementation in 820b428
I think this is the better way to go.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem are we trying to solve?
This PR is tackling this problem:
This problem occurs because as an optimization we try to share the
deploymentacross many tests. To be clear, as a dumb-simple solution, if we created a newdeploymentfor each test, we wouldn't run into this issue (but then the tests would be "slow" again).What does this PR do?
Spawning from #878 (comment) where I originally thought we could prevent stray requests to a server just from the federation request authentication but turns out, it's already doing the right thing. The only thing we can do is compare the
destinationto our server name whichVerifyHTTPRequestis already doing. And since the crux of this whole thing is that two Complement engineered servers can use the same recycled port and have identical server names (host.docker.internal:port), it doesn't help us here. The rest of the public key stuff that is part of federation requests is just for the origin server to sign itself with, for authenticity.Instead, this PR takes an alternative route where we instead never re-use the same
server_namefor a Complement engineered homeserver. This means that even if the real homeserver reaches out to the torn-down engineered homeservers, it just hits 'connection refused' instead of a live, unrelated server. No cross-test pollution.In the context of
tests/msc3902/federation_room_join_partial_state_test.go, it means the engineered homeserver will never see at.Errorf("Received unexpected PDU", ...)problem.How do we currently deal with this?
Currently, we try to workaround the stray request problem by making sure the engineered homeserver leaves any rooms it shares with the real homeserver. This means the real homeserver has no logical reason to contact the other server but it's still technically possible that it might reach out because of some queued federation traffic, etc. I find this approach/pattern pretty bad (especially in the way it's currently implemented). See the "Fix" explanation on #878 for more info.
See all of the logic around
partialStateJoinResult.DestroyandWithWaitForLeaveto see how messy this gets and all of the logic that we get to remove because of this new approach.Dev notes
:0behavior with https://pkg.go.dev/net#ListenTodo
tests/msc3902/federation_room_join_partial_state_test.gowhich is no longer necessaryPull Request Checklist