This repository was archived by the owner on Jul 13, 2025. It is now read-only.

Fork Sync: Update from parent repository #36

Open
github-actions[bot] wants to merge 1678 commits into MultiMx:main from tailscale:main

No description provided.

creachadair and others added 30 commits May 12, 2026 12:01
…onnectivity (#19699)

Add new clientmetric counters for establishing contact with peers while using
cached network map data. To do this, instrument the magicsock.Conn with a bit
to indicate whether its peer data came from a cached netmap. If so, there are
two conditions we will count as establishing connectivity to a peer:

  - Receipt of a CallMeMaybe from a peer via disco.
  - Establishing a valid endpoint address for a peer.

In vmtest, add Env.ClientMetrics to scrape metrics from the specified node.
Use this to check that counters were updated in caching tests.

Updates tailscale/projects#13
Updates #12639

Change-Id: Ie8cf3244ac8af4f5bcfe4d0d944078da2ba08990
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Fixes #12778

Change-Id: If9f8b299cef0cb68f93b344845b5c6a5b7554d2c
Signed-off-by: DeedleFake <deedlefake@users.noreply.github.com>
…services

Adds two new cap resolution methods alongside the existing PeerCaps:

PeerCapsForService(src netip.Addr, svcName tailcfg.ServiceName) resolves
the service name to its VIP addresses via the node's service IP mappings
and returns caps scoped to that service. Exposed on /v0/whois via the
svc_name query parameter and on client/local.Client as WhoIsForService.

PeerCapsForIP(src, dst netip.Addr) resolves caps against an arbitrary
destination IP. Exposed on /v0/whois via the svc_addr query parameter
and on client/local.Client as WhoIsForIP.

svc_name takes priority over svc_addr when both are present. Invalid
values for either return 400. The existing PeerCaps/WhoIs path is
unchanged: without a service parameter, WhoIs returns only host-level
caps.
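
The parameter precedence described above can be sketched as follows. This is a minimal illustration, not the actual /v0/whois handler: the `target` type, the `resolveTarget` helper, the `addr` parameter in the demo queries, and the rule that a valid service name starts with `svc:` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/netip"
	"net/url"
	"strings"
)

// target says which capability lookup a whois request selects.
// The type and its Kind values are hypothetical, for illustration only.
type target struct {
	Kind  string // "host", "service", or "ip"
	Value string
}

// resolveTarget applies the precedence described above to a raw query
// string. It returns the selected target and an HTTP status code.
func resolveTarget(rawQuery string) (target, int) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return target{}, http.StatusBadRequest
	}
	switch {
	case q.Has("svc_name"): // svc_name wins when both are present
		name := q.Get("svc_name")
		// Assumption: a valid name is "svc:" plus at least one character.
		if !strings.HasPrefix(name, "svc:") || len(name) == len("svc:") {
			return target{}, http.StatusBadRequest
		}
		return target{Kind: "service", Value: name}, http.StatusOK
	case q.Has("svc_addr"):
		addr, err := netip.ParseAddr(q.Get("svc_addr"))
		if err != nil {
			return target{}, http.StatusBadRequest
		}
		return target{Kind: "ip", Value: addr.String()}, http.StatusOK
	default: // no service parameter: host-level caps only
		return target{Kind: "host"}, http.StatusOK
	}
}

func main() {
	for _, q := range []string{
		"addr=100.64.0.1",
		"addr=100.64.0.1&svc_addr=100.100.0.9",
		"addr=100.64.0.1&svc_name=svc:db&svc_addr=100.100.0.9",
		"addr=100.64.0.1&svc_addr=not-an-ip",
	} {
		t, code := resolveTarget(q)
		fmt.Printf("%-55s -> %d %+v\n", q, code, t)
	}
}
```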

Updates tailscale/corp#41632

Signed-off-by: Adriano Sela Aviles <adriano@tailscale.com>
Replace the process-global Server.mu lookup in the packet send hot path
with a global hashtriemap mirror of local clientSet entries. The
authoritative clients map remains guarded by Server.mu; clientsAtomic is
only a lock-free fast path for active local clients.

Misses, stale inactive client sets, duplicate accounting, and mesh
forwarding still fall back to lookupDestUncached. This avoids taking
Server.mu for the common local active-client send path, at the cost of
adding one global concurrent map that mirrors Server.clients for local
peers.
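
The shape of that fast path can be sketched roughly as below. This is not the derpserver code: it uses the standard library's sync.Map as a stand-in for the hashtriemap, and the `server`, `client`, `lookup`, and `lookupSlow` names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

type client struct{ name string }

// server keeps an authoritative map guarded by mu, plus a concurrent
// mirror that readers can consult without taking the mutex.
type server struct {
	mu      sync.Mutex
	clients map[string]*client // authoritative; guarded by mu
	fast    sync.Map           // mirror of clients: string -> *client
}

func newServer() *server {
	return &server{clients: make(map[string]*client)}
}

// register updates both maps while holding mu, so the mirror never
// gains an entry the authoritative map lacks.
func (s *server) register(key string, c *client) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.clients[key] = c
	s.fast.Store(key, c)
}

func (s *server) unregister(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.clients, key)
	s.fast.Delete(key)
}

// lookup tries the lock-free mirror first and falls back to the
// mutex-guarded map on a miss.
func (s *server) lookup(key string) (*client, bool) {
	if v, ok := s.fast.Load(key); ok {
		return v.(*client), true
	}
	return s.lookupSlow(key)
}

func (s *server) lookupSlow(key string) (*client, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.clients[key]
	return c, ok
}

func main() {
	s := newServer()
	s.register("peer1", &client{name: "one"})
	c, ok := s.lookup("peer1")
	fmt.Println(c.name, ok)
}
```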

The benchmark uses four destination peers. The before run sets
TS_DEBUG_DERP_DISABLE_PEER_HASHTRIE=true to force the old mutex lookup
path; the after run uses the hashtrie fast path.

    goos: linux
    goarch: amd64
    pkg: tailscale.com/derp/derpserver
    cpu: Intel(R) Xeon(R) 6975P-C
                          │    before     │                after                │
                          │    sec/op     │   sec/op     vs base                │
    LookupDestHashTrie-16   176.050n ± 1%   1.904n ± 6%  -98.92% (p=0.000 n=10)

                          │   before   │             after              │
                          │    B/op    │    B/op     vs base            │
    LookupDestHashTrie-16   0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
    ¹ all samples are equal

                          │   before   │             after              │
                          │ allocs/op  │ allocs/op   vs base            │
    LookupDestHashTrie-16   0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
    ¹ all samples are equal

Updates #3560 (very indirectly, historically)
Updates #19713 (as an alternative to that PR)

Change-Id: Ifb72e5c9854ad00e938cd24c6ab9c27312f297e8
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This fixes a log message where ipn/ipnlocal.shouldUseOneCGNATRoute
would claim that an Android machine was actually macOS.

Updates #cleanup
Updates #19652

Signed-off-by: Simon Law <sfllaw@tailscale.com>
…19721)

This patch fixes a data race in wgengine/netstack that surfaced while
running both TestTCPForwardLimits and TestTCPForwardLimits_PerClient.
Because these two tests both set up the TS_DEBUG_NETSTACK envknob, a
race happens because netstack.Impl.Close leaked its inject goroutine.
The inject goroutine also reads the TS_DEBUG_NETSTACK envknob, so if
it is still running when the next test starts, then it will break.

This patch also cleans up the tests a bit, ensuring that neither of
them runs in T.Parallel. It also adds a T.Cleanup call to clear the
envknob.
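
One common way to avoid this kind of leak is to have Close wait for the background goroutine to exit. The sketch below shows that pattern in isolation; it is not the netstack.Impl code, and the `worker` type and its methods are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// worker owns one background goroutine, standing in for an inject loop.
type worker struct {
	cancel context.CancelFunc
	done   chan struct{} // closed when the goroutine has exited
}

func start() *worker {
	ctx, cancel := context.WithCancel(context.Background())
	w := &worker{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(w.done)
		<-ctx.Done() // a real loop would also process work here
	}()
	return w
}

// Close signals the goroutine to stop and waits until it has exited,
// so nothing is still running when the next test begins.
func (w *worker) Close() {
	w.cancel()
	<-w.done
}

// stopped reports whether the goroutine has exited.
func (w *worker) stopped() bool {
	select {
	case <-w.done:
		return true
	default:
		return false
	}
}

func main() {
	w := start()
	fmt.Println("stopped before Close:", w.stopped())
	w.Close()
	fmt.Println("stopped after Close:", w.stopped())
}
```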

Fixes #19720

Signed-off-by: Simon Law <sfllaw@tailscale.com>
Fixes tailscale/corp#40250

Signed-off-by: Fran Bull <fran@tailscale.com>
)

Instead of having two entry points for running natlab tests, start
converting the connectivity tests to use the vmtest framework.

Grid and pair tests have yet to be moved over.

Updates #13038

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
A missing hosts file is not a fatal error. We should log it, but still proceed
and create a new one instead of failing the DNS reconfiguration completely.
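
A minimal sketch of that behavior, assuming a hypothetical `readHostsOrEmpty` helper (the actual function names in the DNS code are not shown here): a missing file is logged and treated as empty, while any other read error is still returned.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"log"
	"os"
)

// readHostsOrEmpty returns the file contents. A missing file is logged
// and treated as empty so that the caller can create a new one; any
// other error is passed through.
func readHostsOrEmpty(path string) ([]byte, error) {
	b, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		log.Printf("hosts file %q not found; a new one will be created", path)
		return nil, nil
	}
	return b, err
}

func main() {
	b, err := readHostsOrEmpty("missing-dir/hosts")
	fmt.Println(len(b), err)
}
```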

Fixes #19733

Signed-off-by: Nick Khyl <nickk@tailscale.com>
Adds a new NoiseRoundTripper field to tsd.Sys
to expose an http.RoundTripper to make requests
over the control plane Noise connection.

This will be used in PAM use cases soon.

Updates tailscale/corp#41800

Signed-off-by: Adriano Sela Aviles <adriano@tailscale.com>
…ns unchanged

Warnables with a non-zero TimeToVisible are only published on the eventbus when
they remain unhealthy long enough to become visible.

However, we still publish a health.Change when a warning that was never visible
(and was never published to the eventbus) becomes healthy.

This PR fixes that and reduces churn when there is no actual state change. In
particular, it avoids unnecessary IPN bus notifications sent to GUI/CLI clients,
captive portal detection, etc.
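
The rule can be illustrated with a small sketch: a "became healthy" change is published only for warnings that subscribers were ever told about. The `tracker` type and its methods are hypothetical and are not the health package's API.

```go
package main

import "fmt"

// tracker records which warnings have actually been published.
type tracker struct {
	published map[string]bool
	events    []string // stand-in for the eventbus
}

func newTracker() *tracker {
	return &tracker{published: make(map[string]bool)}
}

// becameVisible is called once a warning has stayed unhealthy past its
// time-to-visible; only then is the unhealthy state published.
func (t *tracker) becameVisible(name string) {
	if t.published[name] {
		return
	}
	t.published[name] = true
	t.events = append(t.events, "unhealthy:"+name)
}

// becameHealthy publishes a change only if the warning was published
// before. A warning that never became visible produces no event.
func (t *tracker) becameHealthy(name string) {
	if !t.published[name] {
		return
	}
	delete(t.published, name)
	t.events = append(t.events, "healthy:"+name)
}

func main() {
	tr := newTracker()
	tr.becameHealthy("dns") // never visible: no event
	tr.becameVisible("dns")
	tr.becameHealthy("dns")
	fmt.Println(tr.events)
}
```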

Updates tailscale/corp#39759 (noticed while working on it)

Signed-off-by: Nick Khyl <nickk@tailscale.com>
Server.clientsAtomic was introduced in 6b72979 as a lock-free
mirror of Server.clients to skip Server.mu on the packet send hot
path. This drops the non-concurrent map and makes all the existing
callers of the old plain map just use the concurrent map, but still
holding Server.mu.

BenchmarkLookupDestHashTrie is unchanged at ~2ns/op.

Fixes #19726

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I0894e4d86914d152b9b5fef969a3184bcb96f678
…etry

Brings Subscriber[T] in line with the same non-generic-core pattern already
applied to SubscriberFunc[T] and Publisher[T]:

  - Renames subscriberFuncCore to subscriberCore and shares it between
    Subscriber[T] and SubscriberFunc[T]. Both typed facades hold a
    *subscriberCore plus their respective per-T delivery state
    (Subscriber: chan T; SubscriberFunc: nothing, the user callback is
    captured in the dispatch closure).

  - The bus's outputs map and subscriber-interface itab key on
    *subscriberCore for both subscriber kinds, so adding a new Subscribe[T]
    call site no longer pays a per-T itab, dictionary, or equality function
    for the subscriber-interface side.

  - Subscribe[T] now hoists the non-generic constructor portion into
    newSubscriberCore (timer setup, core allocation, cached type/typeName,
    unregister method-value), matching SubscribeFunc.

The dispatch loop is intentionally NOT extracted to a non-generic helper for
Subscriber[T], unlike SubscriberFunc[T]. The reason is the typed channel send
'case s.read <- t:' must appear lexically inside the select; the only way to
lift it into a non-generic loop is to bridge typed and untyped via a per-event
goroutine, which costs ~2.7x throughput on BenchmarkBasicThroughput. We keep
dispatchTyped on the generic facade and accept the per-shape stencil cost as
the cheaper alternative.

Symbol-level effect on tailscaled (linux/amd64, measured via
`go tool nm -size`):

  Before:
    (*Subscriber[T]).dispatch
      2 shape stencils:        1,682 + 1,549 = 3,231 B
      3 thin per-T wrappers:   124 B each   =   372 B
      2 deferwrap1 helpers:    62 B each    =   124 B
      total:                                 3,727 B

  After:
    (*Subscriber[T]).dispatchTyped
      2 shape stencils:        1,678 + 1,582 = 3,260 B
      0 per-T wrappers (replaced by closure stored on core)
      2 deferwrap1 helpers:    62 B each    =   124 B
      total:                                 3,384 B

  dispatch path .text delta:                   -343 B (-9.2%)

Per-shape stencils are ~1,600 B (.text body) + ~1,100 B (pclntab) =
~2,700 B each on production tailscaled. The shape count matches before/after
(two distinct GC shapes for the Subscriber[T] event types in this binary).
What changes is that the per-T thin wrappers are eliminated because
Subscriber[T] no longer implements the subscriber interface directly.

Whole-binary section deltas:

  .text:        -2,304 B  (includes the dispatch savings plus other
                            small downstream effects)
  .rodata:        +512 B  (additional closure-type metadata)
  .gopclntab:   -2,981 B  (fewer per-T compiled functions => less metadata)

Stripped tailscaled (linux/amd64): no change at the file level (the savings
fall below the linker's section-alignment boundary). Unstripped builds shrink
by ~2,900 B.

Behavior is unchanged:
  BenchmarkBasicThroughput:       2,161 ns/op,  0 B/op,  0 allocs/op
  BenchmarkBasicFuncThroughput:   2,493 ns/op, 144 B/op, 2 allocs/op
  BenchmarkSubsThroughput:        3,727 ns/op,  0 B/op,  0 allocs/op

Updates #12614

Change-Id: I97918ec68bd2cdb15958bbfd7687592b39663efe
Signed-off-by: James Tucker <james@tailscale.com>
…eck (#19725)

Fix the following issues:

1. Endianness Bug: The nftables runner used hardcoded
   big-endian byte arrays for firewall mark values (0xff0000, etc.), breaking
   bitwise operations on little-endian systems (all x86/x64, ARM). This caused
   connmark save/restore rules to silently fail. Fixed by using
   binary.NativeEndian to generate correct byte order for the host system.

2. Connmark Restore Conditional Check: The connmark restore
   mechanism unconditionally overwrote packet marks, even when Tailscale
   hadn't set any mark bits in conntrack. This destroyed mark bits set by
   other systems (VPNs, policy routing, vendor flags), breaking coexistence.
   Fixed by adding a conditional check to only restore when (ct mark &
   0xff0000) != 0, preventing the worst case of wiping all marks to zero.

Changes:
- util/linuxfw/linuxfw.go: Added nativeEndianUint32() helper and updated
  all mask functions to use native byte order instead of hardcoded bytes
- util/linuxfw/nftables_runner.go: Added conditional check in
  makeConnmarkRestoreExprs() to only restore when ct mark has Tailscale
  bits set; added detailed comment about bit preservation limitations
- util/linuxfw/iptables_runner.go: Added conditional check using -m
  connmark ! --mark to match nftables behavior
- Tests updated: Fixed byte-level regression tests to expect little-endian
  byte sequences and verify the new conditional check

Note: Perfect bit preservation in nftables remains challenging
due to nftables expression VM limitations. The current implementation
prevents the critical case of wiping marks with zero.
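
Both fixes can be sketched in a few lines. This is an illustration under assumptions, not the util/linuxfw code: the helper below shares its name with the one mentioned above, but its signature is a guess, and `shouldRestore` is a hypothetical predicate modeling the (ct mark & 0xff0000) != 0 check. binary.NativeEndian requires Go 1.21 or later.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// tailscaleMarkMask is the mask value quoted in the description above.
const tailscaleMarkMask uint32 = 0xff0000

// nativeEndianUint32 encodes v in the host's byte order, instead of a
// hardcoded big-endian byte array. (Signature is an assumption.)
func nativeEndianUint32(v uint32) []byte {
	b := make([]byte, 4)
	binary.NativeEndian.PutUint32(b, v)
	return b
}

// roundTrip decodes the encoded bytes again in host byte order.
func roundTrip(v uint32) uint32 {
	return binary.NativeEndian.Uint32(nativeEndianUint32(v))
}

// shouldRestore models the conditional check: restore the packet mark
// only when conntrack carries bits inside the Tailscale mask.
func shouldRestore(ctMark uint32) bool {
	return ctMark&tailscaleMarkMask != 0
}

func main() {
	fmt.Printf("mask bytes on this host: % x\n", nativeEndianUint32(tailscaleMarkMask))
	fmt.Println("restore for ct mark 0:", shouldRestore(0))
	fmt.Println("restore for ct mark 0x40000:", shouldRestore(0x40000))
}
```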

Updates #3310
Fixes #11803
Related to #8555

Signed-off-by: Mike O'Driscoll <mikeo@tailscale.com>
The codegen path for map-of-slice-of-pointer fields skipped
nil-valued entries, which dropped the key from the map.

This broke how dns.Config.Routes uses nil values as sentinels.
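
A hand-written sketch of a clone that keeps nil-valued keys, for illustration only. The generated code is not shown here, and the `resolver` type and `cloneRoutes` function are hypothetical.

```go
package main

import "fmt"

type resolver struct{ Addr string }

// cloneRoutes deep-copies a map of slices of pointers. A nil slice is
// kept as a nil value under the same key rather than being skipped,
// so callers that treat nil as a sentinel still see the key.
func cloneRoutes(src map[string][]*resolver) map[string][]*resolver {
	if src == nil {
		return nil
	}
	dst := make(map[string][]*resolver, len(src))
	for k, v := range src {
		if v == nil {
			dst[k] = nil // preserve the key with its nil sentinel
			continue
		}
		out := make([]*resolver, len(v))
		for i, p := range v {
			if p != nil {
				c := *p
				out[i] = &c
			}
		}
		dst[k] = out
	}
	return dst
}

func main() {
	src := map[string][]*resolver{
		"corp.example.":  {{Addr: "10.0.0.53"}},
		"local.example.": nil,
	}
	dst := cloneRoutes(src)
	_, ok := dst["local.example."]
	fmt.Println("nil-valued key preserved:", ok)
}
```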

Fixes #19730
Fixes #19732
Fixes #19746
Fixes #19744

Change-Id: Ic6400227f4ab21b3ca0e8c0eeecf9b83d145a9ab

Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
The label "natlab" is a bit confusing and also used for other things.
Instead, change the trigger label to "run-natlab-tests".

Updates #13038

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
In a lot of places, we construct an error to End a step, then immediately
report it to the governing test as fatal. Save ourselves a bit of boilerplate
by putting methods on Step for that.

There are a couple cases this doesn't cover, e.g., where we construct the Step
outside a subtest that wants to fail individually, but it helps enough to pay
for its lines.

Updates #13038

Change-Id: I71f9900942962de16609b6b198d3ba13d6958a5f
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
…#19758)

Their version scheme is different, even though the OS is based on
Ubuntu. We need to check Zorin's version numbers to pick the right
APT_KEY_TYPE.

Updates #18925

Signed-off-by: Andrew Lytvynov <awly@tailscale.com>
Add a VM-based natlab test that exercises the peer-relay feature
(feature/relayserver) end-to-end across three Tailscale nodes whose
network topology makes a direct A<->B UDP path impossible: both peers
are behind HardNAT (FreeBSD/pfSense-style endpoint-dependent NAT) with
no port-mapping services, while the relay node is behind One2OneNAT so
its STUN-discovered WAN endpoint is reachable from both peers. The
test enables the relay server via EditPrefs, then waits for an a->b
PingDisco whose PingResult.PeerRelay is set (proving magicsock chose
the peer-relay path, not DERP), and finally asserts that the relay's
DebugPeerRelaySessions LocalAPI reports the session.

The existing TestPeerRelayPing in tstest/integration runs three
tailscaled processes on the loopback interface with no NATs; this new
vmtest covers peer relay through real per-VM kernels and NATs.

To wire control-server capabilities into vmtest, also add a
PeerRelayGrants() EnvOption (sibling of AllOnline,
SameTailnetUser) that flips testcontrol.Server.PeerRelayGrants so the
wildcard packet filter grants tailcfg.PeerCapabilityRelay and
PeerCapabilityRelayTarget; without those caps magicsock won't consider
any peer a candidate relay.

Updates #13038

Change-Id: Ib3440b83ec442da0d3b89ffa48ceea9398ea9062
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Since f343b49 ("wgengine, all: remove LazyWG, use wireguard-go
callback API for on-demand peers"), Reconfig is fully synchronous:
magicConn.UpdatePeers, wgdev.RemovePeer, router.Set, and dns.Set all
return when the work is done, and the peer list is updated under
wgLock before Reconfig returns. So after Reconfig with empty configs,
len(st.Peers) is already 0.

The old loop also waited for st.DERPs to drain to 0, but UpdatePeers
only edits maps; active DERP connections idle out on their own
timeout. The sole caller (LocalBackend.stopEngineAndWait) doesn't
inspect st.DERPs anyway; it just hands the Status to
setWgengineStatusLocked. So the drain-wait was for nothing observable
and could theoretically (or at least appear to readers to) loop
forever holding b.mu. Remove that reader confusion by removing
the backoff loop entirely.

Updates #19759

Change-Id: Ibfac3f0baabcad7604b713c934a8fc37932e0a50
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
…scale CI

cibuild.On() returns true for any CI environment that sets CI=true,
including Alpine Linux's package build CI. TestTsgoRevInCacheKey was
guarded by cibuild.On() (or use of tsgo), so it ran under Alpine's CI
with stock Go, where go.toolchain.rev isn't blended into build cache
keys, and unsurprisingly failed.

Add cibuild.OnTailscaleCI, which keys off GITHUB_REPOSITORY_OWNER to
distinguish tailscale/tailscale's own GitHub Actions CI from arbitrary
downstream CI, and use it in TestTsgoRevInCacheKey.
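
The distinction can be sketched as below. This is not the cibuild package: the lowercase function names are hypothetical, and comparing the owner against the literal "tailscale" is an assumption about how the check is keyed.

```go
package main

import (
	"fmt"
	"os"
)

// onCI mirrors the broad check: any environment that sets CI=true.
func onCI(getenv func(string) string) bool {
	return getenv("CI") == "true"
}

// onTailscaleCI is the narrower check keyed off the repository owner.
// Assumption: the owner is compared against the literal "tailscale".
func onTailscaleCI(getenv func(string) string) bool {
	return getenv("GITHUB_REPOSITORY_OWNER") == "tailscale"
}

// envFrom builds a getenv function from a map, for demos and tests.
func envFrom(m map[string]string) func(string) string {
	return func(k string) string { return m[k] }
}

func main() {
	fmt.Println("on CI:", onCI(os.Getenv))
	fmt.Println("on Tailscale CI:", onTailscaleCI(os.Getenv))
}
```

Taking the getenv function as a parameter keeps the check testable without mutating the process environment.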

Fixes #19754

Change-Id: Id31cfe71903a235f1460dca1e2fdf334e3ba1ee5
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: License Updater <noreply+license-updater@tailscale.com>
…ls (#19757)

linuxRouter has two blocks (connmark rules and the CGNAT drop rule) that
gate on cfg.NetfilterMode, the requested config state. If
setNetfilterModeLocked fails, the requested mode was never applied, but
these blocks would still act as if it were, which can cause an error.

We now gate both blocks on r.netfilterMode, matching the pattern used by
SNAT, stateful, and loopback paths.

Fixes #19737

Change-Id: Ia6003a082db99c376e662132d725661afbac0ee9

Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Updates tailscale/corp#37904

Change-Id: I09e73b3248b9ddf86dafe33dfb621bd560f6596d
Signed-off-by: Alex Chan <alexc@tailscale.com>
Move the inline CSS and JS into separate files to be more friendly
to Content Security Policies. ServeHTTP is updated to serve these
assets from the '/static/' path.

Updates tailscale/corp#32398

Signed-off-by: Noel O'Brien <noel@tailscale.com>
RouteCheck, which checks that overlapping routers are reachable, is
enabled by default for both tailscaled and tsnet.

Updates #17366
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
The Engine watchdog wrapped every wgengine.Engine method call in a
goroutine with a 45s timeout and crashed the process on timeout. It
was added years ago to surface deadlocks during development, but the
underlying deadlocks have long since been fixed, and even when it did
fire it produced obscure stack traces (from inside the watchdog
goroutine, not the original caller) without buying much.

Audit of userspaceEngine's methods shows none have cyclic locking or
unbounded blocking now that ResetAndStop no longer loops waiting for
DERPs to drain (fa49009). The watchdog is dead weight; remove it
along with the TS_DEBUG_DISABLE_WATCHDOG escape hatch.

Updates #19759

Change-Id: Iba9d718fe1f8718a6631296e336b138c31b99ff1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Issue #19737 ran into a nil pointer dereference, the cause of which was fixed
by #19761. If we end up on this code path with a nil table again, we should
bubble that up as an error (which is logged by the health warning system)
rather than failing catastrophically.

Signed-off-by: Naman Sood <mail@nsood.in>
If the context given to DialContext has a shorter lifetime than the OS
TCP SYN timeout, and TCP SYNs are dropped from the path to the remote,
DialContext would never fall back to try IPv6 after IPv4.

Instead, use the normal happy eyeballs race if there is more than one
address. This does remove the implicit prioritization of IPv4 over IPv6
in cases where there is only a single IPv4 remote address.

Updates #13346

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
A data race in a package matters more than any individual test
result. Two related problems:

1. Where go test's race detector text ("WARNING: DATA RACE" plus
   the goroutine stack traces) lands in JSON output is timing-
   dependent: it can be attributed to a test that ends up reporting
   PASS (e.g. when the racing goroutines outlive the test that
   spawned them and TSan prints during a different test's window).
   testwrapper's main loop only flushes the logs of failed tests,
   so the race report ends up stuck in a passing test's buffer and
   is silently dropped. The race builders just see a bare
   "FAIL\nFAIL\tpkg\ttime".

2. If the failing test in such a package happens to be marked flaky,
   testwrapper retries it. That is the worst possible response to a
   race: the flaky test might not even be the racy code, and a
   second run without the racy goroutines could "succeed" while
   hiding the real bug.

Address both: scan every output line for the race detector's first-
line marker. Track whether the package observed a race at all, on
the pkgFinished testAttempt. When a race was seen, fold every per-
test log buffer into the package-level logs (so the full report
surfaces from the existing pkg-fail flush path), and drop any
flaky-test retry plans for that package so we fail immediately
instead of running another attempt.
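
The approach can be sketched as follows. This is a simplified model, not the testwrapper code: the `event` and `summary` types and the `summarize` function are hypothetical, and real test2json events carry more fields.

```go
package main

import (
	"fmt"
	"strings"
)

// raceMarker is the first line of a race detector report.
const raceMarker = "WARNING: DATA RACE"

// event is a simplified output event: a test name and one output line.
type event struct {
	Test   string
	Output string
}

type summary struct {
	SawRace bool
	PkgLogs []string // logs surfaced at the package level
	Retry   []string // flaky tests that will be retried
}

// summarize scans every output line for the race marker. If a race was
// seen anywhere in the package, all per-test buffers are folded into
// the package logs and no flaky retries are planned.
func summarize(events []event, flakyFailed []string) summary {
	var s summary
	perTest := make(map[string][]string)
	var order []string
	for _, ev := range events {
		if strings.Contains(ev.Output, raceMarker) {
			s.SawRace = true
		}
		if _, ok := perTest[ev.Test]; !ok {
			order = append(order, ev.Test)
		}
		perTest[ev.Test] = append(perTest[ev.Test], ev.Output)
	}
	if s.SawRace {
		for _, name := range order {
			s.PkgLogs = append(s.PkgLogs, perTest[name]...)
		}
		return s // fail immediately; drop retry plans
	}
	s.Retry = flakyFailed
	return s
}

func main() {
	events := []event{
		{"TestPasses", raceMarker},
		{"TestPasses", "goroutine stack trace"},
		{"TestFlaky", "flaky failure"},
	}
	fmt.Printf("%+v\n", summarize(events, []string{"TestFlaky"}))
}
```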

Two new tests:
- TestRaceSuppressesFlakyRetry verifies that a flaky test alongside
  a racy test does NOT get retried.
- TestRaceAttributedToPassingTest verifies that a race attributed by
  test2json to a passing test still surfaces in the output.

Also add a corpus of captured raw test binary outputs under
cmd/testwrapper/testdata/, with one subdirectory per scenario,
documenting the six representative shapes that go test -race can
emit (race in test body, race in goroutines that outlive a test,
race forced into a later test, race in TestMain post-m.Run, and a
parallel-tests split-attribution case via a "=== NAME" redirect
line). See its README.md for details.

Fixes #19603

Change-Id: Ifbfcd67fb3b1882c4907bd9cb2d68a8b5a91dd54
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
patrickod and others added 30 commits June 23, 2026 11:51
Pin govulncheck to resolve panics in the most recent version.

Updates #cleanup

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
The watchdog (ipn/ipnlocal/watchdog.go) was abusing PeerForIP with an
invalid netip.Addr as a way to acquire and release the engine's
internal locks for deadlock detection. This does the TODO to break it out
into its own method like all the other similarly named methods.

Splitting this out as a prerequisite for a follow-up rewrite of
PeerForIP itself; not having to preserve the lock-probe overload in
the new implementation keeps that follow-up smaller.

Updates #12542
Updates #cleanup

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I25cbffd11aeb65600d9128845404c4918ef88ead
I'm not keen on us having to deal with the bad side effects of the
autocrlf default, but alas, if it makes things easier.

Fixes #16175
Closes #16176

Signed-off-by: James Tucker <james@tailscale.com>
…ression

Otherwise we may never handshake a new peer relay server endpoint
around remote client restarts and/or disco key rotation.

Updates #20215

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Another baby step toward removing slices of peers from the engine.

getStatus iterated peerSequence (a key snapshot built in Reconfig
from cfg.Peers) and then asked wgdev for each peer's stats; peers
that weren't active in wgdev silently fell out. Iterate active wgdev
peers directly via RemoveMatchingPeers(returnFalse) instead.

Updates #12542

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I3abd348abc30db706db29b3a785179259e48abda
userspaceEngine.PeerForIP read from e.netMap.Peers and
e.lastCfgFull.Peers, both of which go stale when peers arrive via
netmap deltas (which skip Engine.SetNetworkMap and Engine.Reconfig).
Every PeerForIP caller (Engine.Ping, the TSMP disco-key handler,
pendopen diagnostics, tsdial.Dialer.UseNetstackForIP, and
LocalBackend.GetPeerEndpointChanges) would report "no matching peer"
for freshly-added peers.

Fix it the same way SetPeerByIPPacketFunc fixed the outbound packet
hot path: have LocalBackend install a callback that reads the live
nodeBackend. nb.NodeByAddr is built from both SelfNode and Peers
(updateNodeByAddrLocked), so a single lookup covers the common case
with IsSelf set when the matched node ID is SelfNode's. The subnet-
route / exit-node-default-route slow path goes through a new
Engine.PeerKeyForIP that exposes the engine's AllowedIPs BART table
(the same table the outbound packet hot path already consults, with
exit-node selection honored), and resolves the matched key back to a
NodeView via the live nodeBackend.

Updates #12542

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I0d4b0d8997c8e796b7367c46b49b61d4fdc717b0
The logging added in 12188c0 was generating excessive spam in
backend logs. This may have been exacerbated by
tailscale GUI<->backend architecture on certain platforms like
Windows, where the GUI polls for exit node suggestions rather
than listening on the IPN bus.

Change this to log on error or if the current suggestion differs
from the previous suggestion.
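
A minimal sketch of that logging rule, assuming a hypothetical `suggestionLogger` type (the real code's names and log wording are not shown here):

```go
package main

import (
	"errors"
	"fmt"
)

// errLookup is a hypothetical error used only for the demo.
var errLookup = errors.New("lookup failed")

// suggestionLogger remembers the last suggestion it logged.
type suggestionLogger struct {
	last  string
	lines []string // stand-in for the backend log
}

// observe logs on error, or when the suggestion differs from the
// previous one. It reports whether it logged anything.
func (l *suggestionLogger) observe(suggestion string, err error) bool {
	switch {
	case err != nil:
		l.lines = append(l.lines, "suggest exit node: "+err.Error())
		return true
	case suggestion != l.last:
		l.lines = append(l.lines,
			fmt.Sprintf("suggestion changed: %q -> %q", l.last, suggestion))
		l.last = suggestion
		return true
	}
	return false
}

func main() {
	var l suggestionLogger
	for _, s := range []string{"node-a", "node-a", "node-a", "node-b"} {
		l.observe(s, nil) // repeated polls with the same answer stay quiet
	}
	l.observe("", errLookup)
	fmt.Println(len(l.lines), "log lines for 5 polls")
}
```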

Updates tailscale/corp#43691
Updates #20194

Signed-off-by: Amal Bansode <amal@tailscale.com>
Most of our flag descriptions start with a lowercase word (except proper
nouns); fix the handful which do not.

Fixes #20230

Change-Id: I00aaac171254c050ad0b75c2cf8746590c8c4d8f
Signed-off-by: Alex Chan <alexc@tailscale.com>
Add a retry loop with BatchMode=yes to absorb the race window
between Env.Start() returning (when tta reports the tailscale
backend as Running) and cloud-init finishing the user/SSH-key
setup. In CI, the second VM's tta agent has been observed
connecting only a few hundred milliseconds before the test SSHes
in, which is inside the window where /root/.ssh/authorized_keys
hasn't fully landed yet. SSH key auth then fails and ssh(1) falls
back to interactive password prompts (3x), wasting time and
producing a confusing "Permission denied (publickey,password)"
error.

BatchMode=yes makes the client fail fast on auth failure instead
of prompting, and the retry loop handles SSH transport-level
errors (exit code 255) for up to 30 seconds with 500ms backoff.
Remote command non-zero exits still pass through unchanged.
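
The retry policy can be sketched like this. It is an illustration, not the vmtest code: the command runner is injected as a function so the loop can run without ssh, and the `retryPolicy`, `runWithRetry`, and `failNTimes` names are hypothetical. Only `sshPolicy` reflects the 30-second and 500ms figures above; `fastPolicy` exists just to keep the demo quick.

```go
package main

import (
	"fmt"
	"time"
)

// sshTransportFailure is the exit code ssh(1) uses for its own errors,
// as opposed to the remote command's exit status.
const sshTransportFailure = 255

type retryPolicy struct {
	MaxWait time.Duration
	Backoff time.Duration
}

var (
	sshPolicy  = retryPolicy{MaxWait: 30 * time.Second, Backoff: 500 * time.Millisecond}
	fastPolicy = retryPolicy{MaxWait: 50 * time.Millisecond, Backoff: time.Millisecond}
)

// runWithRetry calls run until it returns something other than a
// transport failure or the deadline passes. Remote non-zero exit codes
// pass through unchanged on the first attempt that produces them.
func runWithRetry(run func() int, p retryPolicy) (code, attempts int) {
	deadline := time.Now().Add(p.MaxWait)
	for {
		attempts++
		code = run()
		if code != sshTransportFailure || time.Now().After(deadline) {
			return code, attempts
		}
		time.Sleep(p.Backoff)
	}
}

// failNTimes returns a fake runner that reports a transport failure n
// times and then returns final.
func failNTimes(n, final int) func() int {
	calls := 0
	return func() int {
		calls++
		if calls <= n {
			return sshTransportFailure
		}
		return final
	}
}

func main() {
	code, attempts := runWithRetry(failNTimes(2, 0), fastPolicy)
	fmt.Println("exit code:", code, "attempts:", attempts)
	fmt.Println("production-style policy:", sshPolicy.MaxWait, sshPolicy.Backoff)
}
```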

Fixes #20228

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I17f7422e9e27bf7b995f505c0184cbb2b230ed81
Env.Start boots all VM nodes in parallel; each calls
createCloudInitISO -> ensureDebugSSHKey concurrently. When
/tmp/vmtest_key doesn't yet exist, the first goroutine creates it
with os.WriteFile, which opens with O_CREATE|O_TRUNC and briefly
leaves the file existing-but-empty between the open and the
subsequent write. A concurrent goroutine that hits that window
sees ReadFile succeed with zero bytes, then fails ssh.ParsePrivateKey
with "ssh: no key found", causing boot to fail with:

  boot: creating cloud-init ISO: parse /tmp/vmtest_key: ssh: no key found

Observed in CI on TestSiteToSite (3 nodes). Wrap the function in
a package-level Mutex so the first caller fully writes the key
before any other caller reads it.
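
The fix can be illustrated with the sketch below, which serializes a read-or-create helper behind a package-level mutex. The `ensureKey` and `concurrentEnsure` functions are hypothetical stand-ins, and the "key" is just a byte string rather than a real SSH key.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sync"
)

// keyMu serializes ensureKey so that no caller can observe the file
// between its creation and the write of its contents.
var keyMu sync.Mutex

// ensureKey returns the key stored at path, creating it with generate
// if the file does not exist yet.
func ensureKey(path string, generate func() []byte) ([]byte, error) {
	keyMu.Lock()
	defer keyMu.Unlock()
	b, err := os.ReadFile(path)
	if err == nil {
		return b, nil
	}
	if !errors.Is(err, fs.ErrNotExist) {
		return nil, err
	}
	b = generate()
	if err := os.WriteFile(path, b, 0o600); err != nil {
		return nil, err
	}
	return b, nil
}

// concurrentEnsure calls ensureKey from n goroutines against a fresh
// temp file and returns how many distinct keys the callers saw.
func concurrentEnsure(n int) (int, error) {
	dir, err := os.MkdirTemp("", "keydemo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "key")

	results := make([]string, n)
	errs := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b, err := ensureKey(path, func() []byte {
				return []byte(fmt.Sprintf("key-from-%d", i))
			})
			results[i], errs[i] = string(b), err
		}(i)
	}
	wg.Wait()

	seen := make(map[string]bool)
	for i := range results {
		if errs[i] != nil {
			return 0, errs[i]
		}
		seen[results[i]] = true
	}
	return len(seen), nil
}

func main() {
	distinct, err := concurrentEnsure(8)
	fmt.Println("distinct keys seen:", distinct, "err:", err)
}
```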

Updates #20228

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Ie6399dcba0c397bb8041931d3de1c6063a11c568
tsdial.Dialer.SetNetMap rebuilt an O(n peers) map of MagicDNS names on
every netmap change. As we move toward per-peer incremental deltas,
this becomes quadratic. This removes it and replaces it with
SetResolveMagicDNS, a callback into LocalBackend that looks up
hostnames from nodeBackend's new nodeByName index (populated alongside
nodeByAddr/nodeByKey on both full and delta paths). The index stores
both FQDNs and short names as keys.
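
An index of that shape might look like the sketch below. It is not the nodeBackend code: the `node` type and `buildNameIndex` function are hypothetical, and lowercasing, stripping the trailing dot, and letting a later node win a short-name collision are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a minimal stand-in with an ID and a fully qualified name.
type node struct {
	ID   int
	Name string // e.g. "laptop.example.ts.net."
}

// buildNameIndex maps both the FQDN and the short name (first label)
// of every node to its ID.
func buildNameIndex(nodes []node) map[string]int {
	idx := make(map[string]int, 2*len(nodes))
	for _, n := range nodes {
		fqdn := strings.ToLower(strings.TrimSuffix(n.Name, "."))
		if fqdn == "" {
			continue
		}
		idx[fqdn] = n.ID
		short, _, _ := strings.Cut(fqdn, ".")
		idx[short] = n.ID // assumption: a later node wins on collision
	}
	return idx
}

func main() {
	idx := buildNameIndex([]node{
		{1, "Laptop.example.ts.net."},
		{2, "server.example.ts.net."},
	})
	fmt.Println(idx["laptop"], idx["server.example.ts.net"])
}
```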

This is the same treatment applied to netlog (8f21045), wglog
(988b090), and drive (1d69894): stop pushing *netmap.NetworkMap
into subsystems and instead have them pull from LocalBackend's live
data via callbacks.

Updates #12542

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I24557ab0c8a27636e08e4779bcfd3ec633db0a78
Add zizmor GitHub Actions linting on changes to .github/workflows.

Updates tailscale/corp#28760

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
…20199)

Router.Set reconciled tailscale0's addresses only against the in-memory
r.addrs map, which starts empty each run. After a restart the kernel can
still hold the addresses a previous profile put on tailscale0. With no
record of them, Set never removed them, leaving two tailnets' CGNAT
addresses on the interface. That broke connectivity, because the kernel
could source traffic from the wrong IP.

Fix this by scanning the addresses actually on the interface and, after
reconciling the desired set, removing any in Tailscale's CGNAT/ULA ranges
that aren't in the config. Non-Tailscale addresses are never touched,
and IPv6 addresses are skipped when IPv6 is unavailable, since delAddress
no-ops there. To avoid a netlink dump on every Set, the scan runs only on
the first Set and when the desired address set changes.
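
The selection rule can be sketched as below. This is a simplified model, not the router code: `staleAddrs` is a hypothetical helper that works on prefix strings, and the netlink scan itself is out of scope. The two ranges are the CGNAT block 100.64.0.0/10 and Tailscale's ULA prefix fd7a:115c:a1e0::/48.

```go
package main

import (
	"fmt"
	"net/netip"
)

var (
	cgnatRange = netip.MustParsePrefix("100.64.0.0/10")
	ulaRange   = netip.MustParsePrefix("fd7a:115c:a1e0::/48")
)

// staleAddrs returns the interface addresses that should be removed:
// those inside the Tailscale ranges that are not in the desired set.
// Addresses outside those ranges are never touched, and IPv6 addresses
// are skipped when IPv6 is unavailable.
func staleAddrs(onIface, desired []string, v6Available bool) ([]string, error) {
	want := make(map[netip.Prefix]bool, len(desired))
	for _, s := range desired {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, err
		}
		want[p] = true
	}
	var stale []string
	for _, s := range onIface {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, err
		}
		a := p.Addr()
		switch {
		case want[p]:
			// still configured; keep
		case !cgnatRange.Contains(a) && !ulaRange.Contains(a):
			// not a Tailscale address; never touch
		case a.Is6() && !v6Available:
			// deleting would be a no-op; skip
		default:
			stale = append(stale, p.String())
		}
	}
	return stale, nil
}

func main() {
	onIface := []string{
		"100.64.0.1/32", "100.99.0.2/32",
		"192.168.1.5/24", "fd7a:115c:a1e0::1/128",
	}
	stale, err := staleAddrs(onIface, []string{"100.64.0.1/32"}, true)
	fmt.Println(stale, err)
}
```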

This also needs the iptables DelLoopbackRule to tolerate a missing rule:
an orphan left by a previous instance never went through AddLoopbackRule
here, and iptables (unlike nftables) errors when deleting an absent
rule, which would otherwise block the address delete.

Fixes #19974

Signed-off-by: Brendan Creane <bcreane@gmail.com>
The primary purpose is that return packets from the target app get
properly SNATed on connectors with --tun=userspace-networking, matching
the NAT behavior in the kernel tun path.

This is also necessary but not sufficient for clients of connectors in
userspace networking mode. The hook will DNAT MagicIPs, but won't
actually be sent MagicIPs until conn25 app connector DNS works with
userspace networking.

Fixes tailscale/corp#43201

Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
The engine only used the netmap to look up self addresses and the
self node's primary routes, so pass it the self node directly
rather than the whole netmap.

Updates #12542

Change-Id: I13c0028eed65d2177baf4cf6c449f5e441845a18
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
setWebClientAtomicBoolLocked and setDebugLogsByCapabilityLocked
each only need the node capabilities to decide what to do, so
take a set.Set[tailcfg.NodeCapability] directly as part of
getting rid of netmap.NetworkMap.

Updates #12542

Change-Id: If7c30b6354fd42dfe82ed6d2e2fe3439de401315
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
No code changes needed; this is to rule out cmpver as the source of any
version-comparison issues.

Updates #20238

Change-Id: Ib8765dd042e994549d9e2c03859a5f769a856704
Signed-off-by: Alex Chan <alexc@tailscale.com>
364b952 switched containerboot to partial netmap fetching, but
stopped refreshing `DNS.ExtraRecords`, so Tailscale Services created
after pod boot were invisible to resolveTailnetFQDN. To fix this, we watch
for SelfChange IPN bus notifications and refetch dns-config via LocalAPI
to get a fresh set of `DNS.ExtraRecords`.

Fixes #20233

Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
… receive extensions" (#20257)

* Revert "control/controlclient: continue map poll during key expiry to receive extensions"

This reverts commit 6a822dc. This commit
has caused test failures in the corp repo by unexpectedly changing the login
behaviour when nodes have a valid node key.

Updates tailscale/corp#43705
Updates #19326

Signed-off-by: Alex Chan <alexc@tailscale.com>

* Revert "tsnet: test key extension after server restart"

This reverts commit 3172013. This test
relies on changes in 3172013, which is
also being reverted because it causes test failures in corp.

Updates tailscale/corp#43705
Updates #19326

Signed-off-by: Alex Chan <alexc@tailscale.com>

---------

Signed-off-by: Alex Chan <alexc@tailscale.com>
…20169)

This patch adds a new `client-side-reachability-routecheck` node
attribute to allow admins to selectively enable background routecheck
probing on trial nodes. The current implementation is still
experimental.

It adds the routecheck.IsEnabled helper to check for the new
`client-side-reachability-routecheck` node attribute alongside the
existing `client-side-reachability` node attribute in this node’s self
capabilities. This allows administrators to turn this feature on and off
by editing the policy file.

It adds the `TS_DEBUG_FORCE_CLIENT_SIDE_REACHABILITY_ROUTECHECK`
environment variable which can be set to override the policy file.
When set to `true`, it forcibly enables this feature. And when set to
`false`, it forcibly disables it.
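
The override can be sketched as a three-way decision. This is an illustration rather than the routecheck.IsEnabled implementation: `routecheckEnabled` is a hypothetical function, the policy result is passed in as a single boolean, and falling back to policy for unset or unrecognized values is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// envName is the debug override named in the description above.
const envName = "TS_DEBUG_FORCE_CLIENT_SIDE_REACHABILITY_ROUTECHECK"

// routecheckEnabled lets the environment value override the policy:
// "true" forces the feature on and "false" forces it off.
// Assumption: any other value, including unset, defers to policy.
func routecheckEnabled(envVal string, policyEnabled bool) bool {
	switch envVal {
	case "true":
		return true
	case "false":
		return false
	default:
		return policyEnabled
	}
}

func main() {
	v := os.Getenv(envName)
	fmt.Println("enabled with policy on: ", routecheckEnabled(v, true))
	fmt.Println("enabled with policy off:", routecheckEnabled(v, false))
}
```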

Updates #17366
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
Occasionally CI jobs will flake because downloading from GitHub fails.
Allow retrying up to 3 times to reduce CI flakiness.

Updates #cleanup

Change-Id: Ib019e89ac74b81d78f71a40099b20ff60014a81f
Signed-off-by: Alex Chan <alexc@tailscale.com>
…err (#19968)

On optimistic lock error, requeue the event after a short duration.

Resolves a case where a failure to acquire an optimistic lock on the
dnsrecords configmap will cause the operator to drop a reconcile event
and leave the configmap in an undesirable state.

Updates #19946

Signed-off-by: Alex Freestone <freestone.alex@gmail.com>
updates tailscale/corp#44019

WebClient is very useful for remote management
on tvOS (which cannot do ssh). Let's include it there.
Minimal corresponding tailscale/corp changes to follow
to add UI to set the required prefs.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
We stopped reading this field nearly two years ago, with a TODO comment
to remove it sometime in 2025.

It is now 2026.

Updates #12058

Change-Id: I8ddf1c2e4c3c428e8d45a6491d3899368ec52c30
Signed-off-by: Alex Chan <alexc@tailscale.com>
…nsion

The ACME serialization mutex (acmeMu) was a package-level global, and
several ACME-related fields lived on LocalBackend even though the
cert code is conditional and not linked into every binary. With
multiple tsnet.Servers in one process (each its own LocalBackend),
a process-wide acmeMu also serialized unrelated backends.

Introduce a new feature/acme extension that owns the per-LocalBackend
ACME/cert state in an ipnlocal.CertState value:

  - acmeMu, renewMu, renewCertAt (previously package globals)
  - pendingACMETLSALPNCerts, pendingCertDomains{,Mu},
    getCertForTest, certRefreshCancel (previously LocalBackend
    fields, only meaningful when ACME was compiled in)

ipnlocal/cert.go now reaches the state through b.certState(), which
is routed by a feature.Hook installed at init by feature/acme. The
CertState type lives in ipnlocal so cert.go can access its fields
directly without a method explosion; the extension in feature/acme
constructs and owns it.
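
A simplified sketch of the shape described above, using only the standard library. All types here (`certState`, `backend`, `acmeExtension`, `certStateHook`) are hypothetical stand-ins for ipnlocal.CertState, LocalBackend, the feature/acme extension, and feature.Hook. The map keyed by backend is an assumption made to keep the example short, not how the extension stores its state.

```go
package main

import (
	"fmt"
	"sync"
)

// certState stands in for ipnlocal.CertState: state that used to be
// package-global now lives in one value per backend.
type certState struct {
	acmeMu  sync.Mutex
	renewMu sync.Mutex
}

// backend stands in for LocalBackend.
type backend struct{ name string }

// certStateHook stands in for the feature.Hook. It stays nil in
// builds where the ACME feature is not linked in.
var certStateHook func(*backend) *certState

// acmeExtension owns the per-backend state.
type acmeExtension struct {
	mu     sync.Mutex
	states map[*backend]*certState
}

// stateFor returns the state for b, creating it on first use.
func (e *acmeExtension) stateFor(b *backend) *certState {
	e.mu.Lock()
	defer e.mu.Unlock()
	st, ok := e.states[b]
	if !ok {
		st = &certState{}
		e.states[b] = st
	}
	return st
}

// installACME wires the hook; the real package does this at init.
func installACME() {
	ext := &acmeExtension{states: make(map[*backend]*certState)}
	certStateHook = ext.stateFor
}

// certState reaches the state through the hook, or returns nil when
// the feature is absent.
func (b *backend) certState() *certState {
	if certStateHook == nil {
		return nil
	}
	return certStateHook(b)
}

func main() {
	installACME()
	a, b := &backend{name: "a"}, &backend{name: "b"}

	a.certState().acmeMu.Lock()
	// b's mutex is independent, so this does not block on a.
	free := b.certState().acmeMu.TryLock()
	fmt.Println("b can lock while a holds its lock:", free)
	if free {
		b.certState().acmeMu.Unlock()
	}
	a.certState().acmeMu.Unlock()
}
```

With a package-level mutex, the second lock attempt would have to wait for the first backend; with per-backend state the two proceed independently.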

This is a baby step. The end goal is for the entire cert/ACME code
to live in feature/acme, with ipnlocal only retaining whatever thin
hooks the rest of LocalBackend needs to call into it. The current
split (CertState and most of cert.go in ipnlocal, extension wrapper
in feature/acme) is a deliberately temporary middle ground that
keeps this PR small while making the next moves mechanical.

The package is named feature/acme to match the existing HasACME /
ts_omit_acme naming. condregister/maybe_acme.go wires it in for
non-js builds.

Updates #12614
Updates #20248
Updates #20249

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I520909f24ad11a9622ef33c2290fe36ad44d6f71
GitHub's built-in CODEOWNERS only supports a hard "block until a team
member reviews" rule, with no way to leave an audit trail when the
requirement is intentionally bypassed. Move review enforcement to
palantir/policy-bot (https://github.com/palantir/policy-bot) running
at https://policybot.corp.ts.net, which lets us express the same
tailcfg/ -> control-protocol-owners rule plus an explicit override:
any other @tailscale/dev member can post

    policybot-override: <reason>

as a PR comment and that comment counts as their approval, with the
reason recorded in the PR conversation as a permanent audit trail.

CODEOWNERS is kept as a one-screen comment so anyone landing on it
expecting the old behavior is directed to .policy.yml.

Updates tailscale/corp#13972

Change-Id: I2dc3619c498d4c4a6decae29aa123f6d67905eed
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The override comment didn't work as expected.
(I'll be updating the policytest package to handle this)

Updates tailscale/corp#13972

Change-Id: Ic5c16eed09c8cb5fa8dab37d43cf05f8dfa75d49
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
prometheus/common v0.66/v0.67 introduced a mandatory
model.ValidationScheme on expfmt.TextParser as part of
prepping for UTF-8 metric/label names in Prometheus 3.0. The
zero value is intentionally UnsetValidation, which panics on
the first call to IsValidMetricName / IsValidLabelName with

  Invalid name validation scheme requested: unset

so the long-standing "var parser expfmt.TextParser" pattern
crashes at runtime. Several big downstreams have hit the same
sharp edge:

  thanos-io/thanos#8823
  grafana/loki#21401

Switch our two callers (parseMetrics in tsnet's
TestUserMetricsByteCounters and the client-metrics scraper in
tstest/natlab/vmtest) to the new expfmt.NewTextParser
constructor with model.LegacyValidation. LegacyValidation
matches the classic ASCII metric/label naming rules that
tailscaled's exporter uses today; if and when we ever emit a
metric with a UTF-8 name, we can revisit.

Goes to v0.69.0 (the latest at the time of writing) rather
than v0.67.5 so we pick up the unrelated security fixes for
cross-host redirects.

Done in advance so a follow-up change can pull in
github.com/tailscale/policybottest (which depends on
palantir/policy-bot, which transitively requires
prometheus/common at v0.67+) without dragging this debugging
into that PR.

Updates tailscale/corp#13972

Change-Id: I4b37db9ad3bebef1a32d9020bf6f8790bab25336
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a .policy-tests.yml file with tests exercising the policy
that was just landed: the tailcfg/ control-protocol-owners gate,
the "policybot-override:" comment escape hatch (including
defaults-regression guards so the override rule does not
silently accept a normal review or a 👍 comment), and the
always-on "any tailscale/dev review" baseline.

Updates tailscale/corp#13972

Change-Id: I42afb06b0771658c803512cb5de4701450c8a704
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>