For a 3D mesh with many tets and prisms (and thus a dual mesh with many neighbors per point), LU-SGS costs about as much as 2 matrix-vector products.
LU-SGS, as the name implies, does one pass over the lower entries of the matrix and then one over the upper entries.
So one would naively expect it to cost about the same as one product.
I think the issue is that, with so many neighbors per row, the hardware prefetcher wrongly infers that the entire matrix is being streamed, so the CPU reads the whole thing from memory on each pass.
The ideal solution would probably be to switch to the LDU format (used by OpenFOAM), which stores the lower, diagonal, and upper coefficients in separate arrays. But first I might try to confuse the prefetcher so it doesn't read the entire matrix, to see if that gets close to the "speed of light" for this (1 product).
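
To make the "1 product" estimate concrete, here is a minimal sketch of LU-SGS on a scalar CSR matrix that counts how many matrix values each sweep actually needs. It is only an illustration, not the solver's real code: the function name `lu_sgs_csr`, the CSR array names, and the tiny tridiagonal test matrix are all made up for this example, and a real CFD matrix would hold small dense blocks rather than scalars.

```python
def lu_sgs_csr(row_ptr, col_idx, vals, b):
    """One LU-SGS application: solve (D+L) D^-1 (D+U) x = b.

    Hypothetical helper for illustration. Returns (x, reads), where
    reads is the number of matrix values the two sweeps needed.
    """
    n = len(b)
    diag = [0.0] * n
    reads = 0

    # Forward sweep: (D+L) y = b. Only entries with j <= i are needed,
    # but the loop still walks the full CSR row to find them.
    y = [0.0] * n
    for i in range(n):
        s = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                s -= vals[k] * y[j]
                reads += 1
            elif j == i:
                diag[i] = vals[k]
                reads += 1
        y[i] = s / diag[i]

    # Backward sweep: (D+U) x = D y. Only entries with j > i are needed
    # (the diagonal was cached during the forward sweep).
    x = list(y)
    for i in reversed(range(n)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j > i:
                s += vals[k] * x[j]
                reads += 1
        x[i] = y[i] - s / diag[i]

    return x, reads


# Tiny example matrix (an assumption, chosen only to be easy to check):
# [[ 4, -1,  0],
#  [-1,  4, -1],
#  [ 0, -1,  4]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [1.0, 2.0, 3.0]

x, reads = lu_sgs_csr(row_ptr, col_idx, vals, b)
print("values needed by both sweeps:", reads)
print("values read by one mat-vec:  ", len(vals))
```

The two sweeps together need each stored value exactly once, which is the same count as a single matrix-vector product. In CSR, though, the lower and upper entries of a row sit next to each other in `vals`, so anything that fetches the full row in both sweeps doubles the memory traffic. With separate lower and upper arrays, each sweep would stream only the array it uses.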