Processing units that reside on the same CPU core typically communicate through a shared L1 cache, with a latency of 1 to 2 cycles.
Processing units that reside on the same chip but on different CPU cores typically communicate through a shared L2 cache, with a latency of 10 to 20 cycles.
Processing units that reside on separate chips communicate either by sharing memory or through a cache-coherence protocol, both with an average latency of hundreds of cycles.
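
The three cases above amount to a lookup on where two processing units sit relative to each other. The following is a minimal sketch of that lookup, not a measurement tool. The names `ProcessingUnit`, `Tier`, `LATENCY_CYCLES`, and `communication_tier` are hypothetical and introduced only for illustration, as is the (chip, core, thread) numbering scheme. The latency figures are the ones quoted in the text; the cross-chip figure is kept as the string "hundreds" because the text gives no tighter bound.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Communication path between two processing units, as described in the text."""
    SAME_CORE = "shared L1 cache"
    SAME_CHIP = "shared L2 cache"
    CROSS_CHIP = "shared memory or cache-coherence protocol"


# Typical latencies in cycles, copied from the text (illustrative, not measured).
# The cross-chip entry is only given as "hundreds", so it stays a string.
LATENCY_CYCLES = {
    Tier.SAME_CORE: "1-2",
    Tier.SAME_CHIP: "10-20",
    Tier.CROSS_CHIP: "hundreds",
}


@dataclass(frozen=True)
class ProcessingUnit:
    chip: int    # hypothetical chip (socket) index
    core: int    # hypothetical core index within the chip
    thread: int  # hypothetical hardware-thread index within the core


def communication_tier(a: ProcessingUnit, b: ProcessingUnit) -> Tier:
    """Classify the path two processing units would typically use to communicate."""
    if a.chip != b.chip:
        return Tier.CROSS_CHIP
    if a.core != b.core:
        return Tier.SAME_CHIP
    return Tier.SAME_CORE


if __name__ == "__main__":
    u0 = ProcessingUnit(chip=0, core=0, thread=0)
    for other in (
        ProcessingUnit(chip=0, core=0, thread=1),
        ProcessingUnit(chip=0, core=1, thread=0),
        ProcessingUnit(chip=1, core=0, thread=0),
    ):
        tier = communication_tier(u0, other)
        print(f"{u0} <-> {other}: {tier.value}, ~{LATENCY_CYCLES[tier]} cycles")
```

The chip comparison comes first because core indices are assumed to be local to a chip: two units with the same core index on different chips do not share a core.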