FACULTY OF INFORMATICS, Masaryk University
PA039: Supercomputer Architecture and Intensive Computing
Parallel computers
Luděk Matýska, Spring 2024

Parallel computers
■ Small-scale multiprocessing
  ■ 2 to several hundreds of cores (tens of processors)
  ■ mostly SMP (shared memory systems)
■ Large-scale multiprocessing
  ■ from hundreds to millions of cores (processors)
  ■ most often distributed memory

Parallel computers (II)
■ Architecture
  ■ Single Instruction Multiple Data (SIMD)
  ■ Multiple Instruction Multiple Data (MIMD)
■ Programming models
  ■ Single Program Multiple Data (SPMD)
  ■ Multiple Programs Multiple Data (MPMD)
■ Concurrent, Parallel, Distributed
  ■ Concurrent: a single program with multiple tasks in progress
  ■ Parallel: a single program with multiple tasks closely cooperating
  ■ Distributed: several programs (loosely) cooperating

Architecture – SIMD
■ All processors synchronized
■ All performing the same instruction at any given time
■ Analogy of vector processors
■ Simple processors
■ Simple programming model
  ■ but difficult programming (esp. for simple scalar operations)

Vector processor
■ Processor able to work directly with vectors of data
  ■ a vector is a data type of the underlying instruction set
  ■ Cray even introduced vector registers (otherwise direct work with the memory)
■ Vector Load/Store
  ■ "composing" a vector from different memory words/areas
  ■ a vector of registers that keep the addresses of memory words with the actual data
  ■ "localizes" data for further processing
  ■ in practice the scatter/gather operations directly over the main memory

Vector processor
■ Memory subsystem
  ■ by default does not work with caches
  ■ interleaved (banked) memory
    ■ several concurrent operations over the main memory
  ■ it has higher throughput than the use of caches, esp. when the data are "randomly" scattered over the main memory (random access to data)
[Figure: vector processor block diagram – scalar CPU with instruction cache and a 256 × 32-bit scalar data bus; vector unit with a vector register file (VLR, VMR), FPU add and multiply units, and a Vector Memory Interface (VMI) connected to two sets of eight data memory banks]

Architecture – MIMD
■ Fully asynchronous system
■ Individual processors fully independent
■ No special production needed (off-the-shelf)
■ Advantages
  ■ Higher flexibility
  ■ At least in theory higher efficiency
■ Disadvantages
  ■ Explicit synchronization
  ■ Difficult programming (easy to introduce race conditions)

Communication models
■ Shared Memory Architecture
■ Message Passing

Shared Memory Architecture
■ Memory separated from processors
■ Uniform access to the memory
■ Bus as the easiest interconnect
■ Cheap interprocess/thread communication
■ Complex overlap of processing and communication (active waiting)

Message Passing
■ Each processor "visible"
■ Each processor has its own memory
■ Explicit communication – message passing
■ High communication cost (exchange of data)
■ Easier overlap of processing and communication

Hybrid systems
■ Non-uniform memory access architecture (NUMA)
■ Cache-only memory access architecture (COMA)
■ Distributed shared memory (DSM)

Non-uniform memory access
■ Access to different memory addresses takes different time
■ Provides for higher scalability
■ Potentially lower throughput
■ Cache memory coherence problem
  ■ ccNUMA – cache coherent NUMA

Cache only memory access
■ NUMA, but resembling cache memory behavior
■ Data move close to the processors that use them
■ No hierarchy like with caches
■ The system must ensure that the last copy remains (it cannot be deleted)
■ Experimental
  ■ Software-based hybrid NUMA-COMA implementation provided by ScaleMP
  ■ A shared-memory multiprocessor system on top of a cluster of commodity servers

Distributed shared memory
■ Distributed system – cluster
  ■ Local memory on each node
  ■ Remote memory on other nodes
■ "Fiction" of a single extended memory
■ Hardware solution
  ■ Usually based on virtual memory principles (pages moved between nodes)
  ■ Transparent
■ Software solution
  ■ Library
  ■ Non-transparent, the programmer must modify the program (calls to the proper APIs from the library)
■ ScaleMP hybrid

Cache Memory Coherence
■ Reasons for a cache miss:
  ■ Compulsory miss: first access to data
  ■ Capacity miss: insufficient capacity
  ■ Conflict miss: different memory areas mapped to the same cache row
  ■ Coherence miss: different data in different caches
■ Multiprocessors are exposed to the last case
■ But it can happen in a single-processor case, too (how?)
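As a hedged illustration of coherence misses (and of the false sharing mentioned on the following slides), the C/pthreads sketch below is ours, not from the lecture; it assumes 64-byte cache lines and compilation with -pthread. Two threads update counters that share one cache line, so every write invalidates the line in the other core's cache; padding the counters apart removes the effect.

    /* Sketch (illustrative, assumes 64-byte cache lines): coherence misses
     * caused by false sharing between two threads. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    struct { long a; long b; } close_together;            /* a and b share a cache line  */
    struct { long a; char pad[64]; long b; } far_apart;   /* a and b in different lines  */

    static void *bump(void *p)            /* increment one counter ITERS times */
    {
        long *x = p;
        for (long i = 0; i < ITERS; i++)
            (*x)++;
        return NULL;
    }

    static void run(long *a, long *b)     /* run two incrementing threads to completion */
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, a);
        pthread_create(&t2, NULL, bump, b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
    }

    int main(void)
    {
        run(&close_together.a, &close_together.b);  /* slow: constant coherence misses */
        run(&far_apart.a, &far_apart.b);            /* fast: lines never ping-pong     */
        printf("%ld %ld\n", close_together.a + close_together.b,
                            far_apart.a + far_apart.b);
        return 0;
    }

Timing the two run() calls (e.g. with clock_gettime) typically shows the padded variant running several times faster on a multicore machine, even though both perform the same number of increments.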
Invalidation
■ Reaction to a content change in a remote (cache) memory
■ The row in the local ("snooping") cache memory is invalidated
■ If the same row is needed later, it is retrieved from the memory (again)

Update
■ The cache memory row is updated immediately
■ If data are needed (again), they are already in the cache
■ Drawbacks
  ■ False sharing
    ■ a row includes more words
  ■ High load on the bus (broadcasting interconnect)
■ Invalidation and Update are equivalent performance-wise
  ■ unless a specific memory access pattern is used

Coherence Cache Miss solutions
■ Cache memory must be aware of a change elsewhere
■ Broadcast based protocols
■ Directory based protocols

Snoopy cache
■ Broadcast based protocol
■ Communication network with a "natural" broadcast
■ Each processor follows/watches all memory accesses

Directory based protocols
■ Snoopy protocol is based on broadcast
  ■ Not usable for more complex interconnect networks
  ■ Not scalable
■ Solution: reduction of actively "touched" caches – directories
  ■ Tag at each memory block
  ■ A cache memory with a copy of such a block explicitly references the tag
  ■ Special exclusivity tag (writing)

Directory based protocols
■ Three basic schemas
  ■ Fully mapped directories
  ■ Limited directories (partially mapped)
  ■ Chained directories
■ We will compare them based on the following features
  ■ Size of the additional memory needed
  ■ Number of necessary instructions/steps (latency introduced)

Fully mapped directories
■ Each memory block is able to directly reference all caches (processors) simultaneously
■ Bit vector of copies
  ■ If a bit is set, the corresponding cache keeps a copy of the data block
■ Exclusivity tag
  ■ One per block
  ■ Writing can be performed on one processor (one cache) only
■ Additional tags for each block in each cache
  ■ Validity tag
  ■ Exclusivity tag

Limited directories
■ Full directories are rather expensive (long bit vectors)
  ■ Additional memory needed: P·M/B bits
    ■ P – number of cache memories (processors)
    ■ M – size of the main memory
    ■ B – block size
■ The cumulative capacity of all caches is usually smaller than the size of the main memory
■ Data blocks are usually not extensively shared
  ■ Most directory entries contain zeros
■ Solution: use of direct references
  ■ However, one bit per cache is not enough

Limited directories
■ Set of pointers to the caches
■ Dynamic allocation as needed
■ Features
  ■ Number of bits per pointer: log₂ P
  ■ Number of pointers in the pointer pool: k
  ■ More memory efficient than a fully mapped directory if k < P / log₂ P
  ■ Invalidation information is sent only to the caches keeping a copy of the changed data
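A hedged worked comparison of the two directory overheads above (the concrete value of P is illustrative, not from the slides):

    % Per-block directory overhead (sketch)
    %   fully mapped:  P bits (one presence bit per cache)
    %   limited:       k \cdot \log_2 P bits (k pointers of \log_2 P bits each)
    \[
        k \log_2 P < P
        \quad\Longleftrightarrow\quad
        k < \frac{P}{\log_2 P}
    \]
    % Example: with P = 256 caches, \log_2 P = 8, so a limited directory with
    % fewer than 32 pointers per block needs less memory than the full bit vector.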
Overflow
■ If the pointer pool is exhausted
  ■ too many shared blocks
■ Possible reactions
  ■ Invalidation of all shared blocks (broadcast, Dir/B)
  ■ Selection of one entry (even at random) and its invalidation (no broadcast, Dir/NB)

Other modifications
■ Coarse vector (Dir/CVr)
  ■ r is the size of a region (several caches/processors) that corresponds to one bit (i.e., several caches/processors are represented by one entry)
    ■ see the interconnect architectures
  ■ Switch of interpretation (single processor vs. region) as a result of overflow
  ■ Limited broadcast to all processors in the region
■ LimitLESS: software interrupt in case of overflow ("the program decides")

Chained protocols
■ Cache-based linked list
  ■ Only one pointer per memory block
  ■ The other pointers are part of the cache memories
  ■ The cache memories (their blocks) are "chained"
■ Advantages
  ■ Memory footprint minimization
■ Drawbacks
  ■ Complex protocol
  ■ More communication ("from cache to cache"; more than theoretically necessary)
  ■ Writes have higher latency (they must pass through the whole chain)

Hierarchical directories
■ Usable in systems with multiple buses (in general, broadcast-supporting interconnects)
■ Hierarchy of caches
  ■ Higher hierarchy at each bus interconnect
■ Higher memory requirements
  ■ A higher hierarchy level must keep copies of the shared blocks from lower levels
■ No need for fast "pointer" memory
■ In principle a hierarchy of snoopy protocols with special extensions

Scalability
■ No single definition
■ Most used definition: a system is scalable if the following holds
  ■ Performance grows linearly with price
  ■ The Price/Performance ratio is fixed
  ■ Both conditions are equivalent, but usually only one is stated explicitly
■ Alternative parameter – scalability extent
  ■ Performance change as a result of a processor addition
  ■ Price change as a result of a processor addition
  ■ A rational extent of the number of processors is considered
    ■ e.g. the Price/Performance ratio is not constant, but within a defined interval (of the number of processors) the changes are small (and much larger outside the interval)
Speedup
■ S(N) = T_exec(1) / T_exec(N)
■ T_exec(N) = T_comp(N) + T_comm(N)
■ Ideal speedup means
  ■ T_comp(N) = T_comp(1) / N
  ■ T_comm(N) = T_comm(1) / N

Speedup – commentary
■ A theoretical feature, reality depends on the application
■ Different values for different applications (on the same system)
■ Amdahl's Law – extent of possible parallelization
  ■ Parallelizable and serial part of the task
  ■ Parallelizability has its limits

Extensible interconnecting networks
■ Parallel systems must include an interconnecting network
■ It influences the behavior of the parallel system
■ Ideal extensible network
  ■ Low cost growing linearly with the number of processors (N)
  ■ Minimal latency independent of N
  ■ Throughput growing linearly with N

Network properties
■ Three basic components
  ■ Topology
  ■ Switching (how data move between nodes)
  ■ Routing (how a path is computed)

Interconnecting networks
■ We distinguish the following parameters
  ■ Network size – number of nodes N
  ■ Node degree d (number of edges from a node)
  ■ Network radius D
    ■ longest shortest path
  ■ Bisection width B
  ■ Network redundancy A
    ■ minimal number of links that must be removed for the network to become split into two disconnected parts
  ■ Cost C
    ■ number of links in the network

Bisection width
■ Bisection width
  ■ minimal number of links that must be removed for the network to split into two parts of the same size
■ Bisection bandwidth
  ■ cumulative throughput of all removed links
■ Ideal property:
  ■ bisection bandwidth per processor is constant

Topology of interconnecting networks
■ Classification based on the number of dimensions
  ■ One-dimensional
  ■ Two-dimensional, planar
  ■ Three-dimensional, cubic
  ■ Hypercube, tesseract (higher dimensions)

One-dimensional interconnects
■ Linear array
  ■ Individual nodes serially connected
  ■ "(string of) beads"
  ■ Simplest
  ■ Worst properties (for N > 2)

Two-dimensional interconnects
■ Ring
  ■ closed linear array
■ Star
■ Tree
  ■ Decreases the network radius (2 log₂((N+1)/2) for a binary tree)
  ■ Still bad redundancy and bisection (band)width
■ Fat tree
  ■ Adds redundant links at higher levels
  ■ Improves bisection bandwidth

Fat tree topology
[Figure: fat tree topology]

Two-dimensional mesh
■ Very popular
■ Good properties
  ■ Radius 2(√N − 1)
  ■ Bisection √N
  ■ Redundancy 2
■ However, a higher cost and variable node degree
■ Torus
  ■ closed two-dimensional mesh
  ■ Radius √N
  ■ Bisection 2√N
  ■ Redundancy 4
  ■ Higher cost – adds 2√N links

Three-dimensional mesh
■ Properties
  ■ Radius 3(N^(1/3) − 1)
  ■ Bisection N^(2/3)
  ■ Redundancy 3
  ■ Acceptable cost 3(N − N^(2/3))
■ Difficult to construct

Torus topologies
[Figure: 1-D torus (4-ary 1-cube), 2-D torus (4-ary 2-cube), 3-D torus (3-ary 3-cube)]

Hypercube, tesseract
■ Very interesting topology
■ In general an n-dimensional cube
■ Basic parameters
  ■ Radius log₂ N
  ■ Bisection N/2
  ■ Redundancy log₂ N
  ■ Higher cost (N log₂ N)/2
■ Meshes are special cases of a hypercube (lower dimensionality)
■ Routing very simple
  ■ based on the binary numbering of nodes

Deadlock-free 6D torus (K Computer)
[Figure: the deadlock-free 6D torus interconnect of the K Computer]

Fully connected network
■ More of a theoretical construct
■ Excellent radius (1)
■ Unacceptable cost (N(N − 1)/2) and node degree (N − 1)

Switching
■ The specific mechanism of how a packet moves from input to output (on a transit node)
■ Basic approaches
  ■ Packet switching, store-and-forward
  ■ Circuit switching
  ■ Virtual connection (cut-through)
  ■ Wormhole routing

Store-and-forward
■ The whole packet is stored on the transit node
  ■ and is sent out only after the full packet has been received
■ Simple (first generation of parallel computers)
■ High latency: (P/B) · D
  ■ P is the packet size, B is the throughput and D is the number of "hops" (distance)

Circuit switching
■ Three stages
  ■ Connection setup – initiated by a probe
  ■ Data transmission
  ■ Connection teardown
■ Visibly lower latency: (P/B) · D + M/B
  ■ P is the size of the probe and M is the size of the message (packets are not needed nor relevant)
■ For P << M the latency is independent of the path length

Virtual connection
■ The message is split into smaller blocks – flow control digits/units (flits)
  ■ The first flit contains the path info (initially the target address)
  ■ The next flits contain the data (payload)
  ■ The last flit breaks the path
■ Individual flits are sent as a continuous stream
  ■ With buffers of sufficient size, this corresponds to circuit switching
■ Latency: (H_F/B) · D + M/B
  ■ H_F is the flit length, usually H_F << M

Wormhole
■ Special case of the virtual circuit
  ■ the buffer size exactly fits the flit size
■ Latency does not depend on the distance (the length of the path)
  ■ Pipeline analogue
  ■ The whole packet resides in several buffers at different nodes on the path – thus the "wormhole"
  ■ Latency is considered at the level of the whole message transfer, not per flit; the number of flits is much higher than the distance (the length of the path = number of hops)
  ■ The transmission time is a sum of the length of the path and the number of transferred flits
  ■ while for store-and-forward it relates to the product of these two values
■ The protocol supports packet replication
  ■ convenient for multicast and broadcast

Wormhole
[Figure: three interfering flows in a wormhole-routed network]

Virtual channels
■ Physical channel sharing
  ■ Several buffers over the same channel
  ■ Flits stored in the appropriate buffer
■ Use
  ■ Overloaded links
  ■ Deadlock avoidance
  ■ Logical-to-physical topology mapping
  ■ Guarantees of sufficient transport capacity for system data

Routing in the interconnecting networks
■ Path discovery
■ Properties
  ■ Static routing
    ■ Source based
    ■ Distributed
  ■ Adaptive routing (always distributed)
  ■ Minimal but also non-minimal routes

Fault tolerance of interconnecting networks
■ Error check
■ Message acknowledgment
■ Message re-transmission

Memory Latency
■ Memory is considerably slower than the processor
■ Waiting for memory substantially decreases the efficiency of the whole system
■ Solutions:
  ■ Latency decrease – access speedup
  ■ Latency hiding – interleaving of computing and data transmission

Memory Latency Decrease
■ NUMA: Non-Uniform Memory Access
  ■ Each logical address is mapped to a concrete physical address
■ COMA: Cache-Only Memory Architecture
  ■ Main memory considered as an attraction memory
  ■ Memory rows can be moved freely
  ■ A memory row can have several copies

Recapitulation
Communication-to-computation ratio vs. working set size:

  Comm/comp ratio | NUMA, small WS | NUMA, large WS | COMA, small WS | COMA, large WS
  Low             | Good           | Medium         | Good           | Good
  High            | Medium         | Poor           | Poor           | Poor

Memory Latency Hiding
■ Weak consistency models
■ Prefetch
■ Multiple-context processors
■ Producer-initiated communication

Weak consistency
■ Does not require strictly synchronized access to shared variables unless explicitly required (explicit synchronization)
■ Release consistency:
  ■ new instructions acquire and release
■ Fence instruction
  ■ enforced finalization of unfinished instructions (waiting)

Prefetch
■ Data moved to the processor in advance
■ Binding prefetch
  ■ Data moved to the processor (register)
  ■ Risk of a consistency break
■ Non-binding prefetch
  ■ Data moved to the cache only
■ HW prefetch
■ SW prefetch
  ■ Special instruction prefetch-exclusive: a read followed by a write
■ Remember ANDES?
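A minimal C sketch of a non-binding software prefetch, using the GCC/Clang builtin __builtin_prefetch; the builtin and the prefetch distance of 16 elements are our illustrative choices, not something prescribed by the lecture. Because the data are pulled only into the cache (never into a register), a mispredicted prefetch wastes bandwidth but cannot break consistency.

    #include <stddef.h>

    double sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)                     /* stay within the array          */
                __builtin_prefetch(&a[i + 16],  /* address to pull into the cache */
                                   0,           /* 0 = prefetch for reading       */
                                   1);          /* low temporal locality hint     */
            s += a[i];
        }
        return s;
    }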
Multiple-context processors
■ Multithreading support
■ Requires
  ■ Very fast context switch (1-2 ticks)
  ■ Very high number of registers (a full set per thread)
■ Many experimental systems
  ■ HEP (seventies)
  ■ Tera
  ■ *T

Producer-initiated communication
■ Analogy of invalidate and update in cache coherency
■ Specific use for message passing systems (e.g. Cray T3D) or block copy (computers with shared memory)
■ Suitable for the transfer of large blocks or for lock-based synchronization

Synchronization support
■ Synchronization leads to "hot spots"
■ Basic synchronization primitives/protocols
  ■ Mutual exclusion (mutex)
  ■ Dynamic load distribution
  ■ Event propagation
  ■ Global serialization (barriers)

Mutual exclusion (mutex)
■ Access to a shared variable is granted to at most one process at any given time
■ Universal, but usually rather expensive
■ Synchronization constructs of higher programming languages
  ■ Semaphores
  ■ Monitors
  ■ Critical sections
■ Foundation – hardware support
  ■ test&set instruction
  ■ test-and-test&set instruction
  ■ Spin waiting protocol

test&set
■ Properties
  ■ An atomic instruction that reads the current content and sets it to 1 (CLOSED)
  ■ Busy (active) waiting till the read value is 0

    char *lock;
    /* exchange() atomically swaps *lock with CLOSED and returns the old value */
    while (exchange(lock, CLOSED) == CLOSED)
        ;

■ Highly stresses cache coherent multiprocessor systems
  ■ Each "set" flushes all caches

test-and-test&set
■ Properties
  ■ Reads the current value and runs the atomic test&set only if the read value is 0

    for (;;) {
        while (*lock == CLOSED)
            ;                      /* spin on the locally cached copy */
        if (exchange(lock, CLOSED) != CLOSED)
            break;
    }

■ The protocol is responsive to the cache subsystem properties
  ■ The first test is done over the shared (in-cache) copy

Use of queues
■ Improvement: collision avoidance schemes
■ Queue On Lock Bit (QOLB) protocol
  ■ The most effective implementation
  ■ Processes are lined up in a queue
  ■ After the lock release, the process at the queue head is activated
  ■ No data transfer among all waiting processes needed

Locks in multiprocessors
■ Related to dynamic load distribution
■ Use of a counter with an atomic operation
  ■ Fetch&Op – counters, e.g., Op == Add

    int fetch_and_add(int *x, int a)
    {
        int temp = *x;      /* the read-modify-write below is atomic in hardware */
        *x += a;
        return temp;
    }

  ■ Compare&Swap – lists

Event propagation
■ Global events/signals used as a tool
  ■ for producers to inform consumers that data are ready
  ■ to inform about a global change in a set of equivalent processes
    ■ a status change that must be announced to all processes

Barriers
■ A point of synchronization
■ To let all processes synchronize at the same point
■ No process can pass the barrier until all processes have reached it
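A minimal sketch of the barrier semantics using POSIX threads (assumes a system providing pthread_barrier_t; compile with -pthread; illustrative, not part of the lecture):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: before the barrier\n", id);
        pthread_barrier_wait(&barrier);    /* nobody continues until all NTHREADS arrive */
        printf("thread %ld: after the barrier\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }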