FACULTY OF INFORMATICS, Masaryk University
PA039: Supercomputer Architecture and Intensive Computing
Parallel computers
Luděk Matýska, Spring 2024

Parallel computers
■ Small-scale multiprocessing
  ■ 2 to several hundreds of cores (tens of processors)
  ■ mostly SMP (shared memory systems)
■ Large-scale multiprocessing
  ■ from hundreds to millions of cores (processors)
  ■ most often distributed memory

Parallel computers (II)
■ Architecture
  ■ Single Instruction Multiple Data (SIMD)
  ■ Multiple Instruction Multiple Data (MIMD)
■ Programming models
  ■ Single Program Multiple Data (SPMD)
  ■ Multiple Programs Multiple Data (MPMD)
■ Concurrent, Parallel, Distributed
  ■ Concurrent: a single program with multiple tasks in progress
  ■ Parallel: a single program with multiple tasks closely cooperating
  ■ Distributed: several programs (loosely) cooperating

Architecture – SIMD
■ All processors synchronized
■ All performing the same instruction at any given time
■ Analogy of vector processors
■ Simple processors
■ Simple programming model
  ■ but difficult programming (esp. for simple scalar operations)

Vector processor
■ Processor able to work directly with vectors of data
  ■ a vector is a data type of the underlying instruction set
  ■ Cray even introduced vector registers (otherwise direct work with the memory)
■ Vector Load/Store
  ■ "composing" a vector from different memory words/areas
  ■ a vector of registers that keep the addresses of memory words with the actual data
  ■ "localizes" data for further processing
  ■ in practice the scatter/gather operations directly over the main memory

Vector processor
■ Memory subsystem
  ■ by default does not work with caches
  ■ interleaved (banked) memory
    ■ several concurrent operations over the main memory
  ■ it has higher throughput than the use of caches, esp. when the data are "randomly" scattered over the main memory (random access to data)
[Figure: vector processor block diagram – scalar CPU with instruction cache and a 256 × 32-bit scalar data bus; vector unit with a vector register file (VLR, VMR), FPU add and multiply units, and a Vector Memory Interface (VMI) connected to two sets of eight data memory banks]

Architecture – MIMD
■ Fully asynchronous system
■ Individual processors fully independent
■ No special production needed (off-the-shelf)
■ Advantages
  ■ Higher flexibility
  ■ At least in theory higher efficiency
■ Disadvantages
  ■ Explicit synchronization
  ■ Difficult programming (easy to introduce race conditions)

Communication models
■ Shared Memory Architecture
■ Message Passing

Shared Memory Architecture
■ Memory separated from processors
■ Uniform access to the memory
■ Bus as the easiest interconnect
■ Cheap interprocess/thread communication
■ Complex overlap of processing and communication (active waiting)

Message Passing
■ Each processor "visible"
■ Each processor has its own memory
■ Explicit communication – message passing
■ High communication cost (exchange of data)
■ Easier overlap of processing and communication

Hybrid systems
■ Non-uniform memory access architecture (NUMA)
■ Cache-only memory access architecture (COMA)
■ Distributed shared memory (DSM)

Non-uniform memory access
■ Access to different memory addresses takes different time
■ Provides for higher scalability
■ Potentially lower throughput
■ Cache memory coherence problem
  ■ ccNUMA – cache coherent NUMA

Cache only memory access
■ NUMA, but resembling cache memory behavior
■ Data move close to the processors that use them
■ No hierarchy like with caches
■ The system must ensure that the last copy remains (it cannot be deleted)
■ Experimental
  ■ Software-based hybrid NUMA-COMA implementation provided by ScaleMP
  ■ A shared-memory multiprocessor system on top of a cluster of commodity servers

Distributed shared memory
■ Distributed system – cluster
  ■ Local memory on each node
  ■ Remote memory on other nodes
■ "Fiction" of a single extended memory
■ Hardware solution
  ■ Usually based on virtual memory principles (pages moved between nodes)
  ■ Transparent
■ Software solution
  ■ Library
  ■ Non-transparent, the programmer must modify the program (calls to the proper APIs from the library)
■ ScaleMP hybrid

Cache Memory Coherence
■ Reasons for a cache miss:
  ■ Compulsory miss: first access to data
  ■ Capacity miss: insufficient capacity
  ■ Conflict miss: different memory areas mapped to the same cache row
  ■ Coherence miss: different data in different caches
■ Multiprocessors are exposed to the last case
■ But it can happen in a single-processor case, too (how?)
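As a hedged illustration of coherence misses (and of the false sharing mentioned on the following slides), the C/pthreads sketch below is ours, not from the lecture; it assumes 64-byte cache lines and compilation with -pthread. Two threads update counters that share one cache line, so every write invalidates the line in the other core's cache; padding the counters apart removes the effect.

    /* Sketch (illustrative, assumes 64-byte cache lines): coherence misses
     * caused by false sharing between two threads. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    struct { long a; long b; } close_together;            /* a and b share a cache line  */
    struct { long a; char pad[64]; long b; } far_apart;   /* a and b in different lines  */

    static void *bump(void *p)            /* increment one counter ITERS times */
    {
        long *x = p;
        for (long i = 0; i < ITERS; i++)
            (*x)++;
        return NULL;
    }

    static void run(long *a, long *b)     /* run two incrementing threads to completion */
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, a);
        pthread_create(&t2, NULL, bump, b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
    }

    int main(void)
    {
        run(&close_together.a, &close_together.b);  /* slow: constant coherence misses */
        run(&far_apart.a, &far_apart.b);            /* fast: lines never ping-pong     */
        printf("%ld %ld\n", close_together.a + close_together.b,
                            far_apart.a + far_apart.b);
        return 0;
    }

Timing the two run() calls (e.g. with clock_gettime) typically shows the padded variant running several times faster on a multicore machine, even though both perform the same number of increments.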
Invalidation
■ Reaction to a content change in a remote (cache) memory
■ The row in the local ("snooping") cache memory is invalidated
■ If the same row is needed later, it is retrieved from the memory (again)

Update
■ The cache memory row is updated immediately
■ If data are needed (again), they are already in the cache
■ Drawbacks
  ■ False sharing
    ■ a row includes more words
  ■ High load on the bus (broadcasting interconnect)
■ Invalidation and Update are equivalent performance-wise
  ■ unless a specific memory access pattern is used

Coherence Cache Miss solutions
■ Cache memory must be aware of a change elsewhere
■ Broadcast based protocols
■ Directory based protocols

Snoopy cache
■ Broadcast based protocol
■ Communication network with a "natural" broadcast
■ Each processor follows/watches all memory accesses

Directory based protocols
■ Snoopy protocol is based on broadcast
  ■ Not usable for more complex interconnect networks
  ■ Not scalable
■ Solution: reduction of actively "touched" caches – directories
  ■ Tag at each memory block
  ■ A cache memory with a copy of such a block explicitly references the tag
  ■ Special exclusivity tag (writing)

Directory based protocols
■ Three basic schemas
  ■ Fully mapped directories
  ■ Limited directories (partially mapped)
  ■ Chained directories
■ We will compare them based on the following features
  ■ Size of the additional memory needed
  ■ Number of necessary instructions/steps (latency introduced)

Fully mapped directories
■ Each memory block is able to directly reference all caches (processors) simultaneously
■ Bit vector of copies
  ■ If a bit is set, the corresponding cache keeps a copy of the data block
■ Exclusivity tag
  ■ One per block
  ■ Writing can be performed on one processor (one cache) only
■ Additional tags for each block in each cache
  ■ Validity tag
  ■ Exclusivity tag

Limited directories
■ Full directories are rather expensive (long bit vectors)
  ■ Additional memory needed: P·M/B bits
    ■ P – number of cache memories (processors)
    ■ M – size of the main memory
    ■ B – block size
■ The cumulative capacity of all caches is usually smaller than the size of the main memory
■ Data blocks are usually not extensively shared
  ■ Most directory entries contain zeros
■ Solution: use of direct references
  ■ However, one bit per cache is not enough

Limited directories
■ Set of pointers to the caches
■ Dynamic allocation as needed
■ Features
  ■ Number of bits per pointer: log₂ P
  ■ Number of pointers in the pointer pool: k
  ■ More memory efficient than a fully mapped directory if k < P / log₂ P
  ■ Invalidation information is sent only to the caches keeping a copy of the changed data
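A hedged worked comparison of the two directory overheads above (the concrete value of P is illustrative, not from the slides):

    % Per-block directory overhead (sketch)
    %   fully mapped:  P bits (one presence bit per cache)
    %   limited:       k \cdot \log_2 P bits (k pointers of \log_2 P bits each)
    \[
        k \log_2 P < P
        \quad\Longleftrightarrow\quad
        k < \frac{P}{\log_2 P}
    \]
    % Example: with P = 256 caches, \log_2 P = 8, so a limited directory with
    % fewer than 32 pointers per block needs less memory than the full bit vector.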
Overflow
■ If the pointer pool is exhausted
  ■ too many shared blocks
■ Possible reactions
  ■ Invalidation of all shared blocks (broadcast, Dir/B)
  ■ Selection of one entry (even at random) and its invalidation (no broadcast, Dir/NB)

Other modifications
■ Coarse vector (Dir/CVr)
  ■ r is the size of a region (several caches/processors) that corresponds to one bit (i.e., several caches/processors are represented by one entry)
    ■ see the interconnect architectures
  ■ Switch of interpretation (single processor vs. region) as a result of overflow
  ■ Limited broadcast to all processors in the region
■ LimitLESS: software interrupt in case of overflow ("the program decides")

Chained protocols
■ Cache-based linked list
  ■ Only one pointer per memory block
  ■ The other pointers are part of the cache memories
  ■ The cache memories (their blocks) are "chained"
■ Advantages
  ■ Memory footprint minimization
■ Drawbacks
  ■ Complex protocol
  ■ More communication ("from cache to cache"; more than theoretically necessary)
  ■ Writes have higher latency (they must pass through the whole chain)

Hierarchical directories
■ Usable in systems with multiple buses (in general, broadcast-supporting interconnects)
■ Hierarchy of caches
  ■ Higher hierarchy at each bus interconnect
■ Higher memory requirements
  ■ A higher hierarchy level must keep copies of the shared blocks from lower levels
■ No need for fast "pointer" memory
■ In principle a hierarchy of snoopy protocols with special extensions

Scalability
■ No single definition
■ Most used definition: a system is scalable if the following holds
  ■ Performance grows linearly with price
  ■ The Price/Performance ratio is fixed
  ■ Both conditions are equivalent, but usually only one is stated explicitly
■ Alternative parameter – scalability extent
  ■ Performance change as a result of a processor addition
  ■ Price change as a result of a processor addition
  ■ A rational extent of the number of processors is considered
    ■ e.g. the Price/Performance ratio is not constant, but within a defined interval (of the number of processors) the changes are small (and much larger outside the interval)
Speedup
■ S(N) = T_exec(1) / T_exec(N)
■ T_exec(N) = T_comp(N) + T_comm(N)
■ Ideal speedup means
  ■ T_comp(N) = T_comp(1) / N
  ■ T_comm(N) = T_comm(1) / N

Speedup – commentary
■ A theoretical feature, reality depends on the application
■ Different values for different applications (on the same system)
■ Amdahl's Law – extent of possible parallelization
  ■ Parallelizable and serial part of the task
  ■ Parallelizability has its limits

Extensible interconnecting networks
■ Parallel systems must include an interconnecting network
■ It influences the behavior of the parallel system
■ Ideal extensible network
  ■ Low cost growing linearly with the number of processors (N)
  ■ Minimal latency independent of N
  ■ Throughput growing linearly with N

Network properties
■ Three basic components
  ■ Topology
  ■ Switching (how data move between nodes)
  ■ Routing (how a path is computed)

Interconnecting networks
■ We distinguish the following parameters
  ■ Network size – number of nodes N
  ■ Node degree d (number of edges from a node)
  ■ Network radius D
    ■ longest shortest path
  ■ Bisection width B
  ■ Network redundancy A
    ■ minimal number of links that must be removed for the network to become split into two disconnected parts
  ■ Cost C
    ■ number of links in the network

Bisection width
■ Bisection width
  ■ minimal number of links that must be removed for the network to split into two parts of the same size
■ Bisection bandwidth
  ■ cumulative throughput of all removed links
■ Ideal property:
  ■ bisection bandwidth per processor is constant

Topology of interconnecting networks
■ Classification based on the number of dimensions
  ■ One-dimensional
  ■ Two-dimensional, planar
  ■ Three-dimensional, cubic
  ■ Hypercube, tesseract (higher dimensions)

One-dimensional interconnects
■ Linear array
  ■ Individual nodes serially connected
  ■ "(string of) beads"
  ■ Simplest
  ■ Worst properties (for N > 2)

Two-dimensional interconnects
■ Ring
  ■ closed linear array
■ Star
■ Tree
  ■ Decreases the network radius (2 log₂((N+1)/2) for a binary tree)
  ■ Still bad redundancy and bisection (band)width
■ Fat tree
  ■ Adds redundant links at higher levels
  ■ Improves bisection bandwidth

Fat tree topology
[Figure: fat tree topology]

Two-dimensional mesh
■ Very popular
■ Good properties
  ■ Radius 2(√N − 1)
  ■ Bisection √N
  ■ Redundancy 2
■ However, a higher cost and variable node degree
■ Torus
  ■ closed two-dimensional mesh
  ■ Radius √N
  ■ Bisection 2√N
  ■ Redundancy 4
  ■ Higher cost – adds 2√N links

Three-dimensional mesh
■ Properties
  ■ Radius 3(N^(1/3) − 1)
  ■ Bisection N^(2/3)
  ■ Redundancy 3
  ■ Acceptable cost 3(N − N^(2/3))
■ Difficult to construct

Torus topologies
[Figure: 1-D torus (4-ary 1-cube), 2-D torus (4-ary 2-cube), 3-D torus (3-ary 3-cube)]

Hypercube, tesseract
■ Very interesting topology
■ In general an n-dimensional cube
■ Basic parameters
  ■ Radius log₂ N
  ■ Bisection N/2
  ■ Redundancy log₂ N
  ■ Higher cost (N log₂ N)/2
■ Meshes are special cases of a hypercube (lower dimensionality)
■ Routing very simple
  ■ based on the binary numbering of nodes

Deadlock-free 6D torus (K Computer)
[Figure: the deadlock-free 6D torus interconnect of the K Computer]

Fully connected network
■ More of a theoretical construct
■ Excellent radius (1)
■ Unacceptable cost (N(N − 1)/2) and node degree (N − 1)

Switching
■ The specific mechanism of how a packet moves from input to output (on a transit node)
■ Basic approaches
  ■ Packet switching, store-and-forward
  ■ Circuit switching
  ■ Virtual connection (cut-through)
  ■ Wormhole routing

Store-and-forward
■ The whole packet is stored on the transit node
  ■ and is sent out only after the full packet has been received
■ Simple (first generation of parallel computers)
■ High latency: (P/B) · D
  ■ P is the packet size, B is the throughput and D is the number of "hops" (distance)

Circuit switching
■ Three stages
  ■ Connection setup – initiated by a probe
  ■ Data transmission
  ■ Connection teardown
■ Visibly lower latency: (P/B) · D + M/B
  ■ P is the size of the probe and M is the size of the message (packets are not needed nor relevant)
■ For P << M the latency is independent of the path length

Virtual connection
■ The message is split into smaller blocks – flow control digits/units (flits)
  ■ The first flit contains the path info (initially the target address)
  ■ The next flits contain the data (payload)
  ■ The last flit breaks the path
■ Individual flits are sent as a continuous stream
  ■ With buffers of sufficient size, this corresponds to circuit switching
■ Latency: (H_F/B) · D + M/B
  ■ H_F is the flit length, usually H_F << M

Wormhole
■ Special case of the virtual circuit
  ■ the buffer size exactly fits the flit size
■ Latency does not depend on the distance (the length of the path)
  ■ Pipeline analogue
  ■ The whole packet resides in several buffers at different nodes on the path – thus the "wormhole"
  ■ Latency is considered at the level of the whole message transfer, not per flit; the number of flits is much higher than the distance (the length of the path = number of hops)
  ■ The transmission time is a sum of the length of the path and the number of transferred flits
  ■ while for store-and-forward it relates to the product of these two values
■ The protocol supports packet replication
  ■ convenient for multicast and broadcast

Wormhole
[Figure: three interfering flows in a wormhole-routed network]

Virtual channels
■ Physical channel sharing
  ■ Several buffers over the same channel
  ■ Flits stored in the appropriate buffer
■ Use
  ■ Overloaded links
  ■ Deadlock avoidance
  ■ Logical-to-physical topology mapping
  ■ Guarantees of sufficient transport capacity for system data

Routing in the interconnecting networks
■ Path discovery
■ Properties
  ■ Static routing
    ■ Source based
    ■ Distributed
  ■ Adaptive routing (always distributed)
  ■ Minimal but also non-minimal routes

Fault tolerance of interconnecting networks
■ Error check
■ Message acknowledgment
■ Message re-transmission

Memory Latency
■ Memory is considerably slower than the processor
■ Waiting for memory substantially decreases the efficiency of the whole system
■ Solutions:
  ■ Latency decrease – access speedup
  ■ Latency hiding – interleaving of computing and data transmission

Memory Latency Decrease
■ NUMA: Non-Uniform Memory Access
  ■ Each logical address is mapped to a concrete physical address
■ COMA: Cache-Only Memory Architecture
  ■ Main memory considered as an attraction memory
  ■ Memory rows can be moved freely
  ■ A memory row can have several copies

Recapitulation
Communication-to-computation ratio vs. working set size:

  Comm/comp ratio | NUMA, small WS | NUMA, large WS | COMA, small WS | COMA, large WS
  Low             | Good           | Medium         | Good           | Good
  High            | Medium         | Poor           | Poor           | Poor

Memory Latency Hiding
■ Weak consistency models
■ Prefetch
■ Multiple-context processors
■ Producer-initiated communication

Weak consistency
■ Does not require strictly synchronized access to shared variables unless explicitly required (explicit synchronization)
■ Release consistency:
  ■ new instructions acquire and release
■ Fence instruction
  ■ enforced finalization of unfinished instructions (waiting)

Prefetch
■ Data moved to the processor in advance
■ Binding prefetch
  ■ Data moved to the processor (register)
  ■ Risk of a consistency break
■ Non-binding prefetch
  ■ Data moved to the cache only
■ HW prefetch
■ SW prefetch
  ■ Special instruction prefetch-exclusive: a read followed by a write
■ Remember ANDES?
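A minimal C sketch of a non-binding software prefetch, using the GCC/Clang builtin __builtin_prefetch; the builtin and the prefetch distance of 16 elements are our illustrative choices, not something prescribed by the lecture. Because the data are pulled only into the cache (never into a register), a mispredicted prefetch wastes bandwidth but cannot break consistency.

    #include <stddef.h>

    double sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)                     /* stay within the array          */
                __builtin_prefetch(&a[i + 16],  /* address to pull into the cache */
                                   0,           /* 0 = prefetch for reading       */
                                   1);          /* low temporal locality hint     */
            s += a[i];
        }
        return s;
    }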
Multiple-context processors
■ Multithreading support
■ Requires
  ■ Very fast context switch (1-2 ticks)
  ■ Very high number of registers (a full set per thread)
■ Many experimental systems
  ■ HEP (seventies)
  ■ Tera
  ■ *T

Producer-initiated communication
■ Analogy of invalidate and update in cache coherency
■ Specific use for message passing systems (e.g. Cray T3D) or block copy (computers with shared memory)
■ Suitable for the transfer of large blocks or for lock-based synchronization

Synchronization support
■ Synchronization leads to "hot spots"
■ Basic synchronization primitives/protocols
  ■ Mutual exclusion (mutex)
  ■ Dynamic load distribution
  ■ Event propagation
  ■ Global serialization (barriers)

Mutual exclusion (mutex)
■ Access to a shared variable is granted to at most one process at any given time
■ Universal, but usually rather expensive
■ Synchronization constructs of higher programming languages
  ■ Semaphores
  ■ Monitors
  ■ Critical sections
■ Foundation – hardware support
  ■ test&set instruction
  ■ test-and-test&set instruction
  ■ Spin waiting protocol

test&set
■ Properties
  ■ An atomic instruction that reads the current content and sets it to 1 (CLOSED)
  ■ Busy (active) waiting till the read value is 0

    char *lock;
    /* exchange() atomically swaps *lock with CLOSED and returns the old value */
    while (exchange(lock, CLOSED) == CLOSED)
        ;

■ Highly stresses cache coherent multiprocessor systems
  ■ Each "set" flushes all caches

test-and-test&set
■ Properties
  ■ Reads the current value and runs the atomic test&set only if the read value is 0

    for (;;) {
        while (*lock == CLOSED)
            ;                      /* spin on the locally cached copy */
        if (exchange(lock, CLOSED) != CLOSED)
            break;
    }

■ The protocol is responsive to the cache subsystem properties
  ■ The first test is done over the shared (in-cache) copy

Use of queues
■ Improvement: collision avoidance schemes
■ Queue On Lock Bit (QOLB) protocol
  ■ The most effective implementation
  ■ Processes are lined up in a queue
  ■ After the lock release, the process at the queue head is activated
  ■ No data transfer among all waiting processes needed

Locks in multiprocessors
■ Related to dynamic load distribution
■ Use of a counter with an atomic operation
  ■ Fetch&Op – counters, e.g., Op == Add

    int fetch_and_add(int *x, int a)
    {
        int temp = *x;      /* the read-modify-write below is atomic in hardware */
        *x += a;
        return temp;
    }

  ■ Compare&Swap – lists

Event propagation
■ Global events/signals used as a tool
  ■ for producers to inform consumers that data are ready
  ■ to inform about a global change in a set of equivalent processes
    ■ a status change that must be announced to all processes

Barriers
■ A point of synchronization
■ To let all processes synchronize at the same point
■ No process can pass the barrier until all processes have reached it
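A minimal sketch of the barrier semantics using POSIX threads (assumes a system providing pthread_barrier_t; compile with -pthread; illustrative, not part of the lecture):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: before the barrier\n", id);
        pthread_barrier_wait(&barrier);    /* nobody continues until all NTHREADS arrive */
        printf("thread %ld: after the barrier\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }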