Unit VI: Pipelining | CSE211: COMPUTER ORGANIZATION AND DESIGN | B.Tech CSE


Unit VI: Pipelining


⭐Numerical Formulas:


  • Pipeline Speedup:
    S = Execution time without pipeline / Execution time with pipeline
  • Throughput:
    T = Number of instructions / Total time
  • Efficiency:
    E = Pipeline speedup / Number of stages
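  • Worked example (illustrative numbers): if a program takes 800 ns without a pipeline and 200 ns with a 5-stage pipeline, then S = 800 / 200 = 4 and E = 4 / 5 = 0.8 (80%); if 100 instructions complete in those 200 ns, T = 100 / 200 ns = 0.5 instructions per ns.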


⭐Advanced Caches-I


1. Cache Pipelining

  • Definition: A method of breaking down cache operations into smaller, sequential stages, allowing multiple operations to be processed simultaneously.
  • Stages:
    • Tag Check: Checks if the requested data is in the cache by comparing tags.
    • Data Access: Reads or writes data in cache memory if the tag check is successful.
    • Write-back (if applicable): Writes data back to lower memory levels if there’s a modified block in the cache.
  • Advantages:
    • Higher Throughput: Multiple memory operations can be processed at different stages, reducing delays.
    • Reduced Effective Access Time: By overlapping stages, a new access can start every cycle, so the average time per access drops even though each individual access still passes through all stages.

2. Write Buffers

  • Definition: A small, fast memory area that holds data waiting to be written to main memory or a lower-level cache.
  • Purpose: Allows the CPU to continue processing without stalling while waiting for write operations to complete.
  • Operation:
    • When the CPU writes to memory, the data is temporarily stored in the write buffer.
    • The write buffer manages these pending writes and eventually writes the data to the main memory when it’s not busy.
  • Benefits:
    • Reduces Write Latency: CPU can perform other operations while write requests are handled in the background.
    • Prevents Stalls: Reduces stalls in pipelines by allowing reads and writes to proceed concurrently.
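
A minimal conceptual sketch in C of the write-buffer operation described above, modeled as a small FIFO queue; the structure, field names, and four-entry depth are illustrative assumptions, not a real hardware interface:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4                         /* illustrative buffer depth */

typedef struct { uint32_t addr, data; } WriteEntry;

typedef struct {
    WriteEntry entries[WB_ENTRIES];
    int head, tail, count;
} WriteBuffer;

/* CPU side: queue a pending write; the CPU stalls only if the buffer is full. */
bool wb_enqueue(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;        /* full: CPU must wait */
    wb->entries[wb->tail] = (WriteEntry){addr, data};
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;                                      /* CPU continues immediately */
}

/* Memory side: retire one pending write when the memory bus is idle. */
void wb_drain_one(WriteBuffer *wb) {
    if (wb->count == 0) return;
    WriteEntry e = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    /* stands in for the actual write to main memory */
    printf("write 0x%08x to address 0x%08x\n", (unsigned)e.data, (unsigned)e.addr);
}

int main(void) {
    WriteBuffer wb = {0};
    wb_enqueue(&wb, 0x1000, 42);             /* stores complete at once from the CPU's view */
    wb_enqueue(&wb, 0x1004, 43);
    wb_drain_one(&wb);                       /* the buffer later retires writes in order */
    wb_drain_one(&wb);
    return 0;
}
```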

3. Multilevel Caches

  • Definition: A hierarchical structure of caches, typically including multiple levels (L1, L2, and sometimes L3) with varying speeds and sizes.
  • Levels:
    • L1 Cache: Fastest, smallest, and closest to the CPU; designed for the highest speed and lowest latency.
    • L2 Cache: Larger than L1, with slightly longer latency; it stores data not found in L1.
    • L3 Cache: Largest and slowest (if present), shared across cores in multi-core systems.
  • Benefits:
    • Reduced Access Time: Most data requests are fulfilled by the faster caches, reducing the need to access slower main memory.
    • Improved Hit Rate: By using multiple levels, the system can keep more frequently accessed data close to the CPU, improving performance.

4. Victim Caches

  • Definition: A small, fully associative cache that holds recently evicted cache lines from a higher-level cache (typically L1).
  • Purpose: Reduces conflict misses by storing data recently evicted from the main cache.
  • How It Works:
    • When data is evicted from L1, it is stored in the victim cache.
    • If a future request matches data in the victim cache, it’s retrieved from there instead of accessing the slower main memory or L2.
  • Benefits:
    • Reduces Miss Penalty: Helps recover data that may be accessed again soon after eviction, decreasing miss rates.
    • Efficient for Direct-Mapped Caches: Particularly beneficial in direct-mapped caches where conflict misses are common due to limited associativity.

5. Prefetching

  • Definition: A technique to fetch data into cache before it’s actually requested by the CPU, anticipating future accesses.
  • Types:
    • Hardware Prefetching: Managed by the CPU hardware, which predicts and fetches future data based on access patterns.
    • Software Prefetching: Compiler inserts prefetch instructions, typically based on predictable data access patterns in the code.
  • Methods:
    • Sequential Prefetching: Fetches the next block in sequence if it predicts the CPU will need it soon.
    • Stride Prefetching: Recognizes and prefetches data with regular patterns (e.g., array accesses with a fixed interval).
  • Advantages:
    • Reduces Cache Misses: By pre-loading data, the cache is more likely to contain required data when it’s accessed.
    • Improves Performance: Minimizes idle time by ensuring the CPU has necessary data available in advance.
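
A minimal sketch of software prefetching in C using the GCC/Clang builtin __builtin_prefetch; the 16-element look-ahead distance and the function name are illustrative assumptions that would normally be tuned for the target machine:

```c
#include <stddef.h>

/* Sum an array while prefetching elements a fixed distance ahead, so the data
 * is already on its way into the cache when the loop reaches it. */
long sum_with_prefetch(const int *a, size_t n) {
    const size_t dist = 16;                          /* illustrative look-ahead distance */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);  /* 0 = read, 1 = low temporal locality */
        sum += a[i];
    }
    return sum;
}
```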

⭐Advanced Caches-II


1. Software Memory Optimization

  • Definition: Techniques used to improve the efficiency of memory usage and data access patterns in programs, minimizing cache misses and enhancing overall performance.
  • Purpose: To maximize the use of cache by organizing code and data to better fit cache memory, reducing the need for slow main memory access.
  • Techniques:
    • Loop Tiling (Blocking):
      • Divides large data structures into smaller blocks to increase cache reuse within each block.
      • Commonly used for matrix operations to fit submatrices into the cache (see the tiled matrix-multiply sketch after this list).
    • Data Layout Optimization:
      • Structuring Arrays: Arranges data in a way that matches how it will be accessed, ensuring cache lines are fully utilized.
      • Array of Structures (AoS) vs. Structure of Arrays (SoA): Choose data layouts that better match access patterns, especially in SIMD operations.
    • Loop Unrolling:
      • Expands the loop to reduce loop control overhead and enable better pipelining and cache usage.
      • Example: If accessing an array in a loop, unrolling allows multiple array elements to be accessed at once, enhancing data locality.
    • Memory Access Reordering:
      • Reorders computations to access data in a sequential pattern, reducing cache misses.
      • Example: Accessing arrays row-wise instead of column-wise to exploit spatial locality in row-major memory layouts.
    • Prefetching (Software-Controlled):
      • Inserting prefetch instructions manually to bring data into the cache before it is needed.
      • Helps avoid cache misses when data access patterns are predictable.
    • Minimizing Cache Interference:
      • Organizing data to avoid multiple pieces of frequently used data mapping to the same cache line, reducing conflict misses.
      • Example: By padding arrays or adjusting data placement, you can reduce conflicts in direct-mapped or set-associative caches.
  • Benefits:
    • Improved Cache Hit Rate: Maximizes data usage in the cache, minimizing costly memory accesses.
    • Faster Execution: Reduces the number of cache misses, allowing the CPU to access data faster.
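
A minimal sketch of loop tiling (blocking) for matrix multiplication in C, as referenced in the list above; the matrix size N, the tile size TILE, and the assumption that N is divisible by TILE are illustrative:

```c
#define N    512   /* illustrative matrix dimension */
#define TILE 64    /* illustrative tile size; N is assumed divisible by TILE */

/* C += A * B, computed tile by tile so each TILE x TILE submatrix is reused
 * while it is still resident in the cache (caller zero-initializes C). */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + TILE; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```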

2. Nonblocking Caches

  • Definition: A type of cache that allows multiple cache requests to be processed simultaneously without blocking or stalling other requests.
  • Purpose: To allow the CPU to continue executing instructions while cache misses are being handled, rather than waiting for each request to complete before moving on.
  • Operation:
    • When a cache miss occurs, the CPU can proceed with other instructions that do not depend on the missed data.
    • Multiple cache misses can be handled concurrently, reducing delays and improving throughput.
  • Key Features:
    • Miss Status Holding Registers (MSHRs):
      • Holds information about outstanding cache misses, allowing the cache to track and manage multiple misses at once.
    • Multiple Miss Handling:
      • Supports concurrent requests for data from memory, allowing non-blocking behavior during cache misses.
      • Helps in handling long-latency memory operations without stalling the pipeline.
    • Out-of-Order Execution Compatibility:
      • Complements out-of-order processors by allowing cache misses to be handled in parallel with other independent instructions.
  • Types:
    • Fully Nonblocking Cache: Allows any number of concurrent misses, though it’s complex and requires more resources.
    • Partially Nonblocking Cache: Limits the number of concurrent misses it can handle, balancing complexity and performance.
  • Benefits:
    • Reduces Cache Miss Penalty: Decreases waiting time for the CPU during cache misses, allowing it to perform other tasks.
    • Increases Throughput: By keeping the pipeline filled with useful instructions, overall execution speed is improved, especially in memory-intensive applications.

⭐Vector Processors and GPUs


1. Introduction

Vector Processors:

  • Definition: Processors that can operate on entire arrays (vectors) of data with a single instruction.
  • Purpose: Designed to handle data in parallel, making them ideal for tasks like scientific computations, graphics, and signal processing.
  • How They Work:
    • Instead of processing one data element at a time, they apply the same operation to multiple elements simultaneously.
    • Example: Adding two arrays element-by-element in a single operation.

GPUs (Graphics Processing Units):

  • Definition: Highly parallel processors originally designed to handle graphics and image processing tasks, now used in a wide range of applications.
  • Purpose: Optimized for data-parallel operations, GPUs can process thousands of threads at once, making them suitable for AI, gaming, scientific computing, and more.
  • Architecture:
    • Consists of many small, efficient cores that can perform computations in parallel.
    • Each core is simpler than a CPU core but optimized for parallel workloads, executing the same instruction across multiple data points.

2. Hardware Optimization

Vector Processor Hardware Optimization:

  • Multiple Functional Units:
    • Vector processors contain multiple ALUs (Arithmetic Logic Units) that allow simultaneous execution of operations on multiple data elements.
  • Vector Registers:
    • Large registers that can hold entire vectors, reducing the need for frequent memory access.
  • Memory Bandwidth Optimization:
    • High memory bandwidth is essential to keep the data flow to and from the processor fast enough to support vector operations.

GPU Hardware Optimization:

  • SIMD (Single Instruction, Multiple Data):
    • Executes the same instruction across multiple data points, making GPUs ideal for parallel tasks.
  • Streaming Multiprocessors (SMs):
    • Each SM can execute multiple threads concurrently, organized into groups called warps.
  • High Memory Bandwidth:
    • GPUs have high memory bandwidth to handle the large data requirements of parallel processing.
  • Texture and Shared Memory:
    • Specialized memory types to optimize data access patterns, often used in image processing.

Benefits of Hardware Optimization:

  • Increased Parallelism: Executes more operations in parallel, improving speed for large-scale data tasks.
  • Reduced Latency: By reducing memory accesses and optimizing data flow, latency is minimized.
  • Energy Efficiency: Processing multiple data points simultaneously can reduce energy consumption per task.

3. Vector Software and Compiler Optimization

Vector Software Optimization:

  • Loop Vectorization:
    • Converts scalar loops (operating on one data element at a time) into vector operations to leverage hardware parallelism.
    • Example: A loop that adds two arrays can be rewritten to add entire segments at once (see the sketch after this list).
  • Data Structure Optimization:
    • Arranges data to ensure efficient access by vector processors or GPUs.
    • Using contiguous memory layouts for arrays allows faster, sequential access.
  • Memory Alignment:
    • Ensures data is aligned in memory for optimal access by vector registers, minimizing alignment-related penalties.
  • Prefetching Data:
    • Preloading data into cache or registers before it’s needed helps maintain continuous data flow and prevents stalls.
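
A minimal sketch of a loop written to be auto-vectorizable in C, as referenced in the list above; the restrict qualifiers are the key hint, and the function name is illustrative:

```c
#include <stddef.h>

/* Element-wise addition written so the compiler can vectorize it:
 * restrict promises the arrays do not overlap, and the unit-stride
 * counted loop maps directly onto SIMD lanes. */
void vec_add(float *restrict c, const float *restrict a,
             const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Built with optimization (e.g., gcc -O3, which enables the loop vectorizer), a loop like this is typically compiled into SSE/AVX instructions that process several floats per instruction.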

Compiler Optimization for Vector Processors and GPUs:

  • Automatic Vectorization:
    • Some compilers can automatically transform code to use vector instructions where possible, identifying parallelizable parts of the code.
    • Example: Compilers such as GCC and Intel's compiler can detect and apply vectorization to compatible loops.
  • SIMD Instructions:
    • Compilers optimize by using SIMD-specific instructions (e.g., AVX, SSE) to handle multiple data elements with a single instruction.
  • Loop Unrolling and Loop Fusion:
    • Loop Unrolling: Expands the loop body to perform multiple iterations per cycle, reducing loop overhead.
    • Loop Fusion: Merges two adjacent loops that operate on the same data, enhancing data locality and cache efficiency.
  • Memory Coalescing (for GPUs):
    • Organizes memory accesses to ensure adjacent threads access contiguous memory locations, reducing the number of memory transactions.
  • Thread Scheduling (for GPUs):
    • Compilers optimize the order in which threads are scheduled to maximize parallel efficiency and minimize idle time.
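
A minimal sketch of explicit SIMD in C using x86 AVX intrinsics; it assumes an AVX-capable CPU and compilation with -mavx, and the function name is illustrative:

```c
#include <immintrin.h>
#include <stddef.h>

/* Adds eight single-precision floats per instruction using 256-bit AVX registers. */
void vec_add_avx(float *c, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);              /* load 8 floats (unaligned) */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));  /* 8 additions in one instruction */
    }
    for (; i < n; i++)                                   /* scalar tail for leftover elements */
        c[i] = a[i] + b[i];
}
```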

Benefits of Software and Compiler Optimization:

  • Improved Performance: Maximizes the use of vector and GPU hardware capabilities, leading to faster execution.
  • Efficient Memory Usage: Reduces cache misses and improves data locality.
  • Automatic Optimizations: Reduces the need for manual tuning, making it easier for developers to take advantage of vector and GPU processing.

⭐Multithreading


1. SIMD (Single Instruction, Multiple Data)

  • Definition: A parallel processing technique where a single instruction is executed on multiple data points simultaneously.
  • Purpose: Used to perform the same operation across multiple data elements, which is ideal for tasks with high data parallelism, such as image processing, scientific calculations, and matrix operations.
  • How It Works:
    • Single Instruction: Only one instruction is issued by the processor.
    • Multiple Data Streams: The same instruction operates on multiple data elements (e.g., adding two arrays of numbers element-by-element).
  • Example:
    • If we have two arrays, SIMD can add each corresponding element across both arrays in a single operation rather than looping through each pair of elements.
  • Applications:
    • Common in multimedia processing, such as video playback, gaming, and image filtering.
    • Used in AI and machine learning for vectorized computations.
  • Benefits:
    • Increased Throughput: Processes large datasets faster by handling multiple data points per instruction.
    • Energy Efficiency: Reduces the number of required instructions, saving power.
  • Limitations:
    • Best suited for tasks with identical operations on each data point; less effective for tasks requiring different operations.

2. GPUs (Graphics Processing Units)

  • Definition: Highly parallel processors initially designed for graphics rendering but now widely used for general-purpose computing (GPGPU).
  • Purpose: Optimized for tasks that can be split into many smaller, parallel subtasks, such as rendering images, simulations, and deep learning computations.
  • Architecture:
    • Many Cores: Contains thousands of small, efficient cores for high data-parallel processing.
    • Streaming Multiprocessors (SMs): Groups of cores in a GPU that can execute multiple threads in parallel.
    • Memory Types:
      • Global Memory: Large but slower; accessible by all threads.
      • Shared Memory: Fast memory shared by threads in the same block, useful for reducing data access times.
      • Registers: Fast, local storage within each core.
  • Working Principle:
    • A GPU executes thousands of threads in parallel, with each core performing the same operation across different data points.
    • GPUs are particularly efficient at SIMD operations, executing a single instruction across large sets of data.
  • Applications:
    • Graphics rendering, image processing, scientific simulations, cryptocurrency mining, AI and machine learning.
  • Benefits:
    • Massive Parallelism: Executes a high number of threads simultaneously, providing much faster computation for parallel tasks.
    • Efficient for High Data Parallelism: Handles tasks where the same operation must be applied to large datasets.
  • Limitations:
    • Less suited for tasks with high sequential dependencies.
    • Not as flexible for complex branching logic, which CPUs handle more efficiently.

3. Coarse-Grained Multithreading

  • Definition: A multithreading technique where a processor switches between threads only on costly events, such as cache misses or long memory access delays.
  • Purpose: To reduce idle time by allowing the processor to work on a different thread when the current thread is stalled.
  • How It Works:
    • Unlike fine-grained multithreading (where switches occur every cycle), coarse-grained multithreading only switches when a thread encounters a long latency event.
    • The processor begins executing a new thread while the first thread is waiting for its stalled operation (e.g., memory fetch) to complete.
  • Benefits:
    • Reduced Latency: Minimizes idle time by performing useful work from another thread while waiting for a long-latency event.
    • Increased Throughput: Keeps the processor busy, potentially improving overall system performance.
  • Example:
    • In a web server, if one thread is waiting for data from the database, the processor can switch to another thread handling a different request, thus improving response times.
  • Limitations:
    • Context Switching Overhead: Switching between threads involves some overhead and can reduce efficiency if switching occurs too frequently.
    • Less Responsive Than Fine-Grained Multithreading: Because thread switching only occurs on long-latency events, coarse-grained multithreading is less responsive to quickly changing workloads.

⭐Parallel Programming-I


1. Introduction to Parallel Programming

  • Definition: Parallel programming is a programming technique where multiple processes or threads execute simultaneously, working together to solve a problem faster than sequential processing.
  • Purpose: Aims to improve performance by dividing tasks among multiple processors or cores, allowing computations to be done in parallel.
  • Benefits:
    • Increased Performance: Reduces the time required to complete a task by running multiple operations at once.
    • Efficient Resource Utilization: Takes full advantage of multi-core processors.
    • Scalability: Can handle larger problems as additional cores become available.
  • Challenges:
    • Data Synchronization: Ensuring that multiple threads have consistent access to shared data.
    • Concurrency Issues: Handling scenarios where multiple threads need to access the same resources without conflicts.
    • Programming Complexity: Writing and debugging parallel programs is generally more complex than writing sequential programs.

2. Sequential Consistency

  • Definition: A consistency model in parallel programming where the result of executing a series of operations is as if the operations were executed in some sequential order, even if they are actually executed in parallel.
  • Explanation:
    • Sequential consistency ensures that all threads in a program see memory operations in the same order.
    • Even if threads run in parallel, they appear to follow a single global order for shared memory operations.
  • Importance:
    • Predictability: Makes parallel programs easier to reason about, as it guarantees an order of operations visible to all threads.
    • Debugging: Simplifies debugging by ensuring a consistent view of operations.
  • Example:
    • Suppose Thread A writes a value to a variable, and Thread B reads from that variable. In a sequentially consistent system, Thread B will see either the old or the new value of the variable, depending on the global order of operations, and all threads agree on that order.
  • Limitations:
    • Performance Impact: Enforcing sequential consistency can limit performance as it restricts some optimizations that would allow greater flexibility in reordering operations.
    • Less Common in Modern Systems: Many modern architectures use weaker consistency models for better performance, requiring programmers to enforce consistency where needed.
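
A sketch of the classic two-thread litmus test for sequential consistency, written in C with POSIX threads and C11 atomics (whose default ordering is sequentially consistent); the variable and function names are illustrative, and the program is compiled with -pthread:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Classic litmus test: with sequentially consistent (default) atomics,
 * the outcome r1 == 0 && r2 == 0 is impossible, because some store must
 * come first in the single global order that all threads observe. */
atomic_int x = 0, y = 0;
int r1, r2;

void *thread_a(void *arg) { (void)arg; atomic_store(&x, 1); r1 = atomic_load(&y); return NULL; }
void *thread_b(void *arg) { (void)arg; atomic_store(&y, 1); r2 = atomic_load(&x); return NULL; }

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("r1 = %d, r2 = %d\n", r1, r2);   /* (0, 0) never appears */
    return 0;
}
```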

3. Locks

  • Definition: A synchronization mechanism used in parallel programming to ensure that only one thread or process can access a critical section (a portion of code that accesses shared resources) at a time.
  • Purpose: Prevents data races and ensures that shared data remains consistent by allowing exclusive access to critical sections.
  • Types of Locks:
    • Mutex (Mutual Exclusion):
      • The most common lock type, ensuring only one thread can access a critical section at a time.
      • When a thread locks a mutex, other threads trying to lock it are blocked until it’s unlocked.
    • Spinlock:
      • A lightweight lock where a thread continuously checks if the lock is available.
      • Used in situations where waiting times are expected to be very short, as it avoids the overhead of putting threads to sleep.
    • Read-Write Lock:
      • Allows multiple threads to read data simultaneously, but only one thread to write at a time.
      • Useful for scenarios where reads are more frequent than writes, optimizing access times.
  • How Locks Work:
    • A lock is acquired before entering a critical section and released afterward.
    • If another thread tries to acquire the lock while it’s held, it will either wait or, in the case of a spinlock, keep checking until the lock is available.
  • Applications:
    • Ensuring consistent data when multiple threads access shared resources (e.g., database, shared memory).
    • Preventing race conditions, where two or more threads attempt to modify data at the same time.
  • Challenges:
    • Deadlock: Occurs when two or more threads wait indefinitely for each other to release locks, causing the program to halt.
    • Priority Inversion: A situation where a lower-priority thread holds a lock, blocking a higher-priority thread from executing.
    • Performance Overhead: Locks can slow down program execution, as threads must wait to access shared resources.
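
A minimal sketch of a mutex guarding a critical section, using POSIX threads (pthread_mutex) in C; the shared counter, loop count, and thread count are illustrative, and the program is compiled with -pthread:

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                               /* shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                     /* acquire: enter the critical section */
        counter++;                                     /* only one thread runs this at a time */
        pthread_mutex_unlock(&lock);                   /* release: leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);                /* always 400000 with the lock */
    return 0;
}
```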

⭐Parallel Programming-II


1. Atomic Operations

  • Definition: Operations that are completed in a single step without interruption, meaning they are indivisible and cannot be broken down.
  • Purpose: Ensure that a single operation on a shared resource is completed entirely without interference from other threads, preventing race conditions.
  • Examples of Atomic Operations:
    • Incrementing a Counter: An atomic increment operation ensures that only one thread can update the counter at a time.
    • Compare-and-Swap (CAS): Compares the current value of a variable with a specified value, and if they match, swaps it with a new value.
  • Importance:
    • Thread Safety: Prevents data inconsistencies by ensuring that no other thread can access the data mid-operation.
    • Efficiency: Atomic operations are faster than locks since they don’t require context switching or waiting, ideal for simple tasks.
  • Limitations:
    • Limited Scope: Atomic operations are only applicable for simple tasks (like incrementing a variable) and may not be sufficient for complex operations that require multiple steps.
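
A minimal sketch of the two atomic operations named above, using C11 <stdatomic.h>; the variable and function names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter = 0;

/* Atomic increment: an indivisible read-modify-write, no lock needed. */
void increment(void) {
    atomic_fetch_add(&counter, 1);
}

/* Compare-and-swap: write new_val only if counter still equals expected.
 * Returns true on success; on failure, expected is updated to the current value. */
bool try_set(int expected, int new_val) {
    return atomic_compare_exchange_strong(&counter, &expected, new_val);
}
```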

2. Memory Fences (Barriers)

  • Definition: Instructions that enforce ordering constraints on memory operations, ensuring that certain operations are completed before others begin.
  • Purpose: Helps maintain memory consistency across threads by controlling the order in which operations appear to execute, especially on systems with relaxed memory models.
  • Types of Memory Fences:
    • Load Fence: Ensures all load (read) operations before the fence complete before any subsequent loads.
    • Store Fence: Ensures all store (write) operations before the fence complete before any subsequent stores.
    • Full Fence: Ensures that all loads and stores before the fence complete before any loads or stores after it.
  • Usage in Multithreading:
    • Memory fences are essential in systems with multiple processors or cores, where instructions may be reordered for optimization.
    • They prevent threads from seeing inconsistent states due to instruction reordering by enforcing a strict order on specific operations.
  • Example:
    • In a producer-consumer setup, a memory fence can ensure that data written by the producer is visible to the consumer before it accesses it.
  • Limitations:
    • Performance Cost: Memory fences can slow down program execution by enforcing stricter ordering, reducing optimization flexibility.
    • Complexity: Adding memory fences correctly requires a deep understanding of the system’s memory model, which can be complex for developers.
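
A minimal sketch of the producer-consumer example above, using C11 atomic_thread_fence; the data/ready variable names are illustrative, and in practice the same effect is often achieved with release/acquire operations on the flag itself:

```c
#include <stdatomic.h>
#include <stdbool.h>

int data = 0;                        /* payload written by the producer */
atomic_bool ready = false;           /* flag telling the consumer the payload is valid */

void producer(void) {
    data = 42;                                         /* 1. write the payload */
    atomic_thread_fence(memory_order_release);         /* 2. store fence: the payload write is
                                                             ordered before the flag store */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

bool consumer(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;                                  /* producer not done yet */
    atomic_thread_fence(memory_order_acquire);         /* load fence: the flag read is ordered
                                                             before the payload read below */
    *out = data;                                       /* guaranteed to observe 42 */
    return true;
}
```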

3. Locks

  • Definition: A synchronization mechanism that allows only one thread to access a resource or critical section at a time.
  • Purpose: Prevents multiple threads from accessing shared resources simultaneously, ensuring data consistency.
  • Common Types of Locks:
    • Mutex (Mutual Exclusion): A standard lock that only allows one thread to execute in a critical section at a time.
    • Spinlock: A lock where a thread continually checks if the lock is available, ideal for short wait times.
    • Read-Write Lock:
      • Allows multiple threads to read concurrently but restricts write access to only one thread at a time.
      • Useful in scenarios with many reads and few writes.
  • How Locks Work:
    • A thread must acquire a lock before entering a critical section and release it after exiting.
    • If another thread tries to acquire the lock while it’s held, it must wait until the lock is released.
  • Issues with Locks:
    • Deadlock: Occurs when two or more threads wait on each other indefinitely to release locks, causing a standstill.
    • Priority Inversion: When a low-priority thread holds a lock needed by a high-priority thread, leading to delays.
    • Performance Overhead: Frequent use of locks can reduce performance, as threads spend time waiting rather than executing.
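
A minimal sketch of the read-write lock described above, using POSIX pthread_rwlock in C; the shared variable and function names are illustrative:

```c
#include <pthread.h>

static int shared_value = 0;
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

/* Any number of readers may hold the lock at the same time. */
int read_value(void) {
    pthread_rwlock_rdlock(&rwlock);
    int v = shared_value;
    pthread_rwlock_unlock(&rwlock);
    return v;
}

/* A writer gets exclusive access: no other readers or writers. */
void write_value(int v) {
    pthread_rwlock_wrlock(&rwlock);
    shared_value = v;
    pthread_rwlock_unlock(&rwlock);
}
```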

4. Semaphores

  • Definition: A synchronization tool that uses a counter to control access to a resource, allowing a specified number of threads to access the resource simultaneously.
  • Types of Semaphores:
    • Binary Semaphore: A semaphore with a value of either 0 or 1, similar to a mutex, allowing only one thread access at a time.
    • Counting Semaphore: Allows multiple threads to access a resource up to a specified limit, as defined by the semaphore’s count.
  • How Semaphores Work:
    • Wait (P Operation): Decreases the semaphore count by 1. If the count is zero, the thread is blocked until another thread increments the count.
    • Signal (V Operation): Increases the semaphore count by 1, allowing a waiting thread to access the resource if it was previously blocked.
  • Applications:
    • Resource Management: Used to control access to a pool of resources, such as limiting the number of threads that can use a database connection.
    • Thread Synchronization: Coordinates actions between threads, ensuring certain tasks are completed before others begin.
  • Example:
    • In a system with a limited number of database connections, a counting semaphore can limit the number of threads that access the database simultaneously.
  • Advantages:
    • Flexibility: Can handle multiple threads, making it ideal for managing access to limited resources.
    • Efficiency: Allows more than one thread access when appropriate, improving resource utilization.
  • Challenges:
    • Risk of Misuse: Incorrectly using semaphores can lead to issues like deadlocks.
    • Complexity: Requires careful management to ensure proper access and avoid conflicts or resource exhaustion.
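
A minimal sketch of the database-connection example above, using a POSIX counting semaphore in C; the pool size and function names are illustrative:

```c
#include <semaphore.h>
#include <stdio.h>

#define MAX_CONNECTIONS 3                    /* illustrative size of the connection pool */

static sem_t conn_sem;

void use_connection(int id) {
    sem_wait(&conn_sem);                     /* P (wait): blocks if all connections are in use */
    printf("thread %d is using a connection\n", id);
    /* ... run the database query here ... */
    sem_post(&conn_sem);                     /* V (signal): return the connection to the pool */
}

int main(void) {
    sem_init(&conn_sem, 0, MAX_CONNECTIONS); /* at most 3 threads inside at once */
    use_connection(1);
    sem_destroy(&conn_sem);
    return 0;
}
```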

⭐Small Multiprocessors


1. Bus Implementation

  • Definition: A communication system that transfers data between components in a multiprocessor system, allowing processors to communicate with each other and with memory.
  • Purpose: Enables processors to share memory and other resources, facilitating communication and data transfer within the system.
  • Key Components:
    • Data Bus: Carries data between processors, memory, and other devices.
    • Address Bus: Transmits the addresses of data locations in memory.
    • Control Bus: Sends control signals to coordinate the timing and direction of data flow.
  • Types of Bus Systems:
    • Single Bus: All processors and memory modules share a single bus; simple but can become a bottleneck with many processors.
    • Multiple Bus: Uses separate buses to improve bandwidth, allowing simultaneous data transfers.
  • Challenges:
    • Scalability: As more processors are added, contention for the bus increases, leading to slower communication.
    • Performance Bottlenecks: If multiple processors try to access the bus simultaneously, it can create delays.
  • Solution - Bus Arbitration:
    • Purpose: Decides which processor or device can use the bus when multiple devices request it.
    • Arbitration Methods:
      • Centralized Arbitration: A single controller grants access to the bus based on priority.
      • Distributed Arbitration: All devices participate in the arbitration process, deciding among themselves.

2. Cache Coherence Protocols

  • Definition: Protocols that maintain consistency of data in caches of different processors in a multiprocessor system.
  • Purpose: Ensures that all processors have a consistent view of memory, even when multiple caches hold copies of the same data.
  • Why Cache Coherence is Needed:
    • In multiprocessor systems, each processor has its own cache. If one processor updates a data value in its cache, other caches may have an outdated copy, leading to inconsistencies.
  • Types of Cache Coherence Problems:
    • Write-Through Problem: Even when a cache writes updates through to main memory, other caches may still hold an outdated copy of the shared variable and do not see the update.
    • False Sharing: Happens when processors repeatedly invalidate each other's cache entries, even though they are accessing different parts of the same cache line.
  • Main Cache Coherence Protocols:
    • Snooping Protocols:
      • Definition: Each cache monitors (or "snoops on") the shared bus to detect if other caches modify data that it holds.
      • Common Snooping Protocols:
        • Write-Invalidate: When a processor writes to a cache line, it invalidates the line in all other caches, ensuring only one valid copy exists at a time.
        • Write-Update (Write-Broadcast): When a processor writes to a cache line, it broadcasts the update to other caches, so all copies are updated.
      • Advantages:
        • Effective for small multiprocessors.
        • Simple implementation on a shared bus.
      • Disadvantages:
        • Becomes inefficient as the number of processors increases due to high traffic on the shared bus.
    • Directory-Based Protocols:
      • Definition: Uses a centralized directory to track which caches hold copies of each memory block. The directory manages the coherence, eliminating the need for caches to constantly monitor the bus.
      • How It Works:
        • The directory keeps track of the state of each memory block and which processors have a copy.
        • When a processor wants to read or write, it contacts the directory, which then handles the coherence action.
      • Advantages:
        • Scales better with more processors than snooping protocols.
        • Reduces bus traffic, as only necessary updates are communicated.
      • Disadvantages:
        • More complex to implement due to the need for a centralized directory and additional memory overhead.
  • States in Cache Coherence Protocols:
    • MESI Protocol (common in snooping-based systems):
      • Modified (M): The cache line is modified (dirty) and only exists in this cache. It must be written back to main memory before others can read it.
      • Exclusive (E): The cache line is unmodified, and this is the only cache holding it; the processor can write to it without notifying other caches (the line then becomes Modified).
      • Shared (S): The cache line may be present in multiple caches but is not modified.
      • Invalid (I): The cache line is invalid or outdated.
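      • Example walk-through: P1 reads block X on a miss and caches it in state E; P2 then reads X, so both copies move to S; when P1 writes X, the protocol invalidates P2's copy (I) and P1's copy becomes M; if P2 reads X again, P1 supplies the block (writing it back), and both copies return to S.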

