TheUnknownBlog

A Simple Hack to Use Touch ID for sudo on macOS

Sat, 06 Sep 2025 14:24:00 GMT

If you spend any amount of time in the macOS Terminal, you know the drill. You type a command with sudo, press Enter, type your long, secure password for the tenth time today, and think, "There has to be a better way."

There is. And it's right at your fingertips. It's a simple, reversible, and game-changing tweak that you'll appreciate every single day.

How It Works (The Quick Version)

macOS uses a flexible system called PAM (Pluggable Authentication Modules) to handle authentication. All we're going to do is edit the configuration file for sudo to tell it: "Hey, before you ask for a password, just check for a valid fingerprint from Touch ID first. If that works, we're good to go."

The 2-Minute Setup Guide

Step 1: Open the Terminal

You can find it in Applications/Utilities or just search for it with Spotlight (⌘ + Space).

Step 2: Open the PAM Configuration File

We need to edit a protected system file, so we'll use the simple command-line editor nano with sudo privileges. Copy and paste the following command and press Enter. It will ask for your password (likely for the last time!).

sudo nano /etc/pam.d/sudo

The nano editor will open inside your Terminal window. You'll see a few lines of configuration text.

The most important part is getting this next step right. On a new line right after the first commented line (the one starting with #), add the following:

auth       sufficient     pam_tid.so

Make sure it is the very first active rule. For my system, the file looks like this after the edit:

# sudo: auth account password session

auth       sufficient     pam_tid.so   # <-- This is the line we added
auth       include        sudo_local
auth       sufficient     pam_smartcard.so
auth       required       pam_opendirectory.so
account    required       pam_permit.so
password   required       pam_deny.so
session    required       pam_permit.so

The keyword sufficient is what makes this work. It tells the system that if Touch ID authentication succeeds, it's enough to grant permission, and no other authentication methods (like your password) are needed.
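If you're curious how sufficient and required interact, here is a toy Python model of the chain. It is only a sketch of the documented control-flag behavior, not how PAM is actually implemented: the evaluate_chain function and the results dictionary are made up for illustration, and the include sudo_local line is left out.

```python
# Toy model of how a PAM auth chain is evaluated (simplified; not real PAM code).
def evaluate_chain(rules, results):
    """rules: list of (control, module); results: module -> True (success) / False."""
    failed = False  # set once a 'required' module has failed
    for control, module in rules:
        ok = results.get(module, False)  # modules with no result count as failed
        if control == "sufficient":
            if ok and not failed:
                return True  # success here ends the chain immediately
            # a failed 'sufficient' module is simply ignored
        elif control == "required":
            if not ok:
                failed = True  # the chain continues, but the request is denied
    return not failed

# The auth rules from the file above, minus the 'include sudo_local' line.
SUDO_AUTH = [
    ("sufficient", "pam_tid.so"),
    ("sufficient", "pam_smartcard.so"),
    ("required", "pam_opendirectory.so"),
]

# Touch ID succeeds: the password module is never consulted.
print(evaluate_chain(SUDO_AUTH, {"pam_tid.so": True}))
# Touch ID fails or is cancelled: the password still works as a fallback.
print(evaluate_chain(SUDO_AUTH, {"pam_tid.so": False, "pam_opendirectory.so": True}))
```

The second call is the reassuring part: a failed fingerprint does not lock you out, because the chain simply falls through to the password rule.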

Step 3: Save and Exit

Press Control + O to Write Out (save) the file.
Press Enter to confirm the filename.
Press Control + X to exit nano and return to your prompt.

Time to Test It!

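The new rule is read every time sudo runs, but sudo also remembers a recent successful authentication for a few minutes. To see the Touch ID prompt, test from a fresh session: open a new Terminal window or tab (running sudo -k to clear the cached credentials also works).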

In your new Terminal session, type a simple sudo command, like:

sudo ls

Instead of a password prompt, you should be greeted by a Touch ID verification pop-up. Place your finger on the sensor, and your command will run. Welcome to the good life.

Good to Know

How do I undo this? Simply edit the /etc/pam.d/sudo file again and delete the auth sufficient pam_tid.so line you added.
What if it still asks for my password? You likely put the new line in the wrong place. Go back to Step 2 and make absolutely sure it's the first non-commented line in the file.
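What about macOS updates? Major system updates can sometimes overwrite this file, reverting it to the default. If Touch ID suddenly stops working for sudo after an update, just repeat these steps. On recent macOS versions whose /etc/pam.d/sudo already contains the auth include sudo_local line (as in my file above), you can instead put the pam_tid.so line in /etc/pam.d/sudo_local, which is meant for local additions and is intended to survive updates.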

That’s it! Enjoy the precious seconds you’ve reclaimed. Happy coding! 👍

Overview of the RISC-V Design with Tomasulo's Algorithm

Sun, 24 Aug 2025 18:09:00 GMT

Disclaimer

A large portion of this blog post is AI-generated text (Google Gemini 2.5 Pro Deep Research). Although I have reviewed, edited, and fact-checked the text, I cannot guarantee that it is 100% accurate or free of errors. Please use this content as a starting point for your own research and understanding, and verify any critical information independently.

With that said, I believe this post is super well-written and informative, and what really fascinates me is the "problem-solving" learning curve, which highlights the flaws and problems in every design choice, and segues into the components that solve each problem.

Part I: The Language of Hardware - Verilog Fundamentals

The study of processor design requires a fundamental shift in perspective. The tools and languages used to design hardware, such as Verilog, represent a different paradigm of computation. Understanding this paradigm is the first and most crucial step toward grasping the intricate workings of a modern CPU.

Section 1.1: Thinking in Parallel

The most significant conceptual leap from software to hardware is the transition from sequential to concurrent execution. A typical software program, written in a language like C++, is a sequence of instructions executed one after another by a processor. The program's state changes in a predictable, linear progression. In contrast, a physical hardware circuit is a collection of components—gates, flip-flops, memory blocks—that, once powered on, operate continuously and in parallel. A thousand logic gates do not wait their turn; they all compute their output based on their current inputs simultaneously, every moment in time.

It is for this reason that Verilog is classified as a Hardware Description Language (HDL), not a programming language in the traditional sense. Its primary purpose is not to provide a list of commands for a processor to execute, but to describe the physical structure and behavior of a digital electronic circuit. This description serves two main purposes: it can be fed into a simulation tool to model how the described circuit will behave over time, or it can be used by a synthesis tool to generate a netlist, which is a detailed blueprint for manufacturing an Application-Specific Integrated Circuit (ASIC) or configuring a Field-Programmable Gate Array (FPGA).

The fundamental unit of design in Verilog is the module. A module is a self-contained block of hardware logic, analogous to a class in C++ or a physical integrated circuit (IC) chip. It encapsulates internal logic and defines a clear interface to the outside world through a set of ports, which are declared as input, output, or inout. This modularity is essential for hierarchical design, allowing complex systems like an entire CPU to be built by connecting smaller, well-defined modules such as an Arithmetic Logic Unit (ALU), a register file, and a control unit.

Section 1.2: Describing Behavior - `initial` and `always` Blocks

Within a Verilog module, the behavior of the circuit is described primarily within two types of procedural blocks: initial and always. These blocks contain statements that define how the outputs and internal state of the module should change in response to inputs and time.

The `initial` Block

The initial block is the simpler of the two. As its name suggests, it contains a block of code that begins execution only once, at the very start of a simulation, at time zero. If multiple initial blocks are defined within a module, they all start concurrently at time zero.

This "run-once" behavior has a critical implication: initial blocks are generally not synthesizable as logic. Real hardware does not have a concept of a "beginning of time" in the same way a simulation does; once powered on, it operates continuously. Therefore, an initial block cannot be translated into a physical circuit that performs a sequence of actions only at power-on. (Many FPGA toolchains do accept simple initial blocks to set the power-up contents of registers and memories, but that is a configuration value rather than logic that runs once.) Its primary role is within the realm of simulation, specifically in the construction of a testbench. A testbench is a separate Verilog module written to test the design (often called the "Design Under Test" or DUT). Within a testbench, initial blocks are indispensable for generating clock signals, providing a sequence of input stimuli to the DUT, and setting up initial memory states to verify the design's correctness.

The `always` Block

The always block is the cornerstone of synthesizable Verilog code. It contains a block of statements that execute repeatedly throughout the simulation. The execution of an always block is triggered by events specified in its sensitivity list, denoted by @(...). This behavior directly models the nature of real hardware, which continuously reacts to changes in its input signals or to clock edges.

The sensitivity list dictates what kind of hardware the always block describes:

always @(posedge clk): This syntax specifies that the block should execute only on the positive (rising) edge of the signal named clk. This is the standard way to describe sequential logic, such as flip-flops and registers, which are memory elements that capture and store a value at a specific moment defined by a clock signal.
always @(*): The asterisk is a shorthand that tells the simulator to execute the block whenever any signal read inside the block changes its value, whether that signal appears on the right-hand side of an assignment or in an if or case condition. This describes combinational logic: circuits like adders, multiplexers, or decoders whose outputs depend solely on their current inputs, with no memory of past states.

Because these constructs map directly to physical hardware components (clocked registers and logic gates), always blocks are the primary tool for describing the synthesizable behavior of a digital design.
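The difference between the two sensitivity lists can be mimicked in ordinary software. The following is a minimal Python sketch, not Verilog: the DFlipFlop class and mux2 function are hypothetical names chosen for illustration. The flip-flop only changes its output when it sees a 0-to-1 transition on clk, while the multiplexer is a pure function of its current inputs.

```python
class DFlipFlop:
    """Sequential element: q changes only on a rising clock edge."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def tick(self, clk, d):
        # Equivalent in spirit to: always @(posedge clk) q <= d;
        if self._prev_clk == 0 and clk == 1:
            self.q = d
        self._prev_clk = clk
        return self.q


def mux2(sel, a, b):
    """Combinational element: the output depends only on the current inputs."""
    # Equivalent in spirit to: always @(*) y = sel ? b : a;
    return b if sel else a


ff = DFlipFlop()
samples = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]  # (clk, d) pairs over time
trace = [ff.tick(clk, d) for clk, d in samples]
print(trace)
print(mux2(0, 5, 9), mux2(1, 5, 9))
```

Note how the third sample changes d while clk stays high and the stored value does not move; only the next rising edge captures the new input.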

Section 1.3: The Heart of Synthesis - Blocking vs. Non-blocking Assignments

Perhaps the most frequent and critical point of confusion for those transitioning from software to Verilog is the distinction between the two types of assignment operators: blocking (=) and non-blocking (<=). This is not a matter of stylistic preference; the choice of operator is a direct instruction to the synthesis tool about the type of hardware circuit to create. Misunderstanding this distinction is a common cause of simulation-synthesis mismatches, where a design works perfectly in simulation but fails when implemented in actual hardware.

Blocking Assignments (`=`)

A blocking assignment is executed in the order it appears within a procedural block, much like in a C program. The execution of the current statement "blocks" the execution of any subsequent statements in the same begin...end block until it is complete. The variable on the left-hand side is updated immediately, and this new value is used by all subsequent statements in the block.

This immediate-update behavior models a chain of combinational logic. Imagine a series of logic gates connected by wires. The output of the first gate is instantaneously available as the input to the second gate. Blocking assignments are therefore the correct choice for describing this type of logic, typically within an always @(*) block.

Non-blocking Assignments (`<=`)

A non-blocking assignment operates in a two-phase manner that is fundamentally different from any software assignment. Within a block triggered by an event (like a clock edge), all the right-hand side (RHS) expressions of the non-blocking assignments are evaluated and stored in temporary variables first. Only after all RHS expressions have been evaluated does the second phase begin, where the left-hand side (LHS) variables are all updated simultaneously with their corresponding temporary values. The execution of one non-blocking assignment does not block the evaluation of the next.

This two-phase mechanism perfectly models the behavior of a bank of sequential logic elements, such as D-type flip-flops, that share a common clock. On a clock edge, all the flip-flops simultaneously sample the data at their D inputs. A short time later (the clock-to-Q delay), all their Q outputs change to reflect the newly captured values. The value of one flip-flop's output at the beginning of the clock cycle determines the input of the next flip-flop, but the update doesn't happen until the end of the cycle. Non-blocking assignments are therefore the correct and safe way to model state changes in sequential logic, and they should be used exclusively for assignments within a clocked always @(posedge clk) block.

Example: The Shift Register

The difference becomes crystal clear with a simple 3-bit shift register example. The goal is to have a value at data_in shift one position to the right on each clock cycle: data_in -> q1 -> q2 -> q3.

Incorrect Version (using Blocking =):

always @(posedge clk) begin
  q1 = data_in;
  q2 = q1;
  q3 = q2;
end

In simulation, on a single rising clock edge, q1 is immediately updated with data_in. Because this is a blocking assignment, that new value of q1 is then immediately used to update q2. And that new value of q2 is immediately used to update q3. The result is that the value from data_in propagates all the way to q3 within a single clock cycle. The synthesis tool will infer flip-flops that all capture data_in on the same clock edge (and may merge them, since they always hold the same value), not a chain of three registers. This is not a shift register.

Correct Version (using Non-blocking <=):

always @(posedge clk) begin
  q1 <= data_in;
  q2 <= q1;
  q3 <= q2;
end

On a rising clock edge, the simulator evaluates all RHS expressions first: data_in, the current value of q1, and the current value of q2. Then, at the end of the simulation time step, it updates the LHS variables simultaneously. q1 gets the value of data_in, q2 gets the old value of q1, and q3 gets the old value of q2. This correctly models three separate flip-flops, and it takes three clock cycles for a value to propagate from data_in to q3. This is a true shift register.
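The two behaviors can be reproduced with a small Python model. This is only a sketch of the update semantics, not a Verilog simulator; the function names and the dictionary used to hold q1, q2, and q3 are assumptions made for illustration. A single 1 is fed in, followed by zeros, and the register contents are recorded after every clock edge.

```python
def clock_edge_blocking(state, data_in):
    """Model `q1 = data_in; q2 = q1; q3 = q2;`: each statement sees earlier updates."""
    s = dict(state)
    s["q1"] = data_in
    s["q2"] = s["q1"]  # already the new q1
    s["q3"] = s["q2"]  # already the new q2
    return s


def clock_edge_nonblocking(state, data_in):
    """Model `q1 <= data_in; q2 <= q1; q3 <= q2;`: evaluate all RHS, then update."""
    rhs = (data_in, state["q1"], state["q2"])  # phase 1: sample the old values
    return {"q1": rhs[0], "q2": rhs[1], "q3": rhs[2]}  # phase 2: update together


def run(step, inputs):
    """Apply one clock edge per input and record (q1, q2, q3) after each edge."""
    state = {"q1": 0, "q2": 0, "q3": 0}
    trace = []
    for d in inputs:
        state = step(state, d)
        trace.append((state["q1"], state["q2"], state["q3"]))
    return trace


inputs = [1, 0, 0, 0]
print("blocking:    ", run(clock_edge_blocking, inputs))
print("non-blocking:", run(clock_edge_nonblocking, inputs))
```

In the blocking trace the 1 appears in all three registers after the first edge; in the non-blocking trace it reaches q3 only after the third edge, which is the behavior expected of a shift register.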

Pitfalls and Best Practices

Mixing blocking and non-blocking assignments in the same always block, or using the wrong type for the logic intended, can lead to indeterminate behavior known as a race condition. This occurs when the final state of a variable depends on the unpredictable order in which a simulator evaluates concurrent events. To avoid these issues and ensure a design that is both simulatable and synthesizable, designers adhere to strict rules of thumb:

When modeling sequential logic (clocked always blocks), use non-blocking assignments (<=).
When modeling combinational logic (always @(*) blocks), use blocking assignments (=).
Do not mix blocking and non-blocking assignments in the same always block.

The underlying reason for these rules is to bridge the gap between the discrete event-scheduling model of a simulator and the continuous, physical reality of hardware. A non-blocking assignment is a directive to the simulator to schedule an update for the end of the current time step, which is how a synthesis tool understands the need for a memory element (a flip-flop) that holds a value across clock cycles. A blocking assignment directs the simulator to update a value immediately, which is how a synthesis tool understands a direct connection of logic gates whose output changes as soon as the input changes. Using the wrong operator creates a mismatch between what is simulated and what is built, which is the root cause of many hardware design bugs.

Part II: The Blueprint of a CPU - The RISC-V ISA

Having established the language for describing hardware, the next step is to understand the vocabulary that a processor speaks. This vocabulary is its Instruction Set Architecture (ISA), the fundamental interface between software and hardware. For this exploration, the RISC-V ISA provides an ideal foundation due to its modern, clean, and extensible design.

Section 2.1: An Introduction to Instruction Set Architectures (ISA)

An ISA is the abstract model of a computer that is visible to a machine-language programmer or compiler. It is the definitive contract between the software that runs on a processor and the hardware that executes it. This contract specifies a set of critical elements, including:

The set of available instructions (the "opcodes").
The native data types.
The programmer-visible registers.
The memory addressing modes.
The handling of events like interrupts and exceptions.

Any processor that correctly implements a given ISA will execute the same machine code and produce the same results, regardless of its internal microarchitectural design. An Intel Core i9 and an AMD Ryzen processor, for example, have vastly different internal designs but can both run Windows because they both implement the x86-64 ISA.

Section 2.2: The RISC-V Revolution - Openness and Modularity

RISC-V (pronounced "risk-five") is not just another ISA; it represents a paradigm shift in how ISAs are developed and used. It was born at the University of California, Berkeley, in 2010 with the goal of creating a practical, high-quality ISA that was open, free, and suitable for a wide range of computing applications, from academic research to industrial deployment.

The RISC Philosophy

At its core, RISC-V is a pure embodiment of the Reduced Instruction Set Computer (RISC) philosophy. This design approach contrasts with the Complex Instruction Set Computer (CISC) paradigm of architectures like x86. The core tenets of RISC, and by extension RISC-V, are:

A small number of simple instructions: The instruction set is kept minimal, focusing on fundamental operations. More complex operations are built by combining these simple instructions.
Fixed-length instruction encoding: All base instructions are the same length (32 bits), which dramatically simplifies the hardware required for instruction fetching and decoding.
Load/Store architecture: The only instructions that access memory are explicit load and store operations. All arithmetic and logical operations are performed on operands held in processor registers. This simplifies the control logic and encourages efficient register usage by compilers.
One instruction per cycle: Each instruction is simple enough that every pipeline stage can finish in a single clock cycle, so a basic pipeline can complete close to one instruction per cycle, which is key to achieving high performance.

This adherence to simplicity results in a more streamlined processor design, leading to improved performance, lower power consumption, and reduced design complexity.

Open and Free

Unlike proprietary ISAs such as x86 and ARM, the RISC-V specification is developed and maintained by the non-profit RISC-V International and is available under open-source licenses. This means anyone can design, manufacture, and sell RISC-V chips and software without paying royalties. This openness has catalyzed a global wave of innovation, enabling startups, academic institutions, and even large corporations to develop custom processors tailored for specific applications without the barrier of licensing fees or vendor lock-in.

Modular Design

A defining feature of RISC-V is its inherent modularity. The ISA is not a monolithic entity but is structured as a small, mandatory base integer ISA with a rich set of optional standard extensions. A processor's full ISA is specified by its base and the extensions it implements. For instance, a common configuration for a 64-bit general-purpose processor is denoted RV64GC, which stands for RV64IMAFDC (in current versions of the specification, G also implies the Zicsr and Zifencei extensions).

The base integer ISAs are:

RV32I: The base 32-bit integer instruction set with 32 integer registers (x0-x31).
RV64I: The base 64-bit integer instruction set, extending the registers and operations to 64 bits.
RV32E: An embedded variant of RV32I with only 16 integer registers, designed for the smallest microcontrollers.

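The most common standard extensions are listed below. M, A, F, and D, together with the base I, are grouped under the letter 'G' for "General-Purpose"; C is a separate extension that is frequently added on top: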

M: Standard Extension for Integer Multiplication and Division. Adds instructions like mul, div, and rem.
A: Standard Extension for Atomic Instructions. Provides instructions for atomic memory operations (e.g., amoswap), essential for synchronization in multi-core systems.
F: Standard Extension for Single-Precision Floating-Point. Adds a separate floating-point register file (f0-f31) and instructions for 32-bit floating-point arithmetic.
D: Standard Extension for Double-Precision Floating-Point. Extends the F extension with support for 64-bit floating-point operations.
C: Standard Extension for Compressed Instructions. Defines 16-bit versions of the most common 32-bit instructions. This can significantly reduce code size and improve instruction fetch bandwidth, which is critical in memory-constrained embedded systems and for performance in high-end cores.

This modularity allows designers to create highly optimized processors. A tiny microcontroller for an IoT sensor might only implement RV32EMC, while a high-performance application processor in a data center might implement RV64G plus extensions for vector processing (V) and bit manipulation (B).
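The naming scheme is regular enough to be taken apart mechanically. The following Python sketch splits a simple ISA string into its register width, base, and single-letter extensions. It is a simplification under stated assumptions: parse_isa is a hypothetical helper, only the 32- and 64-bit bases discussed here are accepted, G is expanded to IMAFD only, and multi-letter extensions (names starting with Z, for example) are not handled.

```python
G_EXPANSION = "IMAFD"  # simplified: the Zicsr/Zifencei parts of G are omitted


def parse_isa(name):
    """Split a simple ISA string such as 'RV64GC' into (xlen, base, extensions)."""
    name = name.upper()
    if not name.startswith("RV"):
        raise ValueError("ISA string must start with 'RV'")
    i = 2
    while i < len(name) and name[i].isdigit():
        i += 1
    digits = name[2:i]
    if digits not in ("32", "64"):
        raise ValueError("this sketch only handles RV32 and RV64")
    letters = name[i:].replace("G", G_EXPANSION)
    if not letters or letters[0] not in "IE":
        raise ValueError("base must be I or E")
    return int(digits), letters[0], list(letters[1:])


print(parse_isa("RV64GC"))
print(parse_isa("RV32EMC"))
```

For the two configurations mentioned above, this yields a 64-bit I base with M, A, F, D, and C, and a 32-bit E base with M and C.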

Section 2.3: Anatomy of a RISC-V Instruction

All base RISC-V instructions are 32 bits long and fall into one of a few well-defined formats. The regularity of these formats is a key design feature that enables the simple, high-performance pipelines for which RISC architectures are known. The primary formats are:

R-type (Register): Used for register-to-register operations like add, sub, and, or.
```
| funct7 (7) | rs2 (5) | rs1 (5) | funct3 (3) | rd (5) | opcode (7) |
```
- opcode: Defines the instruction type (e.g., OP for register-register arithmetic).
- rd: The destination register.
- funct3: Further specifies the operation (e.g., ADD/SUB).
- rs1, rs2: The two source registers.
- funct7: An additional field to differentiate operations (e.g., ADD from SUB).
I-type (Immediate): Used for operations with an immediate value, including addi, and for load instructions like lw.
```
| imm[11:0] (12) | rs1 (5) | funct3 (3) | rd (5) | opcode (7) |
```
- imm[11:0]: A 12-bit signed immediate value.
- rs1: The source register.
- rd: The destination register.
S-type (Store): Used for store instructions like sw (store word).
```
| imm[11:5] (7) | rs2 (5) | rs1 (5) | funct3 (3) | imm[4:0] (5) | opcode (7) |
```
- The 12-bit immediate is split to accommodate the two source register fields.
- rs1: The base address register.
- rs2: The register containing the data to be stored.
B-type (Branch): Used for conditional branch instructions like beq (branch if equal). Similar to S-type, the immediate is split.
```
| imm[12|10:5] (7) | rs2 (5) | rs1 (5) | funct3 (3) | imm[4:1|11] (5) | opcode (7) |
```
- rs1, rs2: The registers to be compared.
- imm: The signed branch offset in bytes. Bit 0 is always zero and is not stored, so the 12 encoded bits cover offsets in multiples of 2, which are added to the PC.
U-type (Upper Immediate): Used for loading a 20-bit upper immediate value, as in lui (load upper immediate).
```
| imm[31:12] (20) | rd (5) | opcode (7) |
```
J-type (Jump): Used for unconditional jumps like jal (jump and link).
```
| imm[20|10:1|11|19:12] (20) | rd (5) | opcode (7) |
```

The deliberate and consistent placement of the opcode, rs1, rs2, and rd fields across these formats is not an accident. It is a cornerstone of efficient RISC design. In a pipelined processor, the Instruction Decode (ID) stage must identify the source registers and read their values from the register file. Because rs1 and rs2 are always in the same bit positions for all instruction formats that use them (R, I, S, B), the decoder hardware is greatly simplified. It can begin reading from the register file before it has even finished fully decoding the instruction to determine the exact operation. This parallelism within the ID stage is a crucial enabler of the classic 5-stage RISC pipeline, a concept that forms the foundation of modern processor execution.
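Because the fields sit at fixed bit positions, extracting them is just shifting and masking. The Python sketch below slices the common fields out of a 32-bit instruction word and reassembles the I-, S-, and B-type immediates according to the layouts above. The helper names (bits, sign_extend, decode_fields, imm_i, imm_s, imm_b) are hypothetical, and the example words were hand-assembled for this illustration.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_fields(word):
    """Slice the fixed-position fields; which are meaningful depends on the format."""
    return {
        "opcode": bits(word, 6, 0),
        "rd": bits(word, 11, 7),
        "funct3": bits(word, 14, 12),
        "rs1": bits(word, 19, 15),
        "rs2": bits(word, 24, 20),
        "funct7": bits(word, 31, 25),
    }


def imm_i(word):
    return sign_extend(bits(word, 31, 20), 12)


def imm_s(word):
    return sign_extend((bits(word, 31, 25) << 5) | bits(word, 11, 7), 12)


def imm_b(word):
    value = (
        (bits(word, 31, 31) << 12)   # imm[12]
        | (bits(word, 7, 7) << 11)   # imm[11]
        | (bits(word, 30, 25) << 5)  # imm[10:5]
        | (bits(word, 11, 8) << 1)   # imm[4:1]; imm[0] is implicitly 0
    )
    return sign_extend(value, 13)


print(decode_fields(0x002082B3))  # add x5, x1, x2
print(imm_i(0xFFF00293))          # addi x5, x0, -1
print(imm_s(0x0050A423))          # sw x5, 8(x1)
print(imm_b(0xFE208EE3))          # beq x1, x2, -4
```

Note that decode_fields never looks at the opcode before slicing rs1 and rs2, which mirrors how the register file read can start in parallel with the rest of decoding.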

Part III: The Assembly Line - Pipelined Execution and Its Perils

To achieve high performance, modern processors do not execute instructions one at a time, waiting for each to complete before starting the next. Instead, they use a technique called pipelining, which overlaps the execution of multiple instructions, much like an assembly line in a factory. This approach is fundamental to all high-performance CPUs, and the RISC-V ISA is explicitly designed to facilitate it.

Section 3.1: The Classic 5-Stage RISC Pipeline

Pipelining increases the instruction throughput—the number of instructions completed per unit of time—without necessarily decreasing the latency of any single instruction. The concept is best understood through the analogy of doing laundry. A sequential approach would be to wash, dry, fold, and put away one load of laundry completely before starting the next. A pipelined approach starts the washer on the second load as soon as the first load moves to the dryer. By keeping all stages (washer, dryer, folding table) busy, the total time to complete many loads is significantly reduced.

Similarly, the execution of a RISC instruction can be broken down into a series of uniform steps. The classic RISC pipeline consists of five stages:

IF (Instruction Fetch): The processor fetches the 32-bit instruction from the instruction memory (or cache) at the address currently held by the Program Counter (PC). Concurrently, the PC is updated to point to the next instruction, which is typically at address $PC+4$ since each instruction is 4 bytes long.
ID (Instruction Decode and Register Fetch): The fetched instruction is decoded by the control unit to determine what operation to perform. The format of the instruction is identified, and the required control signals for subsequent stages are generated. Simultaneously, the source register identifiers (rs1 and rs2) are used to read their corresponding values from the processor's register file. Any immediate value in the instruction is also sign-extended and prepared for use.
EX (Execute): This is where the actual computation occurs. The Arithmetic Logic Unit (ALU) performs the operation specified by the instruction. This could be an arithmetic operation (add, sub), a logical operation (and, or), a memory address calculation for a load or store (by adding the base register and the immediate offset), or a comparison for a branch instruction.
MEM (Memory Access): This stage is active only for load and store instructions. For a load instruction (lw), the address calculated in the EX stage is used to read data from the data memory (or cache). For a store instruction (sw), the address and data are used to write to the data memory. For all other instructions (e.g., arithmetic or branch), this stage performs no operation.
WB (Write-Back): The final stage writes the result of the operation back into the register file. For an arithmetic instruction, the result comes from the ALU. For a load instruction, the result is the data read from memory. The destination register identifier (rd) from the instruction determines which register is written.

In an ideal scenario, a new instruction enters the IF stage every clock cycle. After five cycles, the pipeline is full, and one instruction completes every cycle from then on, for an ideal throughput of one instruction per cycle (an IPC of 1).
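The arithmetic behind this can be made concrete. In a hazard-free pipeline with k stages, the first instruction takes k cycles and each later one finishes one cycle after its predecessor, so N instructions need k + (N - 1) cycles. The Python sketch below compares that with a non-pipelined machine that spends k cycles on every instruction; the function names are hypothetical, and the model assumes no stalls and one cycle per stage.

```python
def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles to finish n instructions in an ideal (hazard-free) pipeline."""
    if n_instructions <= 0:
        return 0
    return n_stages + (n_instructions - 1)


def unpipelined_cycles(n_instructions, n_stages=5):
    """Cycles if each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


for n in (1, 5, 1000):
    p = pipelined_cycles(n)
    u = unpipelined_cycles(n)
    print(f"{n:>5} instructions: {u} vs {p} cycles, speedup {u / p:.2f}x")
```

For a single instruction there is no gain at all, which is the latency point made earlier; the speedup only approaches the number of stages as the instruction count grows.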

Section 3.2: When the Assembly Line Breaks - Pipeline Hazards

The simple, elegant model of the 5-stage pipeline breaks down when dependencies between instructions conflict with the overlapped execution model. These conflicts are known as pipeline hazards, and they are the primary challenge in processor design. Hazards force the pipeline to stall, inserting "bubbles" where no useful work is done, thereby degrading performance. There are three main types of hazards.

Structural Hazards

A structural hazard occurs when two or more instructions in the pipeline require the same hardware resource at the same time. A classic example is a processor with a single, unified memory for both instructions and data. In such a design, a load instruction in its MEM stage would need to access memory simultaneously with a later instruction in its IF stage, which also needs to access memory to be fetched. This resource conflict would force one of the instructions to wait. The standard solution in RISC processors is to use a Harvard architecture, which employs separate, independent memories or caches for instructions and data, thus eliminating this specific hazard. Another potential structural hazard is in the register file, which is accessed for reads in the ID stage and for writes in the WB stage. This is typically resolved by designing the register file with separate read and write ports, or by performing writes in the first half of the clock cycle and reads in the second half.

Data Hazards

Data hazards arise from data dependencies between instructions. They occur when an instruction's execution depends on the result of a preceding instruction that is still in the pipeline.

Read-After-Write (RAW): This is the most common and intuitive data hazard. An instruction attempts to read a source register before a previous instruction has written its result back to that register. Consider the sequence:
```
add x5, x1, x2  // Instruction 1
sub x6, x5, x3  // Instruction 2
```
The sub instruction needs the value of x5, but the add instruction only calculates it in its EX stage and writes it back in its WB stage. By the time the sub instruction is in its ID stage ready to read x5, the add instruction has not yet completed its WB stage, so the register file contains an old, stale value for x5.
Write-After-Read (WAR): An instruction tries to write to a destination register before a preceding instruction has finished reading that register's original value. This is not a problem in the simple 5-stage pipeline because reads always happen in an earlier stage (ID) than writes (WB). However, it becomes a major issue in processors with out-of-order execution.
Write-After-Write (WAW): Two instructions in the pipeline are scheduled to write to the same destination register. Similar to WAR, this is not an issue in a simple in-order pipeline where writes happen in program order, but it is a critical hazard that must be managed in more complex designs.
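The three definitions translate directly into set comparisons on register names. The following Python sketch classifies the dependencies between an earlier and a later instruction. The Instr tuple and classify_hazards function are hypothetical names for illustration, and the check treats x0 as never written, since writes to x0 are discarded in RISC-V.

```python
from collections import namedtuple

# rd is the destination register; rs1/rs2 are sources (None if unused).
Instr = namedtuple("Instr", "op rd rs1 rs2")


def classify_hazards(first, second):
    """Return the set of hazards `second` has with the earlier instruction `first`."""
    def reads(ins):
        return {r for r in (ins.rs1, ins.rs2) if r is not None}

    def writes(ins):
        return ins.rd if ins.rd not in (None, "x0") else None

    hazards = set()
    if writes(first) is not None:
        if writes(first) in reads(second):
            hazards.add("RAW")  # second reads what first writes
        if writes(first) == writes(second):
            hazards.add("WAW")  # both write the same register
    if writes(second) is not None and writes(second) in reads(first):
        hazards.add("WAR")      # second overwrites what first still has to read
    return hazards


add = Instr("add", "x5", "x1", "x2")
print(classify_hazards(add, Instr("sub", "x6", "x5", "x3")))
print(classify_hazards(add, Instr("sub", "x1", "x3", "x4")))
print(classify_hazards(add, Instr("sub", "x5", "x3", "x4")))
```

The first pair is the add/sub example from the RAW description; the other two reuse the same add with a different sub to trigger WAR and WAW respectively.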

Control Hazards

Control hazards, also known as branch hazards, are caused by branch and jump instructions that change the normal flow of program execution. The processor does not know the outcome of a conditional branch (whether it is taken or not taken) until the comparison is performed in the EX stage. By that time, the processor has already fetched and started decoding the instructions that sequentially follow the branch (at $PC+4$). If the branch is taken, these fetched instructions are incorrect and must be flushed from the pipeline, and the fetch must restart from the branch target address. This flushing process introduces stalls, or bubbles, into the pipeline, reducing performance.
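The cost of these flushes can be estimated with a simple formula: effective CPI = base CPI + (fraction of branches) x (fraction taken) x (flush penalty). The Python sketch below evaluates it under loudly stated assumptions: the pipeline keeps fetching sequentially after a branch, a taken branch resolved in EX discards the two instructions fetched behind it, and the 20% and 60% figures are made-up illustrative values, not measurements.

```python
def effective_cpi(branch_fraction, taken_fraction, flush_penalty=2, base_cpi=1.0):
    """Average cycles per instruction when every taken branch costs a flush."""
    return base_cpi + branch_fraction * taken_fraction * flush_penalty


# Hypothetical workload: 20% of instructions are branches, 60% of those are taken.
cpi = effective_cpi(branch_fraction=0.20, taken_fraction=0.60)
print(f"effective CPI: {cpi:.2f}")
print(f"slowdown vs ideal: {(cpi - 1.0) * 100:.0f}%")
```

Even this modest penalty adds up quickly, which is why deeper pipelines, where branches resolve later, lean so heavily on branch prediction.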

Section 3.3: Basic Hazard Resolution - Stalling and Forwarding

To ensure correct program execution, hazards must be detected and resolved by the processor's control logic.

Stalling (Pipeline Bubbles)

The most straightforward solution to a hazard is to stall the pipeline. When the hazard detection logic in the ID stage identifies a dependency (e.g., a RAW hazard), it can freeze the early stages of the pipeline and insert no-operation instructions, or "bubbles," into the later stages. For the add/sub example above, the sub instruction would be held in the ID stage until the add instruction reaches its WB stage and the new value of x5 can be read from the register file: two stall cycles if the register file writes in the first half of the cycle and reads in the second, three otherwise. While simple and effective, stalling is inefficient as it directly reduces the pipeline's throughput.

Forwarding (Bypassing)

A much more efficient solution for most data hazards is forwarding, also known as bypassing. The key observation is that the result of an operation is often available within the pipeline long before it is written back to the register file. For example, the result of the add instruction is available at the output of the ALU at the end of the EX stage. Forwarding logic adds extra data paths to send this result directly from the output of a later stage (like EX or MEM) back to the input of an earlier stage (like EX) for a subsequent, dependent instruction. This bypasses the need to wait for the result to be written to and then read from the register file. In the add/sub example, the result from the add instruction's EX stage can be forwarded directly to the input of the sub instruction's EX stage, completely eliminating the stall.

However, forwarding cannot solve all data hazards. A classic case is the load-use hazard. Consider this sequence:

lw  x5, 0(x1)   // Instruction 1
add x6, x5, x2  // Instruction 2

The lw instruction only has the data from memory available at the end of its MEM stage. The add instruction needs this data at the beginning of its EX stage. Even with a forwarding path from the MEM stage back to the EX stage, the data arrives one cycle too late. The add instruction must be stalled for one cycle. This limitation, along with the performance penalty from control hazards and the inefficiency of handling long-latency operations like floating-point division, reveals the inherent performance ceiling of a rigid, in-order pipeline. It is this ceiling that motivates the development of more sophisticated, dynamic execution techniques that can look further ahead in the instruction stream to find independent work to do.
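The stall counts discussed in this section follow from comparing when a value becomes available with when the consumer wants it. The Python sketch below does that bookkeeping for the 5-stage pipeline. It is a simplified model, not a cycle-accurate simulator: stall_cycles is a hypothetical helper, stages are numbered IF=0 through WB=4, and without forwarding the register file is assumed to write in the first half of a cycle and read in the second.

```python
ID, EX, MEM, WB = 1, 2, 3, 4  # stage offsets from the cycle an instruction is fetched


def stall_cycles(producer, distance, forwarding):
    """Stalls needed by a consumer `distance` instructions after the producer.

    producer: 'alu' (result ready at the end of EX) or 'load' (ready at the end of MEM).
    distance: 1 if the consumer is the very next instruction.
    """
    if forwarding:
        produced_in = MEM if producer == "load" else EX
        earliest_use = produced_in + 1  # usable in the cycle after it is produced
        nominal_use = distance + EX     # cycle in which the consumer would enter EX
    else:
        earliest_use = WB               # read in ID during the producer's WB cycle
        nominal_use = distance + ID     # cycle in which the consumer would be in ID
    return max(0, earliest_use - nominal_use)


print("add -> sub, no forwarding:", stall_cycles("alu", 1, forwarding=False))
print("add -> sub, forwarding:   ", stall_cycles("alu", 1, forwarding=True))
print("lw  -> add, forwarding:   ", stall_cycles("load", 1, forwarding=True))
```

Placing one independent instruction between the lw and the add (distance 2) brings the load-use stall to zero in this model, which is the kind of reordering a compiler or a dynamic scheduler tries to achieve.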

| Hazard Type | Description | Example RISC-V Sequence | Simple Pipeline Effect | Solution(s) |
| :---------- | :---------- | :---------------------- | :--------------------- | :---------- |
| Structural | Two instructions need the same resource in the same cycle. | lw in MEM stage, add in IF stage, both needing a unified memory. | One instruction must stall. | Separate Instruction/Data Memories (Harvard Architecture). |
| Data (RAW) | An instruction needs the result of a previous, unfinished instruction. | add x5, x1, x2 followed by sub x6, x5, x3 | sub reads a stale value of x5 from the register file. | Stalling, Forwarding (Bypassing). |
| Control | The address of the next instruction is unknown due to a branch. | beq x1, x2, L1 followed by add x3, x4, x5 | Processor fetches add before knowing if the branch to L1 is taken. | Stall until branch resolves, Branch Prediction, Flush incorrect path. |

Part IV: The Brains of the Operation - Dynamic Scheduling with Tomasulo's Algorithm

The limitations of in-order pipelining become severe in the presence of long-latency operations (like floating-point arithmetic or cache misses) and frequent data dependencies. Stalls can quickly dominate the execution time, leaving valuable functional units idle. To overcome this, high-performance processors employ dynamic scheduling, a technique that allows instructions to execute out of their original program order. The seminal hardware algorithm for this is Tomasulo's algorithm, first implemented in the IBM System/360 Model 91.

Section 4.1: Beyond In-Order Execution

The core idea behind dynamic scheduling is to shift from a control-flow-driven execution model to a dataflow-driven one. In a simple pipeline, an instruction executes when it reaches the front of the line. In a dynamically scheduled machine, an instruction is allowed to execute as soon as all of its required operands are available, regardless of its position in the original program sequence. This decoupling of instruction issue (fetching and decoding) from execution allows the processor to look ahead in the instruction stream, find independent instructions, and execute them while a prior, dependent instruction is stalled waiting for its data. This significantly increases the utilization of the processor's multiple execution units and improves overall performance.

Section 4.2: Core Components of the Tomasulo Machine

Tomasulo's algorithm achieves this dataflow execution through three key hardware components that work in concert.

Reservation Stations (RS)

Instead of a single pipeline, a Tomasulo-based processor has a set of functional units (e.g., one or more adders, multipliers, load/store units), each equipped with its own set of buffers called Reservation Stations (RS). When an instruction is decoded, it is issued to a free reservation station associated with the required functional unit. The RS acts as a waiting area, holding the instruction until it is ready to execute.

Each entry in a reservation station contains the following fields:

Busy: A bit indicating whether the station is in use.
Op: The operation to be performed (e.g., ADD, MUL).
Vj, Vk: The actual values of the two source operands. These fields are filled if the operand values are already available in the register file when the instruction is issued.
Qj, Qk: The source operand tags. If an operand is not yet available because it is being produced by another instruction currently in-flight, these fields will hold a tag that identifies which reservation station will produce the required result. A value of zero or null in these fields indicates that the corresponding V field holds a valid operand.
Dest: A tag identifying the destination of the result (in modern implementations, this is a pointer to a Reorder Buffer entry).

The RS continuously monitors for its required operands. Once both Qj and Qk are zero (meaning Vj and Vk are both valid), the instruction is ready to be dispatched to its functional unit for execution.
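To make the field list concrete, here is a minimal sketch of one reservation station entry as a Python dataclass. The class name, the use of `None` for "no tag", and the `ready` helper are illustrative choices rather than part of any real simulator; the sample values are borrowed from the walkthrough later in this part.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                     # tag this station broadcasts, e.g. "Mult1"
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None    # operand values, valid when the matching Q is None
    vk: Optional[float] = None
    qj: Optional[str] = None      # tags of the producing stations (None = value ready)
    qk: Optional[str] = None
    dest: Optional[str] = None    # destination tag

    def ready(self) -> bool:
        """An entry may start executing once it is busy and waits on no tag."""
        return self.busy and self.qj is None and self.qk is None

mult1 = ReservationStation("Mult1", busy=True, op="MUL.D", vk=2.0, qj="Load2")
print(mult1.ready())              # False: still waiting for Load2

mult1.vj, mult1.qj = 5.0, None    # the operand arrives
print(mult1.ready())              # True
```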

The Common Data Bus (CDB)

The Common Data Bus (CDB) is a broadcast bus that connects the outputs of all functional units to the inputs of all reservation stations and the register file. When a functional unit finishes its computation, it does not just write the result to a register. Instead, it places both the computed value and its unique tag (the name of the reservation station that produced it) onto the CDB.

All reservation stations are "snooping" (monitoring) the CDB in every cycle. If an RS sees a tag on the CDB that matches a tag in its Qj or Qk field, it knows its long-awaited operand is now available. It grabs the value from the CDB, places it into the corresponding Vj or Vk field, and clears the Qj or Qk field to zero. This mechanism allows results to be forwarded directly from producer to consumer without ever needing to pass through the register file, dramatically reducing stalls from RAW dependencies.
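The snooping step can be sketched in a few lines. This is a simplified model, assuming each station is a plain dictionary with `vj`/`vk`/`qj`/`qk` keys and that `None` means "no tag"; the `broadcast` function is a hypothetical helper, not a description of real hardware signals.

```python
def broadcast(stations, tag, value):
    """Put (tag, value) on the CDB; return the names of stations that captured it."""
    captured = []
    for rs in stations:
        hit = False
        for q, v in (("qj", "vj"), ("qk", "vk")):
            if rs[q] == tag:
                rs[v], rs[q] = value, None   # grab the value, clear the tag
                hit = True
        if hit:
            captured.append(rs["name"])
    return captured

stations = [
    {"name": "Mult1", "vj": None, "vk": 2.0, "qj": "Load2", "qk": None},
    {"name": "Add1", "vj": 10.0, "vk": None, "qj": None, "qk": "Load2"},
]
print(broadcast(stations, "Load2", 5.0))   # ['Mult1', 'Add1']
```

A single broadcast wakes up every consumer at once, which is why one result can unblock several waiting instructions in the same cycle.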

Hardware Register Renaming

Out-of-order execution introduces the possibility of WAR and WAW hazards, which were not a problem in the simple in-order pipeline. Tomasulo's algorithm elegantly eliminates these hazards through a mechanism called hardware register renaming.

Why are WAR and WAW hazards a problem in out-of-order execution? You can think about it yourself, or read Appendix A.

The key is to decouple the architectural registers (the names visible to the programmer, e.g., F0, F2, F4) from the physical storage locations (the reservation stations). A mapping table, often called the Register Alias Table (RAT) or Register Result Status, maintains the current mapping. For each architectural register, this table stores the tag of the reservation station that will produce the next value for that register.

The process works as follows:

Issue: When an instruction like ADD.D F6, F8, F2 is issued, the control logic looks up F8 and F2 in the RAT.
- If the RAT entry for a source register is empty, the value is ready in the main register file. This value is copied to the V field of the reservation station.
- If the RAT entry contains a tag (e.g., Add1), it means another instruction is currently computing the value. This tag is copied into the Q field of the new reservation station.
Rename: After reading the source tags, the logic updates the RAT entry for the destination register, F6, with the tag of the newly allocated reservation station (e.g., Add2). Now, any subsequent instruction that needs F6 will be directed to get its value from Add2.

This renaming process breaks false dependencies. If a later instruction also writes to F6 (a WAW hazard), it will simply be allocated a new reservation station (Add3), and the RAT will be updated to point to Add3. The original ADD.D instruction is unaffected because it is already linked to Add2. Similarly, WAR hazards are eliminated because source operands either get their value immediately or are linked to a specific producer via a tag; a subsequent write to that source register will be renamed to a new physical location and will not affect the original value needed by the earlier instruction.
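The issue and rename steps can be sketched as follows. This is a minimal model, assuming the RAT is a dictionary from register name to station tag (a missing key means the value is in the register file); the `issue` function and the register values are made up for illustration.

```python
def issue(rat, regfile, station, dest, src1, src2):
    """Issue one instruction: read sources through the RAT, then rename dest."""
    entry = {"name": station}
    for v, q, src in (("vj", "qj", src1), ("vk", "qk", src2)):
        tag = rat.get(src)                  # None means the value is in the register file
        entry[q] = tag
        entry[v] = regfile[src] if tag is None else None
    rat[dest] = station                     # rename only after the sources are read
    return entry

regfile = {"F2": 5.0, "F6": 10.0, "F8": 1.0}
rat = {"F8": "Add1"}                        # F8 is still being computed by Add1

add2 = issue(rat, regfile, "Add2", "F6", "F8", "F2")   # ADD.D F6, F8, F2
print(add2["qj"], add2["vk"])               # Add1 5.0

add3 = issue(rat, regfile, "Add3", "F6", "F2", "F2")   # a later write to F6 (WAW)
print(rat["F6"])                            # Add3
```

The second write to F6 simply moves the RAT entry to Add3. Add2 keeps its own tag, so its result still reaches any consumer that was linked to it earlier.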

Section 4.3: A Cycle-by-Cycle Walkthrough of Tomasulo's Algorithm

To solidify these concepts, a detailed, cycle-by-cycle trace of a sequence of dependent instructions is invaluable. This walkthrough will demonstrate the dynamic interplay between the reservation stations, the RAT, and the CDB.

Simulation Setup:

Functional Units:
- 1 Integer Unit (for effective address calculation): 1 cycle latency.
- 2 FP Adders (for ADD.D, SUB.D): 2 cycles latency.
- 2 FP Multipliers (for MUL.D): 10 cycles latency.
- 1 FP Divider (for DIV.D): 40 cycles latency.
Instruction Issue: 1 instruction per cycle.
CDB: 1 result can be broadcast per cycle.
Reservation Stations: 3 for Add/Sub, 2 for Mult/Div.

Example Instruction Sequence:

L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Initial State:

Register File: R2=100, R3=200, F4=2.0. All other FP registers have some initial value.
Memory: Mem[134]=10.0, Mem[245]=5.0.
All Reservation Stations are empty.
Register Result Status is empty (all values are in the Register File).

Cycle 1:

Events: L.D F6, 34(R2) is issued.
Actions: A Load buffer (Load1) is allocated. The value of R2 (100) is read from the integer register file. The effective address is calculated immediately: $100 + 34 = 134$. The Register Result Status for F6 is updated to point to Load1.

| Instruction | Issue | Execute | Write Result |
| :---------- | :---- | :------ | :----------- |
| L.D F6, 34(R2) | 1 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk | Address |
| :------------------- | :--- | :--- | :-- | :-- | :-- | :-- | :------ |
| Load1 | Yes | Load | 100 | | | | 34 |
| Load2 | No | | | | | | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
| :--------------------- | :-- | :-- | :-- | :---- | :-- | :-- |
| Qi | | | | Load1 | | |

CDB Activity: None.

Cycle 2:

Events: L.D F2, 45(R3) is issued. Load1 begins memory access.
Actions: Load2 buffer is allocated. Value of R3 (200) is read. Effective address $200 + 45 = 245$ is calculated. Register Result Status for F2 is updated to Load2.

| Instruction | Issue | Execute | Write Result |
| :---------- | :---- | :------ | :----------- |
| L.D F6, 34(R2) | 1 | 2 | |
| L.D F2, 45(R3) | 2 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk | Address |
| :------------------- | :--- | :--- | :-- | :-- | :-- | :-- | :------ |
| Load1 | Yes | Load | 100 | | | | 34 |
| Load2 | Yes | Load | 200 | | | | 45 |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
| :--------------------- | :-- | :---- | :-- | :---- | :-- | :-- |
| Qi | | Load2 | | Load1 | | |

CDB Activity: None.

Cycle 3:

Events: MUL.D F0, F2, F4 is issued. Load1 completes memory access. Load2 begins memory access.
Actions: A multiplier RS (Mult1) is allocated. RAT is checked for sources F2 and F4. F2 is being produced by Load2, so Qj of Mult1 gets tag Load2. F4 is ready in the register file, so its value (2.0) is copied to Vk. RAT for destination F0 is updated to Mult1.

| Instruction | Issue | Execute | Write Result |
| :---------- | :---- | :------ | :----------- |
| L.D F6, 34(R2) | 1 | 2 | 3 |
| L.D F2, 45(R3) | 2 | 3 | |
| MUL.D F0, F2, F4 | 3 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk | Address |
| :------------------- | :--- | :---- | :-- | :-- | :---- | :-- | :------ |
| Load1 | Yes | Load | ... | | | | |
| Load2 | Yes | Load | ... | | | | |
| Mult1 | Yes | MUL.D | | 2.0 | Load2 | | |
| Add1 | No | | | | | | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
| :--------------------- | :---- | :---- | :-- | :---- | :-- | :-- |
| Qi | Mult1 | Load2 | | Load1 | | |

CDB Activity: Load1 broadcasts result Mem[134] (value 10.0) with tag Load1.
- Snooping: No waiting RS needs Load1 yet. The register file entry for F6 is updated with the value, and the RAT tag for F6 is cleared.

Cycle 4:

Events: SUB.D F8, F6, F2 is issued. Load2 completes memory access.
Actions: An adder RS (Add1) is allocated. RAT is checked for F6 and F2. F6 is now ready (value 10.0 from Load1's broadcast), so Vj gets 10.0. F2 is still being produced by Load2, so Qk gets tag Load2. RAT for F8 is updated to Add1.

| Instruction | Issue | Execute | Write Result |
| :---------- | :---- | :------ | :----------- |
| L.D F6, 34(R2) | 1 | 2 | 3 |
| L.D F2, 45(R3) | 2 | 3 | 4 |
| MUL.D F0, F2, F4 | 3 | | |
| SUB.D F8, F6, F2 | 4 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk |
| :------------------- | :--- | :---- | :--- | :-- | :---- | :---- |
| Mult1 | Yes | MUL.D | | 2.0 | Load2 | |
| Add1 | Yes | SUB.D | 10.0 | | | Load2 |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
| :--------------------- | :---- | :---- | :-- | :-- | :--- | :-- |
| Qi | Mult1 | Load2 | | | Add1 | |

CDB Activity: Load2 broadcasts result Mem[245] (value 5.0) with tag Load2.
- Snooping: Both Mult1 and Add1 are waiting for Load2. They both snoop the CDB, capture the value 5.0, and clear their Q fields. Mult1's Vj becomes 5.0. Add1's Vk becomes 5.0.

Cycle 5:

Events: DIV.D F10, F0, F6 is issued. Both Mult1 and Add1 are now ready to execute.
Actions: A divider RS (Div1) is allocated. RAT is checked for F0 and F6. F0 is being produced by Mult1. F6 is ready. Div1 gets tag Mult1 in Qj and value 10.0 in Vk. RAT for F10 is updated to Div1. Mult1 begins its 10-cycle execution (5.0 * 2.0). Add1 begins its 2-cycle execution (10.0 - 5.0). Note the out-of-order behavior: both start in cycle 5, but SUB.D will finish long before MUL.D, even though MUL.D comes first in program order.

| Instruction | Issue | Execute | Write Result |
| :---------- | :---- | :------ | :----------- |
| ... | ... | ... | ... |
| MUL.D F0, F2, F4 | 3 | 5 | |
| SUB.D F8, F6, F2 | 4 | 5 | |
| DIV.D F10, F0, F6 | 5 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk |
| :------------------- | :--- | :---- | :--- | :--- | :---- | :-- |
| Mult1 | Yes | MUL.D | 5.0 | 2.0 | | |
| Add1 | Yes | SUB.D | 10.0 | 5.0 | | |
| Div1 | Yes | DIV.D | | 10.0 | Mult1 | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
| :--------------------- | :---- | :-- | :-- | :-- | :--- | :--- |
| Qi | Mult1 | | | | Add1 | Div1 |

CDB Activity: None.

...This process continues. The SUB.D will finish in cycle 6 and broadcast its result. The ADD.D (instruction 6) will issue and wait only for the result from Add1 (F8); its other source, F2, is already available because Load2 broadcast in cycle 4. The MUL.D will finish in cycle 14 and broadcast, allowing the DIV.D to start its long 40-cycle execution. This detailed trace reveals how the hardware dynamically resolves dependencies and executes instructions as soon as their data is ready, maximizing parallelism.
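The timing of the whole sequence can be approximated with a short dependency calculation. This is a rough sketch rather than a full Tomasulo simulator: it ignores structural limits (number of reservation stations, functional units, and CDB contention), and it assumes that a result is broadcast in the cycle after execution finishes and that a consumer can start in the cycle after the broadcast. These conventions are inferred from the partial trace above, and the names `LATENCY`, `PROGRAM`, and `schedule` are made up for illustration.

```python
LATENCY = {"L.D": 1, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

PROGRAM = [  # (op, destination, floating-point sources)
    ("L.D", "F6", []),
    ("L.D", "F2", []),
    ("MUL.D", "F0", ["F2", "F4"]),
    ("SUB.D", "F8", ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6", ["F8", "F2"]),
]

def schedule(program):
    """Return (op, dest, issue, exec_start, exec_end, write) for each instruction."""
    last_write = {}   # register -> broadcast cycle of its latest producer (plays the RAT's role)
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1                                    # one instruction issued per cycle
        operands = max((last_write.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands) + 1                 # wait for issue and for all operands
        end = start + LATENCY[op] - 1
        write = end + 1                                  # broadcast the cycle after finishing
        last_write[dest] = write                         # rename after the sources are read
        rows.append((op, dest, issue, start, end, write))
    return rows

for row in schedule(PROGRAM):
    print(row)
```

Under these assumptions MUL.D and SUB.D both start executing in cycle 5, SUB.D finishes in cycle 6, and MUL.D finishes in cycle 14, which agrees with the trace. The later stages (DIV.D and ADD.D) depend on the simplifications listed above, so treat those numbers as an estimate.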

Section 4.4: Taming the Chaos with the Reorder Buffer (ROB)

While Tomasulo's algorithm is brilliant at extracting instruction-level parallelism, its out-of-order completion creates a significant problem: it makes handling exceptions and branch mispredictions very difficult. Suppose SUB.D completes and writes F8 before MUL.D finishes, and MUL.D then raises an arithmetic exception. The machine state is now inconsistent: the processor has modified state (F8) from an instruction that is logically after the faulting instruction. This is called an imprecise exception, and it makes operating systems and recovery mechanisms nearly impossible to implement correctly.

The solution is to add a new hardware structure, the Reorder Buffer (ROB), which extends the original algorithm to ensure that while instructions may execute out of order, they commit their results to the architectural state (the main register file and memory) in strict program order.

ROB Mechanism

The ROB is a circular buffer that operates on a First-In, First-Out (FIFO) basis. It bridges the gap between out-of-order execution completion and in-order architectural update.

Issue: When an instruction is decoded, it is allocated an entry at the tail of the ROB. This ROB entry number becomes the instruction's new tag. The register renaming table (RAT) now points to ROB entries, not reservation stations.
Execute: Instructions are still sent to reservation stations and execute out-of-order as before.
Write Result: When a functional unit completes, it broadcasts its result and its ROB tag on the CDB. The result is written into the corresponding entry in the ROB, not the register file. The ROB entry is marked as "ready". Any waiting reservation stations also snoop the CDB and grab the result.
Commit: The processor examines the instruction at the head of the ROB. If its entry is marked "ready," the instruction is committed. This means its result is finally written from the ROB to the architectural register file or memory. The instruction is then removed from the ROB (the head pointer advances). If the instruction at the head is not yet ready, the commit stage stalls, and no subsequent instructions can be committed, thus enforcing in-order retirement.

Each entry in the ROB typically contains these fields:

Busy: Indicates if the entry is valid.
Instruction Type: Specifies if it's a branch, store, or register operation.
State: Tracks the instruction's progress (e.g., Issue, Execute, WriteResult, Commit).
Destination: The architectural register number or memory address to be written.
Value: The computed result, held here until commit.
Ready: A bit indicating the result is valid in the Value field.
Exception: Stores any exception information generated during execution.
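In-order commit can be sketched with a queue. This is a simplified model, assuming an unbounded `deque` in place of a fixed-size circular buffer and no limit on how many entries retire per cycle; the `ReorderBuffer` class and its method names are illustrative. The values come from the MUL.D and SUB.D instructions in the walkthrough.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()     # head on the left, tail on the right
        self.next_tag = 0

    def issue(self, dest):
        """Allocate an entry at the tail and return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "ready": False})
        return tag

    def write_result(self, tag, value):
        """Record a result broadcast on the CDB; nothing is committed yet."""
        for e in self.entries:
            if e["tag"] == tag:
                e["value"], e["ready"] = value, True

    def commit(self, regfile):
        """Retire ready entries from the head; stop at the first one not ready."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed

rob, regfile = ReorderBuffer(), {}
mul = rob.issue("F0")              # MUL.D F0, F2, F4
sub = rob.issue("F8")              # SUB.D F8, F6, F2

rob.write_result(sub, 5.0)         # SUB.D finishes first
print(rob.commit(regfile))         # []  (MUL.D at the head is not ready)

rob.write_result(mul, 10.0)
print(rob.commit(regfile))         # ['F0', 'F8']
```

SUB.D's result sits in the buffer until MUL.D reaches the head and commits, so the register file only ever changes in program order.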

Section 4.5: Achieving Precise Exceptions and Speculation

The addition of the ROB is the key that unlocks two of the most powerful features of modern high-performance processors: precise exceptions and speculative execution.

Precise Exceptions

The ROB provides a simple and elegant mechanism for handling exceptions precisely. When an instruction (e.g., a DIV.D by zero) encounters an exception during its execution, the exception is not acted upon immediately. Instead, the exception status is simply recorded in the Exception field of the instruction's entry in the ROB. The processor continues to execute and complete other instructions out of order. The exception is only handled when the faulting instruction reaches the head of the ROB and is ready to be committed. At that point, the processor knows the exception is not speculative and is the next one to occur in the program's sequential order. It can then flush the entire pipeline and ROB, save a precise state, and jump to the operating system's exception handler.

Branch Speculation

The ROB is also the enabler of efficient branch speculation. When the processor encounters a branch, a branch predictor guesses the outcome. The processor then speculatively fetches, issues, and executes instructions from the predicted path, filling the ROB with these speculative instructions.

If the prediction is correct: The branch instruction eventually reaches the head of the ROB and is committed. The speculative instructions that follow it then commit normally as they reach the head. No time was lost.
If the prediction is incorrect: When the branch instruction is finally executed and the misprediction is discovered, the processor performs a recovery. It flushes all speculative instructions from the pipeline, reservation stations, and the ROB (in the ROB, this amounts to resetting the tail pointer to the entry just after the branch; the RAT must also be restored so that it no longer points at flushed entries). No architectural state was corrupted because none of the speculative instructions were ever committed. The processor then begins fetching from the correct path.
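The recovery step can be sketched as a list truncation. This is a minimal model, assuming recovery happens as soon as the branch resolves, so only entries younger than the branch are discarded; the `squash_after` helper and the entry layout are hypothetical, and restoring the RAT is left out.

```python
def squash_after(rob, branch_tag):
    """Drop every entry younger than the branch; older entries stay in place."""
    idx = next(i for i, e in enumerate(rob) if e["tag"] == branch_tag)
    squashed = [e["op"] for e in rob[idx + 1:]]
    del rob[idx + 1:]              # equivalent to moving the tail pointer back
    return squashed

rob = [
    {"tag": 0, "op": "MUL.D"},     # older than the branch, not speculative
    {"tag": 1, "op": "BEQ"},       # the mispredicted branch
    {"tag": 2, "op": "ADD.D"},     # speculative
    {"tag": 3, "op": "SUB.D"},     # speculative
]
print(squash_after(rob, 1))        # ['ADD.D', 'SUB.D']
print([e["op"] for e in rob])      # ['MUL.D', 'BEQ']
```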

The combination of a Tomasulo-style dataflow execution core with a Reorder Buffer for in-order commit forms the foundation of virtually all modern high-performance, out-of-order (OOO) processors. This two-part architecture elegantly solves the problems of data dependencies, false dependencies, and imprecise state, allowing for a massive increase in instruction-level parallelism.

Appendix A

WAR Hazard

A WAR hazard, or "anti-dependence," happens when an instruction wants to write to a register before an earlier instruction has finished reading that register's original value.

Here is a simple example:

# Instruction 1 w/ long latency
1.  FMUL.D  F2, F4, F6   // Multiplies F4 and F6, result goes to F2

# Instruction 2 w/ short latency, independent operands
2.  FADD.D  F4, F8, F10  // Adds F8 and F10, result goes to F4

A Tomasulo processor's goal is to maximize performance by executing instructions as soon as their operands are ready.

Instruction 1 (FMUL.D) is issued. Let's say it's a long operation that will take 10 cycles. It needs to read F4.
Instruction 2 (FADD.D) is issued right after. The processor sees its source operands (F8 and F10) are ready and that the addition functional unit is free. It's a short 2-cycle operation.

The processor executes FADD.D immediately, without waiting for the FMUL.D to finish.

So here is the WAR hazard: The FADD.D finishes in 2 cycles and wants to write its result to register F4. But the FMUL.D instruction hasn't even started its long execution yet and still needs the original value from F4! If the FADD.D were allowed to write to the actual architectural register F4, it would corrupt the input for the FMUL.D, leading to an incorrect program result.

WAW Hazard

A WAW hazard, or "output dependence," happens when two different instructions want to write to the same destination register, and the instruction that came later in the program finishes execution first.

Here is a simple example:

1.  FMUL.D  F2, F4, F6    // Writes to F2

2.  FADD.D  F2, F8, F10   // Also writes to F2

The correct final value in F2 should be the result of the FADD.D instruction, since it comes later in the program.

The FADD.D (Instruction 2) is short and finishes in 2 cycles. It's ready to write its result.
The FMUL.D (Instruction 1) is long and takes 10 cycles to finish.

So here is the WAW hazard: If the FADD.D writes its result to F2, and then 8 cycles later the FMUL.D also writes its result to F2, the final value in the register will be from the FMUL.D. This is incorrect! The result from the instruction that was supposed to happen first has overwritten the result from the instruction that was supposed to happen last.

Enabling KVM GPU Passthrough

Sun, 27 Apr 2025 10:55:00 GMT

Credits

In this article, the "Enabling IOMMU" and the "GPU Passthrough" sections are adapted from Drakeor's Blog with some clarifications and modifications. The original article is very well written and I highly recommend reading it.

If this article is helpful, make sure to check out Drakeor's blog and support him. Thanks to Drakeor for the great work!

Enabling IOMMU

Setup

In my setup, I have a host machine with an NVIDIA GeForce RTX 4090 GPU and a guest machine running Ubuntu 24.04 Server for AI training. The host machine is running Ubuntu 24.04 LTS with the 6.11 kernel. The host machine has an integrated Intel UHD Graphics 770 GPU, which is used for the host display. The NVIDIA GPU is passed through to the guest machine.

The host machine has the following hardware:

CPU: Intel Core i9-14900K
Motherboard: Gigabyte Z790 AORUS XTREME
GPU: ZOTAC GeForce RTX 4090
RAM: 64GB DDR5
Storage: 2TB NVMe SSD

Enabling IOMMU is a crucial step for GPU passthrough. It allows the guest machine to access the GPU directly while the hardware keeps the device's memory accesses isolated from the host. It takes two steps to enable IOMMU: enabling it in the BIOS and enabling it in Linux.

BIOS Settings

This tutorial assumes that you have IOMMU support for both your motherboard and CPU. Most modern server motherboards should support it, but your mileage may vary with desktop motherboards. Here are the options in BIOS corresponding to IOMMU related features:

Intel Based: Enable "Intel VT-d". May also be called "Intel Virtualization Technology" or simply "VT-d" on some motherboards.
AMD Based: Enable "SVM". May also be called "AMD Virtualization" or simply "AMD-V". Note: I've seen "IOMMU" as its own separate option on one of my motherboards, but not on any of my other motherboards. Make sure it's enabled if you do see it. If you don't see it, it's likely rolled into one of the former VT-d or AMD-V options listed above.

Some modern computers may have IOMMU enabled by default, so you may first verify whether it is enabled or not. If you are not sure, you can check the BIOS settings.

Checking for IOMMU Support on your CPU

On Ubuntu/Debian for my Intel processor, it's as easy as this:

cat /proc/cpuinfo | grep --color vmx

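If you see colored vmx in the output, your CPU supports hardware virtualization (Intel VT-x) and it is enabled. Strictly speaking, this flag does not confirm IOMMU (VT-d) support by itself; the dmesg check in the Linux GRUB Settings section below is what confirms that the IOMMU is actually active. If you see nothing, virtualization is either unsupported or disabled in the BIOS.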

The AMD equivalent is this:

cat /proc/cpuinfo | grep --color svm

There is one other BIOS setting that I recommend changing before you move on to the next section.

Make sure the Primary GPU is set to integrated and not using your passthrough graphics card. This is called "Boot GPU" and "Primary Graphics" in my BIOS. Also remember to plug your monitor into the integrated graphics port on your motherboard. This is important because the host machine will use the integrated graphics for display and the passthrough graphics card will be used by the guest machine.

It is also worth noting that some motherboards have a setting called "Above 4G Decoding" or "Resizable BAR Support". This is not the same as IOMMU. It is used for PCIe devices that require more than 4GB of address space. It is not required for IOMMU to work, but it is recommended to enable it if you have a GPU with more than 4GB of VRAM.

Once you've enabled the above settings, save and exit the BIOS. This is a one-time operation. You will not need to do this again unless you reset your BIOS settings.

Linux GRUB Settings

Add the following options to your GRUB_CMDLINE_LINUX option in the /etc/default/grub file:

sudo nano /etc/default/grub

For Intel CPUs, add the following options:

GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt video=efifb:off"

The ... in the above line is the existing options. Make sure to keep them.

For AMD CPUs, add the following options:

GRUB_CMDLINE_LINUX="... amd_iommu=on iommu=pt video=efifb:off"
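If you would rather script this edit than do it by hand, the idea can be sketched as below. This is only an illustration, assuming the variable sits on a single line in the form NAME="value"; the helper `add_kernel_params` is hypothetical and does not touch any file, so you would still need to write the result back to /etc/default/grub yourself.

```python
import re

def add_kernel_params(line: str, params: list[str]) -> str:
    """Append any missing params inside the quotes of a GRUB_CMDLINE_* line."""
    m = re.fullmatch(r'(\w+)="(.*)"', line.strip())
    if m is None:
        raise ValueError(f"not a quoted GRUB variable: {line!r}")
    name, value = m.groups()
    current = value.split()
    for p in params:
        if p not in current:       # keep existing options, avoid duplicates
            current.append(p)
    return f'{name}="{" ".join(current)}"'

line = 'GRUB_CMDLINE_LINUX="quiet"'
print(add_kernel_params(line, ["intel_iommu=on", "iommu=pt", "video=efifb:off"]))
```

Running the function twice on its own output leaves the line unchanged, so the existing options are preserved and nothing is added twice.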

And then update GRUB:

sudo grub-mkconfig -o /boot/grub/grub.cfg

Make sure to reboot your system.

Then, to check that IOMMU is enabled, we can run the following command:

sudo dmesg | grep -i -e DMAR -e IOMMU

You should see at least a message or two about it loading like below:

Feb 10 17:55:23.119993 opaleye kernel: pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
Feb 10 17:55:23.123622 opaleye kernel: pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
Feb 10 17:55:23.123691 opaleye kernel: perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank)
Feb 10 17:55:23.124108 opaleye kernel: AMD-Vi: AMD IOMMUv2 loaded and initialized

GPU Passthrough

Find IOMMU Groups

Before looking at the IOMMU Groups, I want to make sure that my graphics card is visible to the OS. I run the following command:

lspci -nnk | grep VGA

For me, this results in 2 graphics controllers being shown:

00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S GT1 [UHD Graphics 770] [8086:a780] (rev 04)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)

The first one is the integrated graphics card and the second one is the NVIDIA GPU. To list all the IOMMU groups they are part of, I'll run the following command (TheUnknownThing notes: I've modified the command because the original one from drakeor's blog was not working for me):

for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"
done | sort -V

As is shown in the figure, my RTX 4090 is in IOMMU group 12.

Loading the Correct Kernel Modules

Okay, so now that we have IOMMU all set, we need to make sure to load the correct modules for our passthrough graphics card. By default, nouveau will try to grab the graphics card when we boot.

I created a new file called /etc/modprobe.d/vfio.conf and added the following lines:

blacklist nouveau
options vfio_pci ids=10de:2684,10de:22ba

Note that I got the IDs from the IOMMU Group above. I need to pass in EVERY device in that IOMMU group or it won't work! Even though I'm not using audio, I still need to pass in the audio device in that group.
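Collecting the IDs by hand is easy to get wrong, so here is a small sketch that pulls them out of lspci -nn output. It assumes that every device in the IOMMU group is a function of the same PCI slot (true for a GPU and its audio function, but not guaranteed in general); the `device_ids` helper is hypothetical, and the sample lines are the two VGA lines shown earlier.

```python
import re

def device_ids(lspci_lines, slot):
    """Collect the [vendor:device] ID of every function on a slot such as '01:00'."""
    ids = []
    for line in lspci_lines:
        if not line.startswith(slot + "."):
            continue                                   # a different device
        found = re.findall(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", line)
        if found:
            ids.append(found[-1])                      # the ID is the last bracketed pair
    return ids

lines = [
    "00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S GT1 [UHD Graphics 770] [8086:a780] (rev 04)",
    "01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)",
]
ids = device_ids(lines, "01:00")
print("options vfio_pci ids=" + ",".join(ids))         # options vfio_pci ids=10de:2684
```

Fed with the complete lspci -nn output instead of just the VGA lines, the same call would also pick up the audio function on that slot.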

Side note: why do we need to blacklist nouveau? Because it will try to grab the graphics card and we don't want that. We want vfio-pci to grab it instead.

To ensure vfio_pci is loaded at boot, append it to /etc/modules-load.d/modules.conf:

echo "vfio_pci" | sudo tee -a /etc/modules-load.d/modules.conf

Now reboot your system.

Now run the following to make sure the correct module is being used:

lspci -nnk

Make sure you see "Kernel driver in use: vfio-pci" under the entry for your graphics card (and for every other device in its IOMMU group).

Passing the GPU to the Guest VM

If you haven't installed the virt-manager or created your VM yet, please move on to the Creating a VM section.

Recall that the PCI address is the leftmost field of the lspci output from earlier. Running lspci -Dnn prints it with the PCI domain prefix included:

0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)

We want to take that value (0000:01:00.0) and convert all the colons and dots into underscores. So for 0000:01:00.0, it will be 0000_01_00_0.
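The conversion is mechanical, so it can be written as a one-line helper. The function name `nodedev_name` is made up for this sketch; the pci_ prefix matches the virsh command used in the next step.

```python
def nodedev_name(pci_address: str) -> str:
    """Turn a PCI address like '0000:01:00.0' into a libvirt node device name."""
    return "pci_" + pci_address.replace(":", "_").replace(".", "_")

print(nodedev_name("0000:01:00.0"))   # pci_0000_01_00_0
```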

Now we need to detach the PCI device from the host machine. We can do this with the following virsh command:

virsh nodedev-detach pci_0000_01_00_0

Then we'll edit the VM we want to attach the GPU to with the following virsh command:

virsh edit <vm_name>

Under the devices tag, we'll add the GPU. Note that domain, bus, slot, and function match the PCI address we saw earlier. You can add the following anywhere in the devices section, but I like to put it at the end.

..
<devices>
...
    <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
        </source>
    </hostdev>
...
</devices>
...
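If you pass through several devices, typing the address attributes by hand gets tedious. The sketch below generates the same hostdev element from a PCI address. It is an illustration only: the `hostdev_xml` function is hypothetical and simply mirrors the snippet above, so you still paste its output in with virsh edit.

```python
import re

def hostdev_xml(pci_address: str) -> str:
    """Build a <hostdev> element for an address such as '0000:01:00.0'."""
    m = re.fullmatch(
        r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])", pci_address
    )
    if m is None:
        raise ValueError(f"unexpected PCI address: {pci_address!r}")
    domain, bus, slot, function = m.groups()
    return (
        "<hostdev mode='subsystem' type='pci' managed='yes'>\n"
        "  <driver name='vfio'/>\n"
        "  <source>\n"
        f"    <address domain='0x{domain}' bus='0x{bus}' slot='0x{slot}' function='0x{function}'/>\n"
        "  </source>\n"
        "</hostdev>"
    )

print(hostdev_xml("0000:01:00.0"))
```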

Now save the file and reboot your VM, and you should see the NVIDIA GPU in the VM. Remember to install the NVIDIA drivers in the guest machine. For a quick test, I will run the following command in the guest machine:

sudo apt update
sudo ubuntu-drivers autoinstall

Then run the following command to check that the NVIDIA drivers are installed correctly:

sudo nvidia-smi

Creating a VM

Prerequisites: Check Hardware Virtualization Support

KVM requires hardware virtualization extensions (Intel VT-x or AMD-V) to be enabled in your system's BIOS/UEFI. As we discussed earlier, I'll assume you have this enabled.
Check if the KVM modules are loaded (after installation step below):

lsmod | grep kvm

You should see kvm_intel or kvm_amd listed.

Install Libvirt

Ensure your package list is up-to-date:

sudo apt update

You'll need the Libvirt daemon, the QEMU/KVM hypervisor, and management tools.

The Libvirt package installation includes several components:

qemu-kvm: The KVM hypervisor backend.
libvirt-daemon-system: The main Libvirt daemon that runs as a system service.
libvirt-clients: Command-line tools for managing Libvirt (like virsh).
bridge-utils: Utilities for creating and managing network bridges (often needed for VM networking).
virtinst: Tools to create virtual machines (like virt-install).
virt-manager: (Optional, but Recommended) A graphical user interface for managing VMs.

sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils virtinst virt-manager

This command installs all the essential components, including the graphical virt-manager. If you are setting up a headless server, you can omit virt-manager.

Add Your User to the `libvirt` Group

By default, only the root user can manage system-wide Libvirt virtual machines. To allow your regular user account to manage VMs without using sudo for every command, add it to the libvirt group.

sudo adduser <your_username> libvirt

Replace <your_username> with your actual username.

Important: You need to log out and log back in for this group change to take effect. Alternatively, you can activate the group membership for your current shell session using newgrp libvirt (but logging out/in is generally recommended).

Verify the Installation

Check the Libvirt daemon status by executing the following command:

sudo systemctl status libvirtd

It should show as active (running). If not, try starting and enabling it:

sudo systemctl start libvirtd
sudo systemctl enable libvirtd

And check Libvirt connection (as your user, after logging back in):

virsh list --all

This command should run without errors (even if it shows an empty list of VMs). If you get a permission error, double-check that you've logged out and back in after adding your user to the libvirt group.

Create a Virtual Machine

First, download the ISO image for the OS you want to install. For this tutorial, I will use Ubuntu 24.04 Server. You can download it from the official Ubuntu website.

I recommend using the virt-manager GUI for creating and managing VMs, as it simplifies the process significantly. If you prefer command-line tools, you can use virt-install instead, but this tutorial sticks with virt-manager.

Launching `virt-manager`

To launch virt-manager, run the following command in your terminal:

virt-manager

This will open the graphical interface for managing virtual machines. The experience is quite straightforward, so I won't go into detail here. Just follow the prompts to create a new VM.

Accessing VM through `virsh console`

The virsh console command connects you to a serial console device that libvirt exposes to the virtual machine. For this to work bidirectionally (input and output), two things need to be properly configured:

Virsh console

The Virtual Machine's Libvirt XML needs to have a <console type='pty'> or similar device defined, connected to a serial port (like target port='0'). You can double-check this by running virsh dumpxml ubuntu24.04 and looking within the <devices> section for a <console> or <serial> entry.

<serial type='pty'>
  <source path='/dev/pts/3'/>
  <target type='isa-serial' port='0'>
    <model name='isa-serial'/>
  </target>
  <alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/3'>
  <source path='/dev/pts/3'/>
  <target type='serial' port='0'/>
  <alias name='serial0'/>
</console>

If this is missing, you'll need to add it using virsh edit ubuntu24.04.

Inside the Guest VM

Edit the GRUB configuration:

Open the GRUB default file in a text editor:

sudo nano /etc/default/grub

Find the line that starts with GRUB_CMDLINE_LINUX_DEFAULT. It might look something like:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

You need to add console redirection parameters. Add console=tty0 console=ttyS0,115200.

- console=tty0: Ensures output also goes to the primary virtual console (if you still have one, which you likely do for initial setup).
- console=ttyS0,115200: Directs kernel and boot messages to the first serial port (ttyS0) at a baud rate of 115200. This corresponds to the port='0' in the libvirt XML.

The line should become something like:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash console=tty0 console=ttyS0,115200"

If you already have other parameters in this line, just add the console=... parts inside the quotes, separated by spaces.
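If you would rather script this edit than do it by hand, here is a minimal Python sketch of the "append inside the quotes" rule. The helper name add_console_params is my own invention, and it only transforms a single line of text; reading and writing /etc/default/grub is left out on purpose.

```python
def add_console_params(line, params=("console=tty0", "console=ttyS0,115200")):
    """Append console parameters to a GRUB_CMDLINE_LINUX_DEFAULT line.

    Hypothetical helper for illustration: it keeps existing parameters,
    skips ones that are already present, and re-quotes the value.
    """
    key, _, value = line.partition("=")       # split at the first "=" only
    args = value.strip().strip('"').split()   # existing space-separated parameters
    for param in params:
        if param not in args:
            args.append(param)
    return key + '="' + " ".join(args) + '"'


original = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'
updated = add_console_params(original)
print(updated)
```

Running the helper twice on the same line leaves it unchanged, so it is safe to re-apply.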

Enable a Serial Getty Service:

Ubuntu uses systemd to manage services. You need to enable the service that provides a login prompt on the serial port.

sudo systemctl enable serial-getty@ttyS0.service
sudo systemctl start serial-getty@ttyS0.service

The enable command ensures it starts on boot, and start attempts to start it immediately.

After editing the GRUB configuration file, you must update the GRUB bootloader:

sudo update-grub

Also update Initramfs (necessary for console changes to take full effect early in boot):

sudo update-initramfs -u

And remember to reboot your VM, and it should now be accessible via the virsh console command.

CS188 Notes 4 - Reinforcement Learning

Sun, 20 Apr 2025 00:00:00 GMT

Note:

You could view previous notes on CS188: Lecture 9 - Markov Decision Processes (MDPs).

Also note that my notes are based on the Spring 2025 version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.

Reinforcement Learning

In this note I will go through the key concepts in the Reinforcement Learning (RL) lecture. I will also try to clarify my understanding of the Q-learning algorithm, which is a key concept in RL.

First let's categorize the topics. I'll use the same categories as in the lecture slides, adding some of my own notes.

Passive Learning
- Model-based
- Model-free
Active Learning
Approximate Q-learning
Policy Gradient

ONE SENTENCE SUMMARY: Passive learning involves evaluating a fixed policy (likely controlled by a human), while active learning seeks to improve the policy through exploration (likely operated by the model itself); model-based methods use environment models, model-free methods learn directly from experience, approximate Q-learning generalizes learning to large state spaces, and policy gradient methods optimize policies directly using gradient ascent.

I believe this is a good summary of the key concepts in RL. I will go through each of these categories in detail below. Also, I will use the structure of "HOW? -> WHY? -> PROBLEM" to explain each concept.

Passive RL

Model-based

How?

The agent learns a model of the environment (e.g., transition probabilities, rewards) and uses this model to evaluate the policy. This is done by estimating the expected value of each action in each state based on the model.

Then Solve for values as if the learned model were correct. (Trust the model)
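To make "learn a model" concrete, here is a minimal sketch that estimates $T$ and $R$ by counting observed transitions. The function name estimate_model and the tiny data set are my own assumptions, not something from the lecture.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate T(s, a, s') and R(s, a, s') from observed (s, a, s', r) tuples."""
    counts = defaultdict(Counter)     # (s, a) -> how often each s' followed
    reward_sums = defaultdict(float)  # (s, a, s') -> total reward observed
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r

    T, R = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T[(s, a, s_next)] = n / total                        # empirical probability
            R[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average reward
    return T, R


# Made-up experience: from A, "go" reached B twice and C once.
observed = [
    ("A", "go", "B", 0.0),
    ("A", "go", "B", 0.0),
    ("A", "go", "C", 1.0),
    ("B", "go", "C", 2.0),
]
T, R = estimate_model(observed)
print(T[("A", "go", "B")], R[("B", "go", "C")])
```

Once T and R are estimated, the usual policy evaluation from the MDP lecture can be run on them as if they were the true model.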

Why?

Answering "why" in this section is basically answering "why do we need a model?" The answer is that we do not have a model of the environment, so we need to learn it. This is done by estimating the transition probabilities and rewards based on the observed data. This is a key concept in RL, as it allows the agent to learn from its experiences and improve its policy over time.

Problem

The problem with this approach is that it requires a lot of data to learn the model accurately. If the model is not accurate, the agent may make suboptimal decisions based on the learned model.

Model-free

How?

In this case there are no models to guide us "what to do". We need to learn the value function directly from the data.

The simplest idea is to average observed sample values: every time you visit a state, write down what the sum of discounted rewards turned out to be, then average those sums. The downside is that this does not take the connections between states into account. For example, take the chain A -> B -> C (end). How do we calculate $V$ for states $A$ and $B$? Each state is evaluated separately: when evaluating A, we do NOT take the existing estimate of B into account; we only look at the final outcome of each episode and average it. This wastes a lot of data, because what we learned about state B could help us evaluate state A. So we need to take the connections between states into account.
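Here is a small sketch of that averaging approach on the A -> B -> C chain. The episode format (a list of (state, reward received when leaving that state) pairs) and the reward numbers are my own assumptions for illustration.

```python
from collections import defaultdict


def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted return over every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is always the return from this state onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}


episodes = [
    [("A", 0.0), ("B", 0.0), ("C", 10.0)],  # started in A, ended well
    [("B", 0.0), ("C", -10.0)],             # started in B, ended badly
]
values = direct_evaluation(episodes)
print(values)
```

With this data A is estimated at 10 while B is estimated at 0, even though A always leads to B. The bad outcome observed from B never influences A, which is exactly the wasted information described above.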

So an evolution of this is to use the Bellman equation. The idea is to use the value of the next state to help us evaluate the current state. This is done by using the Bellman equation similar to the one we used in the MDP lecture. However, we need modifications for this.

The ORIGINAL Bellman equation is:

$$ V(s) = \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V(s')] $$

And its ADAPTED version is:

$$ V(s) = \frac{1}{n}\sum_{s'} \mathrm{sample}_{s'} \quad \text{where} \ \mathrm{sample}_{s'} = R(s, a, s') + \gamma V(s') $$

The improvement over the naive version is that we reuse existing value estimates when evaluating a state. However, there is still a problem: because we use the average of all samples, we wait until the end of an episode to update values. We could update values more frequently.

So this is where the Temporal Difference (TD) Learning comes in. The idea is to update the value of the current state based on the value of the next state, without waiting for the end of the episode. Because updates happen after every transition, states and transitions that are experienced more frequently will have a greater influence on the learned values over time.

The specific type of TD learning shown here is for policy evaluation. This means we have a fixed policy π (a fixed way of choosing actions in each state), and we want to figure out the value function Vπ(s) for that policy. We are not trying to find the best policy yet, just evaluating the current one.

In TD, we have samples, and the update rule.

$$ \mathrm{sample} = R(s, \pi(s), s') + \gamma V^\pi(s') $$

The sample (or TD Target) is: "the reward I just got, plus the discounted value of where I landed (according to my current beliefs)".

The update rule is:

$$ V^\pi(s) \leftarrow (1 - \alpha)V^\pi(s) + \alpha \ \mathrm{sample} $$

We calculate the TD Error: sample - Vπ(s). This error represents the difference between our target (sample) and our current estimate (Vπ(s)). We then adjust our current estimate Vπ(s) by moving it a small step (α) in the direction of that error.

It shows that TD learning is essentially maintaining a running average of the TD targets it observes for each state. It gradually "forgets" older, potentially less accurate, information because initial value estimates might be far off. Using a decreasing learning rate α over time can help the value estimates converge more stably.
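A minimal sketch of one TD update, written directly from the sample and update rule above. The dictionary-based value table and the numbers are my own assumptions.

```python
def td_update(V, s, reward, s_next, alpha=0.5, gamma=1.0):
    """Apply one TD(0) policy-evaluation update to V[s] and return the new value."""
    sample = reward + gamma * V.get(s_next, 0.0)         # TD target
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample  # running average
    return V[s]


V = {"A": 0.0, "B": 0.0, "C": 0.0}
td_update(V, "B", 8.0, "C")  # observe B -> C with reward 8
td_update(V, "A", 0.0, "B")  # observe A -> B with reward 0
print(V)
```

After the first transition V["B"] moves halfway towards 8, and the second transition already passes part of that information back to A, without waiting for any episode to finish.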

However, there are still problems. As mentioned in the previous lecture, what really GUIDES the agent is the $Q$-values. So we need to learn the $Q$-values instead of the $V$-values.

Q-Learning (Active RL)

How?

Q-Learning is a model-free reinforcement learning algorithm used to learn the optimal action-value function (Q-values). Unlike TD learning which focuses on state values, Q-learning focuses on (state, action) pairs.

The Q-learning update rule is:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$

Similar to the TD update above, $Q(s, a)$ is the current estimate of the Q-value for state $s$ and action $a$, $\alpha$ is the learning rate, $r$ is the observed reward, $\gamma$ is the discount factor, and the term $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ is the TD error. You might wonder "why do we need to use the max operator here?" The answer is that we are trying to learn the optimal Q-value for each state-action pair. The max operator allows us to select the best action in the next state $s'$ based on the current Q-values.
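The update rule translates almost line by line into code. This is a tabular sketch with a made-up table; the function name q_update and the convention of passing an empty action list for terminal states are my own assumptions.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update to Q[(s, a)] and return the new value."""
    # max over a' of Q(s', a'); 0 if s' is terminal (no actions available)
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]


Q = defaultdict(float)  # unseen (state, action) pairs start at 0
Q[("B", "left")] = 1.0
Q[("B", "right")] = 3.0
q_update(Q, "A", "go", 1.0, "B", ["left", "right"])
print(Q[("A", "go")])
```

Here the target uses 3.0 (the best action in B), not whichever action the agent actually takes next.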

Why?

Q-learning allows us to select the best action in each state, unlike TD learning which only evaluates a fixed policy. It's called "off-policy" because it learns the optimal policy regardless of how the agent is currently behaving (exploration). The agent can follow any exploratory policy during training while still learning the greedy optimal policy.

With Q-values, we can derive our policy directly:

$$ \pi(s) = \arg\max_a Q(s, a) $$

This means choosing the action that maximizes the expected future rewards for each state.

Problem

The main challenge in Q-learning is balancing exploration and exploitation, i.e., balancing "trying new actions to discover potentially better rewards" against "using known Q-values to maximize rewards based on past experience".

This is typically addressed using an $\epsilon$-greedy policy or exploration functions:

The $\epsilon$-greedy policy works as follows:

With probability $1-\epsilon$, choose the best action (exploit)
With probability $\epsilon$, choose a random action (explore)
Gradually decrease $\epsilon$ over time to favor exploitation as learning progresses
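The three rules above fit in a few lines. This is a sketch with a dictionary Q-table; the rng parameter is an assumption I added so the randomness can be swapped out or seeded.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)                            # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit


Q = {("s", "a"): 1.0, ("s", "b"): 2.0}
greedy_pick = epsilon_greedy(Q, "s", ["a", "b"], epsilon=0.0)
random_pick = epsilon_greedy(Q, "s", ["a", "b"], epsilon=1.0)
print(greedy_pick, random_pick)
```

Decreasing epsilon over time is then just a matter of passing a smaller value on later steps; the exact schedule is a design choice.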

And the exploration function works as follows:

Define an "exploration bonus" based on the uncertainty of Q-values. Let $n(s, a)$ be the number of times action $a$ has been taken in state $s$. The exploration bonus can be defined as $\frac{1}{n(s, a)}$.
When choosing actions, add the exploration bonus to the Q-value: $Q(s, a) + \frac{1}{n(s, a)}$.
This encourages the agent to explore less frequently visited actions, balancing exploration and exploitation.
Gradually decrease the exploration bonus over time to favor exploitation as learning progresses

This approach can be more efficient than $\epsilon$-greedy, as it focuses exploration on less certain actions rather than uniformly random actions. So it IS used in practice.
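A sketch of action selection with an exploration bonus. One deliberate deviation from the formula above: I use $\frac{k}{n(s, a) + 1}$ instead of $\frac{1}{n(s, a)}$ so that never-tried actions ($n = 0$) do not cause a division by zero. Both the constant k and the +1 are my own assumptions.

```python
def explore_choice(Q, N, state, actions, k=1.0):
    """Pick the action with the highest Q-value plus exploration bonus."""
    def score(action):
        visits = N.get((state, action), 0)
        return Q.get((state, action), 0.0) + k / (visits + 1)
    return max(actions, key=score)


Q = {("s", "a"): 1.0, ("s", "b"): 0.5}
N = {("s", "a"): 10}                    # "b" has never been tried
first_pick = explore_choice(Q, N, "s", ["a", "b"])

N[("s", "b")] = 10                      # after "b" has been tried often
later_pick = explore_choice(Q, N, "s", ["a", "b"])
print(first_pick, later_pick)
```

The untried action wins at first despite its lower Q-value, and the bonus fades on its own as the visit count grows.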

Experience Replay

Experience replay is an optimization technique used in reinforcement learning, particularly in deep Q-learning. So I'll add it as a subtopic here.

How?

Experience replay enhances Q-learning by storing the agent's experiences (transitions) in a replay buffer. Instead of updating Q-values using only the most recent experience, the agent stores each new experience in the buffer and randomly samples batches of past experiences from this buffer for training.
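A minimal replay buffer sketch using only the standard library. The class name, the (s, a, r, s_next, done) tuple layout, and the fixed capacity are my own assumptions; real deep RL libraries have their own implementations.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest ones are dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up consecutive transitions.
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, "go", 1.0, step + 1, False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

Each sampled transition can then be fed to the same Q-learning update as before.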

Why?

In Q-learning, the agent learns from its experiences sequentially. Consecutive experiences are often similar, so they are strongly correlated, which makes learning inefficient. By sampling randomly from the replay buffer, the agent breaks these correlations and learns from a more diverse, more independent set of training examples.

This is especially important in deep reinforcement learning where neural networks are used to approximate Q-values.

With all the problems addressed above, we still could not put the Q-learning algorithm into practice. The problem is that the state space is too large. We cannot store the Q-values for every single state-action pair. So we need to use function approximation to generalize across similar states, and this is where Approximate Q-Learning comes in.

Hold on a second

But before that, I do believe I need to clarify some points here.

You might think: "Why is Q-learning discussed under active learning? Q-learning could be used in passive learning, while TD could also be used in active learning, is that correct?"

Yes, you are right. In CS188 (and many RL courses), the algorithms are typically presented in this order:

TD Learning is introduced first as a way to learn value functions for passive learning

Q-Learning is introduced next as a way to extend these ideas to active learning

This pedagogical approach sometimes creates the impression that these algorithms are strictly tied to their respective learning categories, but they're more flexible than that.

The main difference is that TD learning (as typically presented) learns state values V(s) while Q-learning learns state-action values Q(s,a). Q-values naturally lend themselves to policy improvement (just take argmax), which is why Q-learning is often presented in the active learning context.

Approximate Q-Learning

How?

In environments with large or continuous state spaces, it's impractical to maintain a separate Q-value for each state-action pair. Approximate Q-learning uses function approximation to generalize across similar states.

A simple solution is to recall the "feature function" that we discussed in the game tree lecture. We describe a state using a vector of features (properties) $f_1, f_2, \ldots, f_n$ and learn a linear function of these features:

$$ Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \ldots + w_n f_n(s, a) $$

Where $w_1, w_2, \ldots, w_n$ are weights that we learn through experience. This is a linear function approximation. We can also use non-linear function approximators like neural networks, but the basic idea is the same: learn a function that maps states (and actions) to Q-values.

And you might wonder: "How do we learn the weights?" The answer is that we can use the same Q-learning update rule, but instead of updating the Q-value directly, we update the weights. The trick is the simple notion of "if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features".

So the update rule becomes:

$$ w_i \leftarrow w_i + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a) $$

Where $f_i(s, a)$ is the value of the $i$-th feature for state $s$ and action $a$. This means we are updating the weights based on the features that were present in the current state-action pair.
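Here is the weight update as a sketch. The feature names ("bias", "dist_to_ghost") and the numbers are hypothetical, and best_next_q stands for $\max_{a'} Q(s', a')$, which the caller is assumed to have computed already.

```python
def q_value(weights, features):
    """Linear Q-value: the weighted sum of the active features."""
    return sum(weights[name] * value for name, value in features.items())


def approx_q_update(weights, features, r, best_next_q, alpha=0.1, gamma=0.9):
    """Shift every weight in proportion to its feature value and the TD error."""
    # Compute the difference once, before any weight changes.
    difference = r + gamma * best_next_q - q_value(weights, features)
    for name, value in features.items():
        weights[name] += alpha * difference * value
    return difference


weights = {"bias": 0.0, "dist_to_ghost": 0.0}
features = {"bias": 1.0, "dist_to_ghost": 0.5}  # f_i(s, a) for the pair just tried
approx_q_update(weights, features, r=-10.0, best_next_q=0.0)
print(weights, q_value(weights, features))
```

Something unexpectedly bad happened (reward -10), so both active features get a negative weight, and the larger feature value takes more of the blame.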

The update rule of $Q$ is still the same:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \ \mathrm{Difference} $$

Where $\mathrm{Difference} = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$

Why?

Approximate Q-learning allows reinforcement learning to scale to complex environments with huge state spaces (like Atari games, robotics, etc.) where tabular methods would be impossible.

It enables generalization across similar states, so learning in one state can improve performance in similar states, even those the agent hasn't encountered yet.

Problem

Approximate Q-learning faces these two challenges:

Forgetting - learning in one region of the state space can undo learning in another region
Feature selection - choosing the right representation for states is critical for good generalization

Policy Gradient Methods

How?

Instead of learning a value function and deriving a policy from it, policy gradient methods directly parameterize the policy itself. That is, the agent's behavior is described by a function $\pi(a|s; \theta)$, where $\theta$ are the parameters (often the weights of a neural network). The goal is to adjust $\theta$ so that the expected return (the sum of rewards) is maximized.

The core idea is to use gradient ascent: we estimate how changing the parameters would affect the expected return, and then nudge the parameters in that direction. The update rule looks like this:

$$ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) $$

where $J(\theta)$ is the expected return under the current policy.

But how do we compute this gradient? The answer is the policy gradient theorem, which tells us that the gradient of the expected return can be estimated using samples from the environment:

$$ \nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot G \right] $$

Here, $G$ is the return (sum of discounted rewards) following the action $a$ in state $s$. In practice, we run episodes, collect rewards, and use these samples to estimate the gradient.

This approach is called REINFORCE, the simplest policy gradient algorithm. Each time the agent takes an action, it computes the gradient of the log-probability of that action, multiplies it by the return, and uses that as the update direction.
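To see one REINFORCE step in isolation, here is a sketch for the simplest possible case: a softmax policy over actions with one preference per action and no state at all (a bandit). That simplification is my own assumption. For a softmax, the gradient of $\log \pi_\theta(a)$ with respect to preference $\theta_i$ is $1 - \pi_\theta(i)$ for the chosen action and $-\pi_\theta(i)$ for the others.

```python
import math


def softmax(theta):
    """Turn action preferences into probabilities."""
    top = max(theta)                              # subtract max for numerical stability
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, G, alpha=0.1):
    """One gradient-ascent step: theta += alpha * G * grad log pi(action)."""
    probs = softmax(theta)
    return [
        t + alpha * G * ((1.0 if i == action else 0.0) - p)
        for i, (t, p) in enumerate(zip(theta, probs))
    ]


theta = [0.0, 0.0]                                # two actions, initially equally likely
theta = reinforce_step(theta, action=1, G=2.0)    # action 1 was followed by return 2
print(theta, softmax(theta))
```

A positive return makes the sampled action more likely next time; a negative return would push its probability down instead.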

Why?

Policy gradient methods are powerful for several reasons. First, they allow us to optimize the policy directly, which is what we ultimately care about. This is especially useful in environments with continuous or high-dimensional action spaces, where value-based methods struggle. Policy gradients can also learn stochastic policies, which can be optimal in environments with inherent randomness or partial observability.

Another advantage is that policy gradient methods can be combined with function approximation (e.g., neural networks) to handle very large or continuous state spaces. This is the foundation of modern deep reinforcement learning algorithms.

CS188 Notes 2 - Markov Decision Processes (MDPs)

Sat, 19 Apr 2025 00:00:00 GMT

Note:

You could view previous notes on CS188: Lecture 4 - Constraint Satisfaction Problems (CSPs)

CS188: Lecture 8 - Markov Decision Processes (MDPs)

Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) represents sequential decision-making in environments where actions produce stochastic (random) outcomes, and an agent's goal is to maximize its cumulative reward over time. In an MDP, the agent faces uncertainty: it cannot always predict the result of its actions, but it must still try to act optimally.

The key components of an MDP are:

States $S$: Possible situations the agent can find itself in.
Actions $A$: The set of possible moves or decisions the agent can make in each state.
Transition Function $T(s, a, s')$: The probability that action $a$ in state $s$ leads to state $s'$.
Reward Function $R(s, a, s')$: The reward received after transitioning from $s$ to $s'$ via action $a$.
Discount Factor $\gamma$: How much the agent values future rewards compared to immediate rewards.

We define the value function $V(s)$ for each state $s$, which represents the expected cumulative reward starting from state $s$ and following the optimal policy thereafter. We also define the action-value function $Q(s, a)$ for each state-action pair $(s,a)$, which represents the expected cumulative reward starting from state $s$, taking action $a$, and then following the optimal policy thereafter.

You might think "why not just use the value function $V(s)$?" The reason is actions are easier to select from $Q$-values than values! You will see this in the following part of this lecture.

The goal is to find an optimal policy $\pi^*$, which is a mapping from states to actions ($\pi(s) = a$), maximizing the expected cumulative (usually discounted) reward from any state. In this sense, an MDP defines both the "game rules" and what it means to "play well" in that environment.

Stationary Preferences

The assumption of stationary preferences means that your relative preference between two future sequences of rewards doesn't change just because you receive the same immediate reward before both. This property imposes a recursive structure on the utility function for reward sequences.

Formally, the utility $U$ of a sequence $[r_0, r_1, r_2, ...]$ must satisfy:

$$ U([r_0, r_1, r_2, ...]) = f(r_0, U([r_1, r_2, ...])) $$

where $f$ is some consistent function. If we assume $f$ is linear, this recursion unrolls to only two possible forms for the utility function (after appropriate normalization):

Additive Utility: $U = r_0 + r_1 + r_2 + \cdots$ (corresponds to $\gamma = 1$)
Discounted Utility: $U = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$ (where $0 \leq \gamma < 1$)

The discounted utility is the standard in MDPs, as it ensures convergence for infinite horizons and reflects the diminishing importance of rewards further in the future.
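Both forms can be computed by the same small function, and the recursive structure $U([r_0, r_1, \ldots]) = r_0 + \gamma U([r_1, \ldots])$ is easy to check numerically. This is a sketch with made-up reward sequences.

```python
def discounted_utility(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0]
additive = discounted_utility(rewards, gamma=1.0)    # plain sum
discounted = discounted_utility(rewards, gamma=0.5)  # later rewards count less
# Recursive form: first reward plus gamma times the utility of the rest.
recursive = rewards[0] + 0.5 * discounted_utility(rewards[1:], gamma=0.5)
print(additive, discounted, recursive)
```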

Why MDPs?

MDPs are particularly suitable when:

The environment is stochastic: the same action in the same state can yield different results.
Rewards may be delayed: the value of an action now may be realized only after multiple future steps.

Unlike simple search algorithms (e.g., greedy or expectimax), MDPs explicitly model both uncertainty (via $T$) and the accumulation of rewards over time (via $R$ and $\gamma$). While solving an MDP requires knowledge of $T$ and $R$, reinforcement learning (RL) methods learn optimal policies directly from experience, using the MDP framework as a theoretical foundation.

MDPs vs Expectimax

Both MDPs and expectimax handle uncertainty and aim for maximum expected utility. Expectimax, however, is typically used to compute the expected value of actions from a specific starting point, often with a tree structure and a finite horizon. MDPs, in contrast, compute a policy—the best action for every possible state—naturally handling cycles and infinite (discounted) horizons.

In short: expectimax is a limited lookahead from the current state; solving an MDP finds a full strategy for all states.

MDPs and Multi-Agent Games

Standard MDPs are designed for a single agent interacting with a stochastic environment. They do not directly accommodate multiple strategic agents whose actions affect each other's outcomes. Multi-agent situations typically require other formalisms, such as stochastic games or Markov games.

MDPs vs Greedy Search

Greedy algorithms make decisions based solely on immediate rewards, without considering long-term consequences. MDPs, by calculating the expected sum of (possibly discounted) future rewards, are inherently long-sighted. Optimizing for the value function $V^*(s)$ or the action-value function $Q^*(s,a)$, MDPs look ahead through the space of future possibilities, not just the next step.

ONE SENTENCE SUMMARY:
Markov Decision Processes are mathematical models for sequential decision-making under uncertainty, aiming to find policies that maximize expected (possibly discounted) cumulative reward, and forming the theoretical foundation for reinforcement learning.

CS188 Notes 3 - Markov Decision Processes (MDPs) II

Sat, 19 Apr 2025 00:00:00 GMT

Note:

You could view previous notes on CS188: Lecture 8 - Markov Decision Processes (MDPs).

Markov Decision Processes (MDPs)

After the previous lecture, I realized I had some misunderstandings about the Policy Iteration algorithm, especially when compared to Value Iteration. So here, I'll clarify my understanding of these two core approaches for solving MDPs.

Why use a "fixed policy" in Policy Iteration?

It can be confusing at first that Policy Iteration evaluates a fixed policy. You might ask: does using a fixed, possibly non-optimal policy ever lead to the optimal one?

The answer is that evaluating a fixed policy is an essential intermediate step towards finding the optimal policy. We might "evaluate" a policy that is not optimal, but doing so yields valuable information about the expected future rewards of that policy, which we use to improve it, so what we finally act on is the optimal policy.

In Policy Iteration, we loop between two key phases:

Step 1: Policy Evaluation

We begin with an initial policy $\pi$ (random, greedy, whatever). For this $\pi$, we compute the exact utility $V^{\pi}(s)$ for each state $s$ under the assumption that we always follow $\pi$. The Bellman equation for this is:

$$ V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') [ R(s, \pi(s), s') + \gamma V^{\pi}(s') ] $$

This evaluates the policy's long-term value at every state, given that policy.

Step 2: Policy Improvement

Now that we have $V^{\pi}$, we look at each state $s$ and ask: "Is there an action $a$ that would improve my expected future rewards if I took it immediately, then continued with $\pi$?"

For each state, we consider:

$$ Q^{\pi}(s, a) = \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V^{\pi}(s') ] $$

We then build a new policy by setting:

$$ \pi_{\text{new}}(s) = \arg\max_a Q^{\pi}(s, a) $$

That is, for each state, choose the action that looks best based on the values under the old policy. This is the policy improvement step.

Repeat: We now re-evaluate the new policy $\pi_{\text{new}}$, and the process continues until the policy stops changing. This guarantees convergence to the optimal policy $\pi^*$ and optimal value function $V^*$. Evaluating a fixed policy at each stage is essential for knowing both how good our current strategy is and how to improve it.
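The two phases can be sketched on a toy MDP. Everything about the example is my own assumption: two states (0 and 1), two actions, deterministic transitions, and a reward of 1 only for staying in state 1. The model is stored as (s, a) -> list of (probability, next state, reward).

```python
def q_from_v(mdp, V, s, a, gamma):
    """Expected value of taking a in s, then following the policy behind V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])


def evaluate_policy(mdp, policy, gamma, tol=1e-10):
    """Step 1: iterate the fixed-policy Bellman equation until values settle."""
    V = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s in policy:
            new_v = q_from_v(mdp, V, s, policy[s], gamma)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def policy_iteration(mdp, states, actions, gamma=0.9):
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        V = evaluate_policy(mdp, policy, gamma)
        # Step 2: act greedily with respect to the values of the old policy.
        new_policy = {
            s: max(actions, key=lambda a: q_from_v(mdp, V, s, a, gamma))
            for s in states
        }
        if new_policy == policy:              # policy stopped changing
            return policy, V
        policy = new_policy


mdp = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 0.0)],
    (1, "stay"): [(1.0, 1, 1.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}
policy, V = policy_iteration(mdp, states=[0, 1], actions=["stay", "move"])
print(policy, V)
```

Starting from "always stay", a single improvement step already switches state 0 to "move", and the next round confirms that nothing changes any more.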

What is the difference between Policy Iteration and Value Iteration?

In short:

Value Iteration is always searching for the best action at each step, directly refining the estimate of the optimal value function.
Policy Evaluation (as used in Policy Iteration) simply calculates the consequences of following a predefined plan $\pi$, without improvement during evaluation itself. Policy improvement occurs as a separate step.

Let's break down the differences in detail.

Value Iteration Equation:

$$ V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V_{k}(s') ] $$

Goal: Directly compute the optimal value function $V^*(s)$.
How: Each iteration, for each state $s$, considers all possible actions $a$. For each action, it calculates the expected value (reward + discounted future value), then takes the maximum over all actions.
Policy: Implicit. The $\max$ operation is finding the best action, and the final optimal policy $\pi^*$ is extracted after $V_k$ converges.
What it computes: Iteratively refines the best possible long-term value from each state.
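For comparison, here is a value iteration sketch on a toy two-state MDP of my own making (deterministic transitions, reward 1 only for staying in state 1). The model format, (s, a) -> list of (probability, next state, reward), is also an assumption.

```python
def value_iteration(mdp, states, actions, gamma=0.9, tol=1e-10):
    """Repeat the max-over-actions Bellman backup until values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                for a in actions
            )
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V


def extract_policy(mdp, V, states, actions, gamma=0.9):
    """Read off the greedy policy once the values have converged."""
    return {
        s: max(
            actions,
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)]),
        )
        for s in states
    }


mdp = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 0.0)],
    (1, "stay"): [(1.0, 1, 1.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}
V = value_iteration(mdp, states=[0, 1], actions=["stay", "move"])
policy = extract_policy(mdp, V, states=[0, 1], actions=["stay", "move"])
print(V, policy)
```

Note that no policy exists while the loop runs; it only appears in the final extraction step.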

Policy Evaluation Equation (for a fixed policy $\pi$):

$$ V^{\pi}_{k+1}(s) = \sum_{s'} T(s, \pi(s), s') [ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') ] $$

Goal: Compute the value function $V^\pi(s)$ for the given, fixed policy $\pi$ (which may not be optimal).
How: Each iteration, for each state $s$, uses only the action prescribed by $\pi$: $a = \pi(s)$. Calculates the expected value (reward + discounted future value) following this fixed action. There is no $\max$ because the action is predetermined by $\pi$.
Policy: Explicit and fixed throughout evaluation.

Comparison Table

Does Policy Evaluation converge after more iterations than Value Iteration?

It's tempting to think that Policy Evaluation takes more iterations to converge, since it does not optimize at every step. In practice, though, Policy Iteration often converges in fewer outer iterations (policy updates) than Value Iteration, although the work per iteration can differ.

The real power of Policy Iteration comes after Policy Evaluation. Once we have $V^\pi$ for our current policy, we can often make a large jump to a better policy by improving all states at once:

$$ \pi_{\text{new}}(s) = \arg\max_a \sum_{s'} T(s, a, s') [ R(s, a, s') + \gamma V^\pi(s') ] $$

We only repeat this process until the policy stops changing, which often happens quickly and requires fewer overall iterations than Value Iteration.

CS188 Notes 1 - Constraint Satisfaction Problems (CSPs)

Fri, 18 Apr 2025 00:00:00 GMT

Note:

This is a work in progress. I will be adding more notes and examples as I go through the course. The course is available on the Berkeley website.

Note that my notes are based on the Spring 2025 version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.

I will begin my notes with Lec.4 (CSPs I) and continue from there.

CS188: Lecture 4 - Constraint Satisfaction Problems (CSPs)

The goal is to find a complete assignment (every variable has a value from its domain) such that all constraints are satisfied. CSPs are a special kind of search problem where the path to the goal doesn't matter, only the final state.

Backtracking Search

The fundamental algorithm for solving CSPs systematically is Backtracking Search. It works as follows:

Start with an empty assignment.
Select an unassigned variable.
Try assigning a value from its domain.
Check if this assignment violates any constraints with already assigned variables.
- If no violation, recursively call backtracking for the next variable. If the recursive call succeeds, we're done (or continue if finding all solutions).
- If violation, or if the recursive call returns failure, try the next value for the current variable.
If all values for the current variable have been tried and failed, backtrack: return failure to the previous call, forcing it to try a different value.

This explores the space of partial assignments in a depth-first manner. While complete (guaranteed to find a solution if one exists), basic backtracking can be very slow.
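The steps above can be sketched for the map-coloring example, assuming the only constraint is "neighbors must differ". The function and variable names are my own; variables are picked in a fixed order with no heuristics.

```python
def backtrack(assignment, variables, domains, neighbors):
    """Depth-first search over partial assignments; returns a solution or None."""
    if len(assignment) == len(variables):
        return assignment                          # every variable has a value
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # Check only against neighbors that are already assigned.
        if all(assignment.get(n) != value for n in neighbors[var]):
            assignment[var] = value
            result = backtrack(assignment, variables, domains, neighbors)
            if result is not None:
                return result
            del assignment[var]                    # undo and try the next value
    return None                                    # forces the caller to backtrack


neighbors = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
    "T": [],
}
variables = list(neighbors)
domains = {v: ["red", "green", "blue"] for v in variables}
solution = backtrack({}, variables, domains, neighbors)
print(solution)
```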

Filtering (Constraint Propagation)

Filtering techniques aim to prune the search space before or during backtracking by removing values from domains that cannot possibly lead to a solution.

Forward Checking: When a variable X is assigned a value v, look at all unassigned neighboring variables Y connected to X by a constraint. Remove any value y from Y's domain that is inconsistent with X=v.
- Why: Simple, relatively cheap check that prevents immediate failures down the line.
- Limitation: It only checks constraints between the newly assigned variable and its future neighbors. It doesn't detect inconsistencies between two unassigned variables, even if their domains have been reduced (e.g., if both NT and SA are reduced to only {Blue}, Forward Checking won't notice the NT-SA conflict until one of them is assigned).
Arc Consistency (2-Consistency): An arc X -> Y is consistent if for every value x remaining in X's domain, there exists at least one value y remaining in Y's domain such that (x, y) satisfies the constraint between X and Y. So you could think of it as a "two-way" check, an upgrade of the previously mentioned Forward Checking.
- How (AC-3 Algorithm Idea): Maintain a queue of all arcs. While the queue is not empty, pop an arc X -> Y. Check if it's consistent. If not, remove the inconsistent value(s) x from X's domain ("delete from the tail"). Crucially: If any value was removed from X, add all arcs Z -> X (where Z is a neighbor of X, other than Y) back into the queue, because the removal might make some values in Z inconsistent. Repeat until the queue is empty (no more values can be removed).
- Why: More powerful than Forward Checking. It propagates constraints between variables, potentially detecting failures much earlier (like the NT-SA {Blue} conflict). Can be used as preprocessing or maintained during search. It is more computationally expensive than Forward Checking, but it is often worth it.
K-Consistency & Strong K-Consistency: Generalizes consistency checks to k variables. K-Consistency means any consistent assignment to k-1 variables can be extended to a k-th variable. 1-Consistency = Node Consistency (unary constraints). 2-Consistency = Arc Consistency. 3-Consistency = Path Consistency.

Strong K-Consistency: Means the CSP is J-Consistent for all J from 1 to K.
- Fact: Strong n-Consistency (where n is the number of variables) guarantees a solution can be found without backtracking.
- My misunderstanding: Why "Strong"? Because the backtrack-free construction process requires the guarantee at every step k. Step k requires k-Consistency assuming the first k-1 assignments were consistent. Plain n-Consistency only guarantees the last step (n-1 to n) works, but doesn't guarantee the intermediate steps (like 2 to 3) are possible if the problem isn't also 3-Consistent, etc. A problem could be n-Consistent (vacuously, if no consistent n-1 assignments exist) but fail lower levels of consistency, requiring backtracking or even having no solution. Strong n-Consistency ensures all necessary intermediate guarantees hold.

Speeding Up Backtracking

These heuristics don't prune the search space but guide the backtracking search to potentially find solutions faster or detect failures earlier.

Variable Ordering: Minimum Remaining Values (MRV): Choose the next unassigned variable that has the fewest legal values left in its domain.
- Why ("Fail-Fast"): If a variable has 0 values, failure is detected immediately. If it has 1 value, it's forced, simplifying the problem. Variables with few values are often bottlenecks; dealing with them early is likely to prune large parts of the search tree quickly if they lead to failure. Also called "most constrained variable".
Value Ordering: Least Constraining Value (LCV): Once a variable is selected (e.g., by MRV), try assigning values from its domain in an order. Choose the value that rules out the fewest values in the domains of neighboring unassigned variables.
- Why ("Succeed-First"): Tries to keep options open for the future, increasing the chance that the current path leads to a solution without immediate backtracking. It prioritizes choices that seem less likely to cause conflicts later.

MRV and LCV often work very well together.
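The two orderings can be sketched in a few lines of C++. This assumes the same toy setting of "must differ" constraints; the Domains alias and the select_mrv and order_lcv helpers are hypothetical names for this illustration, not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical representation: domains[v] holds the values still legal for v,
// assigned[v] says whether v already has a value, and neighbors[v] lists the
// variables that share a "must differ" constraint with v.
using Domains = std::vector<std::set<int>>;

// MRV: pick the unassigned variable with the fewest remaining values
// (returns -1 if every variable is assigned).
int select_mrv(const Domains &domains, const std::vector<bool> &assigned) {
    int best = -1;
    for (int v = 0; v < static_cast<int>(domains.size()); ++v) {
        if (assigned[v]) continue;
        if (best == -1 || domains[v].size() < domains[best].size()) best = v;
    }
    return best;
}

// LCV: order var's values so that the ones ruling out the fewest values in
// unassigned neighbors' domains come first.
std::vector<int> order_lcv(const Domains &domains, const std::vector<bool> &assigned,
                           const std::vector<std::vector<int>> &neighbors, int var) {
    auto ruled_out = [&](int value) {
        int count = 0;
        for (int n : neighbors[var])
            if (!assigned[n] && domains[n].count(value)) ++count;
        return count;
    };
    std::vector<int> values(domains[var].begin(), domains[var].end());
    std::stable_sort(values.begin(), values.end(),
                     [&](int a, int b) { return ruled_out(a) < ruled_out(b); });
    return values;
}
```

A backtracking loop would call select_mrv to choose the next variable and then iterate over order_lcv's result when trying values for it.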

The Hidden Cost of try-catch

Sat, 12 Apr 2025 14:44:00 GMT

The Problem

So I was implementing my own version of standard library containers like std::map. It's a fantastic learning exercise! I got to operator[], the access-or-insert function. Looking at my existing at() method (which provides bounds-checked access), I thought, "Aha! I can reuse at() and just catch the exception if the key isn't there!"

It seems elegant, right? I wrote something like this:

T &at(const Key &key) {
    if (root == nullptr) {
        throw index_out_of_bound();
    }
    return find(key, root);
}

T &operator[](const Key &key) {
    try {
        return at(key);
    } catch (index_out_of_bound &) {
        // insert
        value_type value(key, T());
        pair<iterator, bool> result = insert(value);
        return result.first->second;
    }
}

I compiled it, feeling pretty good about the code reuse. Then I ran the benchmarks, comparing my sjtu::operator[] against std::map::operator[], especially in scenarios involving insertions (where the key doesn't initially exist), and boom: Time Limit Exceeded. Why? I looked at the benchmark script, and it contained something like this:

	//	test: erase()
	while (map.begin() != map.end()) {
		map.erase(map.begin());
	}
	assert(map.empty() && map.size() == 0);
	//	test: operator[]
	for (int i = 0; i < 100000; ++i) {
		std::cout << map[Integer(i)];
	}
	std::cout << map.size() << std::endl;

You have probably spotted the problem already, but I wasn't so lucky. I was just thinking, "Oh, maybe the insert function is slow."

The Profiling

The benchmark results are shocking. This implementation is dramatically slower – in my case, it is 88% slower – than std::map specifically when operator[] results in inserting a new element. Accessing existing elements might be fine, but the insert path is killing performance.

What gives? Is my tree balancing algorithm inefficient? Is memory allocation slow? This is where debugging tools become essential. Simple code inspection doesn't immediately reveal why it's so much slower, since the implementation does achieve $O(\log N)$ time complexity.

Time to bring out the profilers. Tools like perf (on Linux) and callgrind (part of the Valgrind suite) are designed to answer the question: "Where is my program actually spending its time?"

Beginning with perf record ./code followed by perf report is a great start, as it already provides a simple CLI view of which functions are "hot" (consuming the most CPU cycles). The perf report pointed towards _Unwind_Find_FDE and various other functions involved in stack unwinding and exception handling. This already hinted that the problem lay in how I was using a language feature rather than in my tree logic. However, I was unfamiliar with something like _Unwind_Find_FDE, so I used callgrind to look at the instruction counts in more detail.

Running callgrind: I ran valgrind --tool=callgrind ./code. Since I'm on macOS, I used qcachegrind to visualize the results.
- The visualization confirms perf's findings but with more detail. I can see that when sjtu::operator[] calls sjtu::at and at executes throw, a massive cascade of function calls related to exception handling follows, costing 87% of execution time!
- Crucially, callgrind shows the cost associated not just with the throw itself, but with the entire stack unwinding process: the runtime searching for the catch block and meticulously destroying any local objects created within the try block and intervening function calls.

The "Aha!" Moment

The profilers leave no doubt. The performance bottleneck is the deliberate, designed-in overhead of the C++ exception handling mechanism being triggered repeatedly for a non-exceptional condition (key not found during an insertion).

What Actually Happens When C++ Throws an Exception? (And Why Profilers Flag It)

After chatting with some AI chatbots and doing some googling, I realized that throwing and catching an exception isn't just a fancy goto. Instead, it involves a complex runtime process that the profilers pick up as costly operations:

Exception Object Creation: A throw expression such as throw index_out_of_bound() or throw std::out_of_range(...) creates an exception object, often involving dynamic memory allocation (heap allocation shows up in profilers).
Stack Unwinding: (The main cost flagged by profilers) The runtime walks backward up the call stack.
- It destroys local objects (RAII cleanup). Profilers show time spent in destructors during unwinding.
- It consults compiler-generated "unwinding tables". Accessing and processing this data takes time/instructions.
Handler Matching: The runtime checks catch blocks using RTTI, adding overhead.
Control Transfer: Jumping to the catch block disrupts linear execution flow, potentially causing instruction cache misses and branch mispredictions (subtler effects seen in very low-level profiling).

The profiling results, combined with understanding the mechanics, paint a clear picture:

Stack Unwinding Overhead: As callgrind showed, walking the stack, looking up cleanup actions, and calling destructors is expensive, especially compared to a simple if check.
Runtime Machinery: The hidden machinery (dynamic allocation, RTTI, table lookups) adds significant overhead absent in direct conditional logic.
Optimization Barriers: Exception handling constructs can limit compiler optimizations compared to simpler control flow, contributing to higher instruction counts seen in callgrind.

In our operator[] example, the case where the key doesn't exist is expected. By using exceptions here, we frequently trigger the heavyweight process the profilers flagged, leading to poor performance.

So what does a normal operator[] look like? It should be something like this:

    T &operator[](const Key &key) {
        Node *node = find_node(key, root);
        if (node != nullptr) {
            return node->data.second;
        } else {
            // Insert new element
            value_type value(key, T());
            pair<iterator, bool> result = insert(value);
            return result.first->second;
        }
    }

With this version, the profiler results change accordingly: the top CPU-consuming functions are now actual functions in my code, not the exception handling machinery. The find_node function is now the most expensive operation, which is expected since it involves an $O(\log N)$ tree traversal.
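If you want to reproduce the effect without writing a whole map, here is a rough sketch that uses std::map as a stand-in for my container. The names get_with_exception, get_with_check, and time_misses are hypothetical helpers for this illustration; the first mirrors the try/catch approach and the second mirrors the lookup-then-insert approach.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <stdexcept>

// Access-or-insert built on at(): every miss throws and unwinds the stack.
int &get_with_exception(std::map<int, int> &m, int key) {
    try {
        return m.at(key);
    } catch (const std::out_of_range &) {
        return m.insert({key, 0}).first->second;
    }
}

// Same behavior with a plain lookup: a miss is just a branch.
int &get_with_check(std::map<int, int> &m, int key) {
    auto it = m.find(key);
    if (it != m.end()) return it->second;
    return m.insert({key, 0}).first->second;
}

// Runs `count` misses through the given accessor on a fresh map and returns
// the elapsed time in milliseconds.
template <typename Accessor>
double time_misses(Accessor access, int count) {
    std::map<int, int> m;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) access(m, i);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing time_misses(get_with_exception, 100000) against time_misses(get_with_check, 100000) should show the gap on the insert path. The exact numbers depend on the compiler, optimization level, and platform, so treat them as a rough indication rather than a benchmark.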

My First VSCode Extension - ACMOJ Helper from Scratch

Sun, 06 Apr 2025 06:25:00 GMT

Constantly switching between the editor (VS Code) and browser was incredibly tedious. Looking at problem descriptions, examples, and outputs in the browser, then comparing results, copying code to VS Code, writing and debugging, copying back to browser for submission, and finally switching back to browser to check results... Although I could use split screen, the Stage Manager experience on macOS wasn't great. This process not only interrupted my thought flow but was also inefficient.

Could I complete all these operations within VS Code? Seeing classmates developing plugins, it didn't seem that difficult. Having recently learned Golang, TypeScript didn't seem too hard to learn either 😋 With this idea in mind, I started building my first VS Code extension, aiming to create a convenient assistant for ACMOJ. This article documents the process from conception to implementation, through pitfalls to the final working product.

Getting Started

VS Code extensions are primarily written in TypeScript (or JavaScript) and run in a Node.js environment. Before starting, the essential tools are:

Node.js & npm/yarn: The basic runtime environment and package manager.
Yeoman & generator-code: The official scaffolding tools recommended by VS Code for quickly generating the project structure. Simply run npm install -g yo generator-code followed by yo code and select TypeScript Extension.
VS Code: Needed for developing and debugging the plugin.

The generated project structure is clear and straightforward. The src/extension.ts file serves as the plugin's entry point, containing activate (called when activated) and deactivate (called when deactivated) functions. The package.json file is the core manifest file, defining the plugin's metadata, contributions (such as commands, views, configurations), and activation events (determining when to load the plugin). The tsconfig.json file contains TypeScript configuration.

My initial blueprint was to implement these core features:

Authentication: Connect to the ACMOJ API.
Problem/Assignment Browsing: View problem lists in VS Code's sidebar.
Problem Details: Display problem descriptions, examples, etc. in a Webview.
Code Submission: Quickly submit code from the current editor.
Result Viewing: View submission status and results in the sidebar or a Webview.

API Interaction and Authentication

ACMOJ provides an OpenAPI-compliant API, which forms the foundation for implementing functionality.

API Client Setup

I chose axios as the HTTP request library and encapsulated an ApiClient class to uniformly handle request sending, Base URL configuration, and error handling. The key was setting up request interceptors to automatically attach Bearer <token> in the Authorization Header.

Authentication "Episode" - OAuth vs PAT

The API documentation mentioned both OAuth2 (Authorization Code Flow) and Personal Access Token (PAT) authentication methods.

Initially, I tried implementing the OAuth2 flow. This involved directing users to browser authorization, then starting a temporary HTTP server locally to listen for callback URIs to obtain the code, then using the code and client_secret to exchange for an access_token. While this flow is standard for applications requiring multi-user authorization, it's quite complex to implement, especially handling client_secret and local callbacks securely in a VS Code extension environment. (Actually, what stopped me initially was needing a client secret from the admin team. At that time, I didn't know anyone on the admin team, though they seem to know me now after developing this plugin XD)

Considering that target users (mainly myself and classmates) could easily generate PATs on the ACMOJ website, I decided to switch to the simpler PAT authentication. This greatly simplified the flow:
- Create an AuthService (or TokenManager).
- Provide an acmoj.setToken command that uses vscode.window.showInputBox({ password: true }) to prompt users for their PAT.
- Use VS Code's SecretStorage API (context.secrets.store / context.secrets.get) to securely store and read PATs.
- Provide an acmoj.clearToken command to clear stored PATs.
- In ApiClient's request interceptor, get the stored PAT from AuthService and add it to the request headers.
- In the response interceptor, on a 401 Unauthorized error, call AuthService to clear the invalid token and prompt the user to set a new one.

Building User Interface with TreeView and Webview

To display information and provide interaction in VS Code, I mainly used TreeView and Webview.

TreeView (Sidebar)

I used the vscode.TreeDataProvider interface to create two views for the Activity Bar:

Problemsets (Contests/Assignments): Initially, I simply listed all problems but quickly found the information overwhelming. I improved it to display Problemsets that users joined. Further improvement involved categorizing Problemsets into "Ongoing", "Upcoming", and "Passed" top-level nodes based on their start/end times. This required fetching all Problemsets, then filtering and sorting them in the getChildren method based on current time and category nodes. I used two custom TreeItem types: CategoryTreeItem and ProblemsetTreeItem. Each Problemset node was set as expandable (vscode.TreeItemCollapsibleState.Collapsed), loading its contained problem list (ProblemBriefTreeItem) when clicked.

Submissions (Submission Records): This displays the user's submission list, including ID, problem, status, language, time, etc. I set different icons (ThemeIcon) for different submission statuses (AC, WA, TLE, RE...) to make them more intuitive.

The key to implementing TreeView lies in the getChildren (get child nodes) and getTreeItem (define node appearance and behavior) methods. Through EventEmitter and onDidChangeTreeData events, you can notify VS Code to refresh the view.

Webview (Detail Display)

When users click on problems or submission records in TreeView, I use vscode.window.createWebviewPanel to create a Webview for displaying detailed information. Why use webview? Because I needed to render TeX formulas, and JSON requests returned Markdown results.

Content Rendering: Webview is essentially an embedded browser environment with HTML content. I used the markdown-it library to convert Markdown-formatted problem descriptions, input/output formats, etc. obtained from the API into HTML.

Challenge: Mathematical Formula Rendering: OJ problem descriptions often contain LaTeX formulas.

Attempt One (Failed): Initially, I tried including KaTeX JS library and auto-render script in the Webview HTML for client-side rendering. However, this caused the strange issue of formulas being rendered twice (once as original text, once as KaTeX rendered result).

Attempt Two (Success): I realized the problem was in the duplicate rendering flow. The final solution was using markdown-it's KaTeX plugin, @vscode/markdown-it-katex. (The version of this plugin I first got via npm was another developer's, which was outdated and had security risks. The good news is that the VS Code team noticed the project and made subsequent fixes, so I used theirs.) When md.render() runs on the extension side (Node.js environment), this plugin directly converts LaTeX in Markdown ($...$, $$...$$) into the final KaTeX HTML structure. This way, the HTML sent to the Webview is already pre-rendered, and the Webview side only needs to include the KaTeX CSS (katex.min.css) to display styles correctly. It no longer needs the KaTeX JS and auto-render scripts.

Commands and Status Bar

I used vscode.commands.registerCommand to register various user operations (set Token, refresh views, submit code, view problems by ID, etc.). I used vscode.window.createStatusBarItem to display current login status and username on the left side of the status bar, which can trigger corresponding commands (like showing user info or setting Token) when clicked.

Packaging and Publishing

Everything worked smoothly during development and debugging (F5), but when I used vsce package to package into a VSIX file and installed it on another computer, I encountered the classic problem: Command 'acmoj.setToken' not found or Cannot find module 'axios'.

Debugging Process

I opened the Console in VS Code's developer tools (Developer: Toggle Developer Tools) on the test computer and found that activating the extension immediately failed with Cannot find module 'axios'. I then inspected the VSIX contents using the vsce ls command (renaming .vsix to .zip and extracting it also works) and discovered that the node_modules folder wasn't packaged at all!

Root Cause

I mistakenly placed runtime-required libraries (like axios, markdown-it, katex, @vscode/markdown-it-katex) under devDependencies instead of dependencies in package.json.

Dependencies are libraries required for extension runtime and will be packaged by vsce package. DevDependencies are libraries used during development (compilers, type definitions, linters, packaging tools, etc.) and will not be packaged.

Solution

I carefully checked package.json and moved all runtime dependencies (axios, etc.) to the dependencies section, while keeping development tools (typescript, @types/*, eslint, @vscode/vsce, etc.) in devDependencies. Note that the packaging tool itself, @vscode/vsce, is a dev dependency.

{
  "dependencies": {
    "@vscode/markdown-it-katex": "...",
    "axios": "...",
    "katex": "...",
    "markdown-it": "..."
  },
  "devDependencies": {
    "@types/vscode": "...",
    "@types/node": "...",
    "@types/markdown-it": "...",
    "@vscode/vsce": "...",
    "typescript": "...",
    "eslint": "..."
  }
}

Key Step: After modifying package.json, it's essential to do a clean reinstall: delete node_modules and package-lock.json, then run npm install again before packaging. I kept getting errors at first because I skipped this.

This time, the generated VSIX file finally contained the correct node_modules, and after installation, commands could be found normally and the extension activated successfully.

TypeScript Interlude

Since this is a TypeScript project, I also encountered some typical type issues:

Module/Type Not Found: Cannot find module 'vscode' or other @types packages, usually resolved by npm install --save-dev @types/vscode @types/node ...

Implicit any: After enabling strict mode, I needed to explicitly add types for callback function parameters (like progress in withProgress, text in validateInput).

API Signature Mismatch: When calling vscode.window.showQuickPick, if providing option objects, you need to pass QuickPickItem[] instead of string[], requiring mapping.

Is This the End?

While acmoj-helper can already run and has helped me considerably in daily use, during the development process, I gradually felt some "growing pains." As features iterated (even with minor adjustments), I found the code becoming somewhat messy:

Unclear Responsibilities: The commands.ts file not only handled command registration but also contained substantial complex business logic implementations like submitCurrentFile. This made the file abnormally bloated, making modifications affect the entire system.

High Coupling: Modifying one module (like cache.ts handling API caching) might unexpectedly affect views (submissionProvider.ts) or command handling. When I mentioned rewriting submissionProvider earlier, that was a typical example - the view layer was too tightly coupled with data fetching and business logic.

Registration Chaos: Command registration was scattered across extension.ts and commands.ts, lacking centralization and clarity.

Extension Difficulties: If I wanted to add new features like "Contest" view or more complex problem filtering logic, it would be extremely painful under the existing structure, requiring careful navigation through various files to ensure existing functionality wasn't broken.

Testing Obstacles: Code mixing UI logic, API calls, and business processing was very difficult to unit test.

These issues made me realize that while the current architecture works, it's not "elegant" and lacks long-term viability. To ensure this project can develop healthily and to improve my own code design skills, I decided to conduct a thorough refactoring.

Refactoring Goals: Decoupling, Layering, Single Responsibility

The new architecture I'm currently working on is roughly divided into these layers:

VS Code Integration Layer (extension.ts, src/commands/index.ts)

Service Layer (src/services/) - Responsible for encapsulating core business logic and interactions with external resources (like APIs, caching). Each service corresponds to a clear domain.

Command Handling Layer (src/commands/) - Command handlers receive calls from VS Code and then use the service layer to complete specific tasks. They serve as bridges between VS Code commands and business logic. Complex logic (like submitCurrentFile) is now clearly encapsulated in corresponding command handlers.

UI Layer (src/views/, src/webviews/) - Responsible for data display and UI interaction. The views/ directory contains TreeDataProviders (like ProblemsetProvider, SubmissionProvider) that get data from the service layer and format it into structures needed by VS Code TreeView. The webviews/ directory contains Webview Panel logic. After refactoring, I created dedicated classes for problem details and submission details (ProblemDetailPanel, SubmissionDetailPanel), encapsulating their respective HTML generation, message handling, and lifecycle management. They also get data through the service layer, and Webview operations (like "copy code") now typically send messages to VS Code via postMessage, responded to by corresponding command handlers.

Core/Data Layer (src/core/, src/types.ts) - Provides the most basic components and definitions. A typical example during refactoring was core/apiClient.ts: a purer HTTP client only responsible for sending requests, handling authentication headers, retry logic, and basic error interpretation. It no longer contains specific business endpoint logic. Previously, getUserProfile, getSubmission, etc. were all in there.

While the refactoring process was quite challenging and temporarily introduced new bugs, it laid a solid foundation for ACMOJ Helper's long-term development. Now I can more confidently implement those more comprehensive features I envisioned at the end of version 1.0.

If you're also interested in VSCode extension development or want to build integrations for tools or platforms you frequently use, don't hesitate - just start doing it! Begin with yo code, encounter problems, solve problems - this process itself is the best learning experience.

Project Repository: TheUnknownThing/vscode-acmoj

Thanks for reading! I hope my experience can be helpful to you.

I'll Never Use memset Again...

Mon, 10 Mar 2025 05:11:40 GMT

0. Foreword

This problem originated from my first programming exam during my freshman year... It was a question involving block decomposition (data chunking), and in my program, I had an operation like this:

memset(mul_tag, 1, sizeof(mul_tag));

Unsurprisingly, the program resulted in a WA (Wrong Answer). I spent a very, very long time debugging. This line looked completely harmless, didn't it? But as it turned out, simply changing this line fixed the program! Why??? The answer becomes clear when we look at the memset function prototype.

1. `memset` Function Introduction

The prototype for the memset function is as follows:

void *memset(void *s, int c, size_t n);

s: A pointer to the block of memory to fill.
c: The value to be set. Note: Although c is of type int, memset actually converts c to an unsigned char before filling.
n: The number of bytes to be set to the value.

The purpose of memset is to set the first n bytes of the memory block pointed to by s to the value specified by c.

2. The Trap

memset performs its filling operation byte by byte. When a is an int array (assuming int occupies 4 bytes), memset(a, 1, sizeof(a)) will set each byte of each int element to 1. This results in each int element having the value 0x01010101, which is 16843009 in decimal, not the 1 we were hoping for.

3. Exceptions

Using memset(a, 1, sizeof(a)) is dangerous in most scenarios. However, there are a few exceptional cases where it works as expected or is safe:

If a is a char array, memset(a, 1, sizeof(a)) is correct because the char type occupies only one byte.
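memset(a, 0, sizeof(a)) can be safely used for arrays of integer types (and, on mainstream platforms, for floating-point and pointer arrays too) to initialize the entire array to 0. It is not safe for arrays of class types that have constructors or virtual functions. (This is what we typically do! And it's precisely why I initially thought memset(a, 1, sizeof(a)) would be fine!)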
memset(a, -1, sizeof(a)) is safe for int arrays and will correctly initialize the elements to -1. Why? Hint: Computers store negative numbers using two's complement representation. The two's complement of -1 (for a 32-bit int) is 11111111 11111111 11111111 11111111, which means every byte is 0xFF. Therefore, memset(a, -1, sizeof(a)) fills every byte with 0xFF, effectively setting each int element to -1.
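A quick way to see the trap and the two safe cases for yourself is the small sketch below. It assumes a 4-byte int with two's complement representation, and element_after_memset is a hypothetical helper written just for this demonstration.

```cpp
#include <cassert>
#include <cstring>

static_assert(sizeof(int) == 4, "this sketch assumes a 4-byte int");

// Fills a small int array with memset(..., byte, ...) and returns the value
// that each element ends up holding.
int element_after_memset(int byte) {
    int a[4];
    std::memset(a, byte, sizeof(a));
    return a[0];
}
```

Under those assumptions, element_after_memset(1) yields 0x01010101 (16843009), while element_after_memset(0) and element_after_memset(-1) yield 0 and -1 respectively.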

4. You Should Use `std::fill`

Instead of memset for non-zero/non-minus-one initializations (especially in C++), you should use std::fill.

std::fill Example (C++):

#include <algorithm>
#include <array> // Or use raw arrays

std::array<int, 10> a;  // Or: int a[10];
std::fill(a.begin(), a.end(), 1); // Or: std::fill(a, a + 10, 1);

std::fill operates on elements of the container or array, assigning the specified value (1 in this case) correctly to each element, regardless of its underlying byte representation.

Installing Windows on an IPv6 VPS

Wed, 15 Jan 2025 14:59:00 GMT

If you happen to have a high-configuration cloud server (like my Afly Black Friday VPS) that doesn't provide Windows images, you might want to try installing Windows yourself using the DD method.

What is DD System Installation?

As the name suggests, DD system installation uses the dd command to transfer a vhd file to a specific partition, then configures boot files to make it bootable. As scripts have evolved, many features have been added (like installation from img or iso images, system rescue). However, this isn't the main focus - this tutorial aims to cover the pitfalls I encountered while using such scripts, and how to solve them.

My Environment Configuration

First, let me introduce my environment (these configurations might seem unusual, but these specific characteristics led to some interesting problems):

CPU: 3 Core AMD Ryzen 9 9950X
RAM: 4.5GB
SSD: 125GB
Network: IPv6 /128 Only (Yes, pure IPv6 environment with no IPv4 access! And only a /128 IPv6 allocation, which becomes important later)

Preparation

Script Used

I chose this script: https://github.com/bin456789/reinstall

I strongly recommend carefully reading the README first, as the repository contains detailed instructions on how to use the script.

System Image Selection

I used an image from TeddySun's collection, which you can find by searching https://teddysun.com/?s=DD to find your preferred image. I selected Windows 10 LTSC because it's relatively clean.

Quick Installation Commands

If you're in a hurry, here are the basic installation commands:

# Download the script
curl -O https://raw.githubusercontent.com/bin456789/reinstall/main/reinstall.sh || wget -O reinstall.sh https://raw.githubusercontent.com/bin456789/reinstall/main/reinstall.sh

# Execute the installation
bash reinstall.sh dd --img https://dl.lamp.sh/vhd/zh-cn_windows10_ltsc.xz

Remember to install curl beforehand (if your system doesn't have it)

First Issue: Incorrect DNS Configuration

This problem was mainly caused by my specific network environment. The DNS configuration in the script's Alpine environment was incorrect, preventing files from being downloaded. Here's my solution:

#!/bin/sh

# Modify /etc/resolv.conf file
echo "nameserver 2001:4860:4860::8888" > /etc/resolv.conf
echo "nameserver 2001:4860:4860::8844" >> /etc/resolv.conf

if [ -f /etc/systemd/resolved.conf ]; then
    echo "[Resolve]" >> /etc/systemd/resolved.conf
    echo "DNS=2001:4860:4860::8888" >> /etc/systemd/resolved.conf
    echo "DNS=2001:4860:4860::8844" >> /etc/systemd/resolved.conf
    systemctl restart systemd-resolved
fi

echo "DNS successfully changed to Google IPv6 DNS"

Of course, if you have a normal dual-stack environment, you probably won't encounter this issue.

Second Issue: Password Setup

I found this particularly interesting: when the script first runs, it prompts you to enter a password, but this password is not the one you'll use to log into Windows! Despite the script's README mentioning this, I missed it.

In fact, the Windows login password is determined by the image. For TeddySun's image that I used:

Username: Administrator
Password: Teddysun.com

Third Issue: Windows IPv6 Privacy Protection

This problem puzzled me for a long time. If you run ipconfig /all on a Windows computer, you might notice something called "temporary address." This is because Windows "protects your online privacy," but in my environment with only a /128 IPv6 allocation, this became a problem: external access to your machine is through that fixed IP address, but your machine accesses external websites using a temporary address. This means you can connect via Remote Desktop but can't access the internet.

The solution is simple - open Command Prompt as administrator:

netsh interface ipv6 set privacy state=disabled
rem Then restart the network adapter

Fourth Issue: Workarounds for Pure IPv6 Environment

This issue also stems from my special network environment. Not having IPv4 access is quite inconvenient, so I used Cloudflare WARP to provide IPv4 access. However, note that if you directly use the Windows version of WARP, after enabling it, your IPv6 address will also change to WARP's address, preventing you from connecting to Remote Desktop!

I used a solution provided by a user on the Nodeseek forum (original post):

Download and install the official Cloudflare WARP client
In WARP settings:
- Click the gear icon in the bottom right → Preferences
- Advanced → Configure Proxy Mode
- Enable proxy mode and set a memorable port

This effectively gives you a local SOCKS proxy with a Cloudflare-provided IPv4 exit, which you can use however you like, for example with SwitchyOmega or other tools. This way, you can maintain Remote Desktop connections while gaining IPv4 access.

Fifth Issue: LTSC Minor Problem

If you chose the LTSC 2021 version of Windows like I did, you might notice that the wsappx service is always running in the background. This issue has a solution on the PCbeta forum; if you're interested, check out this post: LTSC Optimization Guide

Configuring Typecho with Caddy, Revisited

Tue, 14 Jan 2025 17:12:00 GMT

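The very first post on this blog apparently covered how to configure Caddy. Back then, though, I used someone else's Docker image, which bundled nginx, PHP, and Typecho, and I simply had Caddy reverse-proxy its port.

Now that I've picked up this Alibaba Cloud server, its 512 MB of RAM is rather cramped. To get rid of the extra memory overhead of nginx and Docker, I decided to set up the Typecho environment by hand.

First, install PHP, the best programming language in the world

1. Add the Sury PPA repository

First, add the PPA that contains the latest PHP packages. This requires installing a few dependency packages.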

sudo apt update
sudo apt install lsb-release apt-transport-https ca-certificates software-properties-common -y

With the tools installed, import the GPG key for the Sury repository. Sury carries almost every PHP version. Typecho requires PHP 7.4 or later, so we'll install 8.2.

sudo wget -O /etc/apt/trusted.gpg.d/php.gpg https://packages.sury.org/php/apt.gpg

Then add the repository to your sources list.

sudo sh -c 'echo "deb https://packages.sury.org/php/ $(lsb_release -sc) main" > /etc/apt/sources.list.d/php.list'

Update the package list to verify that the repository works.

sudo apt update

2. Install the PHP 8.2 Packages

Install PHP 8.2 and its commonly used extensions.

sudo apt install php8.2 php8.2-cli php8.2-fpm php8.2-mysql php8.2-curl php8.2-gd php8.2-mbstring php8.2-xml php8.2-zip php8.2-opcache php8.2-sqlite3 -y

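Install Caddy v2

Everything I found online uses Caddy v1 plus a special URL rewrite configuration. One very important improvement in Caddy v2 is that it needs no extra rewrite configuration, so v2 is clearly more convenient. We don't want to cut our feet to fit the shoes.

Caddy provides official installation commands, which are of course what I recommend most: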

sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy

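Of course, if you need extra plugins (such as dns-providers), you can build your own binary with xcaddy, but that is not the focus of this post, so I'll skip it.

Configure the Caddyfile

As I said, I don't want to cut my feet to fit the shoes, so I'd rather you learn the Caddyfile than the traditional JSON configuration file.

The following configuration has been tested and works as pasted once you fill in the placeholders: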

YOUR WEBSITE {
        encode gzip
        log
        tls YOUR EMAIL
        header Strict-Transport-Security max-age=31536000
        root * /var/www/YOUR WEBSITE
        php_fastcgi unix//run/php/php8.2-fpm.sock
        file_server
}

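By now, do you have a real sense of how convenient Caddy v2 is? You don't need to configure anything for php-fpm, and you don't need URL rewrite rules. It really is an out-of-the-box experience!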
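If you manage more than one site, filling in the placeholders by hand gets tedious. The sketch below renders the Caddyfile above from a domain and an email address. The function name `render_caddyfile` and its parameters are my own illustration, not part of Caddy, and it assumes the site root follows the /var/www/DOMAIN layout used in this post.

```python
def render_caddyfile(domain: str, email: str, php_version: str = "8.2") -> str:
    """Fill the Caddyfile template from this post with concrete values."""
    lines = [
        f"{domain} {{",
        "        encode gzip",
        "        log",
        f"        tls {email}",
        "        header Strict-Transport-Security max-age=31536000",
        # Assumes the site lives in /var/www/<domain>, as in this post
        f"        root * /var/www/{domain}",
        f"        php_fastcgi unix//run/php/php{php_version}-fpm.sock",
        "        file_server",
        "}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `print(render_caddyfile("example.com", "me@example.com"))` produces a block you can save as Caddyfile.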

The Last Step! Add Your Typecho Site Files

Download the latest Typecho release with wget:

wget https://github.com/typecho/typecho/releases/latest/download/typecho.zip

Use the unzip command to extract Typecho under /var/www.

First make sure the /var/www directory exists:
```
sudo mkdir -p /var/www
```
Then create a directory for your site. For example, my site is 20051110.xyz, so I create:
```
sudo mkdir /var/www/20051110.xyz
```

Extract typecho.zip into /var/www/your-site-directory:

cd /var/www/your-site-directory
sudo unzip /root/typecho.zip # replace this with wherever you actually downloaded Typecho

The last step! Fix the file ownership: change the owner of /var/www and everything under it to the www-data user and group (the user that web servers typically run as):

sudo chown -R www-data:www-data /var/www

You can check whether your directory structure looks like this:

/var/www/your-site-directory
├── admin/
├── install/
├── usr/
├── var/
├── index.php
├── install.php
└── ...
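The same check can be done without eyeballing the tree. This is a small sketch that reports which of the entries listed above are missing. The name `missing_typecho_entries` is hypothetical, and the list only covers the entries shown in this post, not every file in a Typecho release.

```python
from pathlib import Path

# Entries shown in the directory tree above
EXPECTED = ["admin", "install", "usr", "var", "index.php", "install.php"]


def missing_typecho_entries(site_root: str) -> list[str]:
    """Return the expected entries that do not exist under site_root."""
    root = Path(site_root)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty list means the archive was extracted into the right place. If everything is reported missing, the files probably landed in a different directory.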

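If it does, then Caddy, launch!

caddy run --config=Caddyfile # I'm in the same directory as the Caddyfile, so this is all I need

Caddy will request a certificate for you automatically. Once that is done, you can visit the site and install Typecho (don't worry, it's all graphical, just a few clicks).

Finally, if you like my Typecho theme, come take a look at its GitHub repository! I built it on top of the original Fantasy theme, rewrote almost all of the CSS, added dark mode support, and added a few features (such as a short-notes page and a Hitokoto quote in the footer). Stars are welcome! My repository link

Thanks for reading this far! I hope this post was useful to you!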

A Few Things About OpenWRT Compilation

Sun, 20 Oct 2024 01:05:00 GMT

Let me answer a few questions first

Why compile it myself? I'm a mature computer science student (lol).
Why use Github Actions? Because Github is really convenient ~~I originally wanted to use my dedicated AMD 9950X as the build machine, but it failed after 15 minutes and I was too lazy to troubleshoot~~
Why can't I understand this? ~~If you don't understand, don't read it.~~ Just download pre-compiled packages from others.

The following is based on the latest OpenWRT (23.05) + Github Actions online compilation

1. Preparation

Clone the repository locally

First, you need to fork the LEDE source code: LEDE repository on [GitHub][1]

Clone the repository you just forked to your local machine:

git clone https://github.com/your-username/lede

Don't download ZIP! The ZIP file is not a Git repository and doesn't contain the .git folder, so you can't use git commands on it.

Update Feeds

cd lede
./scripts/feeds update -a
./scripts/feeds install -a

If you don't update the feeds, you won't see Luci apps later! This step is mandatory!

Enter the configuration menu

Use the following command to enter the configuration menu:

make menuconfig

Configuration menu explanation

Generally, you only need to modify these:

Target System: Processor architecture
Subtarget: Select processor
Target Profile: Preconfigured profile
LuCI: LuCI plugins
- Applications: Applications
- Themes: Themes

For example, I selected:

Target System: Mediatek-ARM
Subtarget: Filogic
Target Profile: ASR3000
LuCI: LuCI plugins
- Applications: Many fun plugins for you to explore!
- Themes: luci-theme-material

After making changes, select Save to save as a .config file.
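To double-check what actually ended up in the saved file, you can list the enabled LuCI packages. The sketch below assumes the usual `CONFIG_PACKAGE_<name>=y` line format of a generated .config; the function name `selected_packages` and the sample package name `luci-app-example` are made up for illustration.

```python
def selected_packages(config_text: str, prefix: str = "luci-") -> list[str]:
    """Return package names enabled with =y whose name starts with prefix."""
    key = "CONFIG_PACKAGE_"
    names = []
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_... is not set" and are skipped
        if line.startswith(key) and line.endswith("=y"):
            name = line[len(key):-len("=y")]
            if name.startswith(prefix):
                names.append(name)
    return names
```

Usage: `selected_packages(open(".config").read())` from the root of the source tree. Packages built as modules (`=m`) are deliberately not counted here.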

For Luci plugins, please refer to [this article on the Enshan forum][2]

Commit to your forked repository

Delete the /.config line in the .gitignore file to stop ignoring the config file. Very important!!! Otherwise, the .config file won't be included when you commit!!!
Commit changes to GitHub:

git add .
git commit -m "upd: personal config"
git push origin master
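The .gitignore edit is easy to forget, so here is a small sketch that does it for you before the commit. The function name `unignore` is hypothetical; it removes only lines that are exactly `/.config` and leaves the rest of the file untouched.

```python
from pathlib import Path


def unignore(gitignore_path: str, pattern: str = "/.config") -> bool:
    """Remove lines equal to pattern from a .gitignore file.

    Returns True if the file was changed, False if the pattern was absent.
    """
    path = Path(gitignore_path)
    lines = path.read_text().splitlines()
    kept = [line for line in lines if line.strip() != pattern]
    if len(kept) == len(lines):
        return False
    path.write_text("\n".join(kept) + "\n")
    return True
```

Run `unignore(".gitignore")` from the repository root, then `git status` should list .config as an untracked file.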

~~The branch is called master, has a master-servant flavor to it~~

Pitfalls:

Enable WIFI before compilation

If you want WIFI enabled by default for easier management: I searched many tutorials online, but they were useless, mostly from around 2015 with no reference value, so I figured it out myself. Go to the package/lean/default-settings/files/ directory, edit the file zzz-default-settings, and comment out these two lines by adding # at the beginning:

sed -i '/option disabled/d' /etc/config/wireless
sed -i '/set wireless.radio${devidx}.disabled/d' /lib/wifi/mac80211.sh

Github Actions Compilation

Online guides still say "submit a Release and it will automatically trigger Github Actions" but that didn't work for me, so I needed to make some changes:

When using Github Actions for compilation, remember to go to the Workflow page and enable the Workflow, and also enable OpenWrt-CI (because Workflows in forked repositories are disabled by default)
Also modify the repository's .github/workflows/openwrt-ci.yml, changing the cron task at the beginning (line 10) to the following to allow manual workflow triggering:
```
on:
  repository_dispatch:
  workflow_dispatch:
```
It will take about two hours, ~~but what does that have to do with me since I'm using Github's resources~~

Modifying various miscellaneous settings

Change the default theme

sed -i "s/luci-theme-bootstrap/luci-theme-material/g" feeds/luci/collections/luci/Makefile

(Nowadays people's aesthetics seem to prefer the argon theme; anyway, this should match what you installed in the `luci-themes` section of your `.config`)

Add compiler information

sed -i "s/OpenWrt /TheUnknownThing build $(TZ=UTC-8 date "+%Y.%m.%d") @ OpenWrt /g" package/lean/default-settings/files/zzz-default-settings

You probably don't want to keep "TheUnknownThing" as your builder name, so change it to something else.

Modify the default management address

The default management address is `192.168.1.1`. If it conflicts with your upstream network segment, you can change it:

sed -i 's/192.168.1.1/192.168.2.1/g' package/base-files/files/bin/config_generate

This changes it to `192.168.2.1`
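One caveat about the sed pattern: an unescaped `.` in a regular expression matches any character, so `192.168.1.1` would also match a string like `192x168y1z1`. That is harmless for this file, but if you reuse the idea elsewhere, a stricter version looks like the sketch below. The function name `replace_lan_ip` is my own; it escapes the dots and refuses to match inside a longer number such as `192.168.1.10`.

```python
import re


def replace_lan_ip(text: str, old: str = "192.168.1.1",
                   new: str = "192.168.2.1") -> str:
    """Replace old with new only where old appears as a complete address."""
    # re.escape makes the dots literal; the lookarounds reject extra digits
    pattern = re.compile(r"(?<!\d)" + re.escape(old) + r"(?!\d)")
    return pattern.sub(new, text)
```

To apply it, read package/base-files/files/bin/config_generate, pass its contents through `replace_lan_ip`, and write the result back.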



[1]: https://github.com/coolsnowwolf/lede
[2]: https://www.right.com.cn/forum/thread-3682029-1-1.html

How to Elegantly Annotate PDFs with LaTeX

Wed, 09 Oct 2024 00:00:00 GMT

Let me cut to the chase—it's getting late and I need some sleep!

The inspiration for this solution comes from Stackexchange. I've tried several annotation software options before, but I really want to bring only my iPad to class. Using VNC or xrdp to remotely access Linux just feels clunky and has terrible latency. So I've settled on using Overleaf + LaTeX with PDF page inclusion for annotations.

The Stackexchange author provided this solution:

\documentclass{article}
%\url{http://tex.stackexchange.com/q/85651/86}
\usepackage[svgnames]{xcolor}
\usepackage{pdfpages}
\usepackage{tikz}

\tikzset{
  every node/.style={
    anchor=mid west,
  }
}

\makeatletter
\pgfkeys{/form field/.code 2 args={\expandafter\global\expandafter\def\csname field@#1\expandafter\endcsname\expandafter{#2}}}

\newcommand{\place}[3][]{\node[#1] at (#2) {\csname field@#3\endcsname};}
\makeatother
\newcommand{\xmark}[1]{\node at (#1) {X};}

\begin{document}

\foreach \mykey/\myvalue in {
  ctsfn/{Defined in Week 1},
  metsp/{Defined in Week 3},
} {
  \pgfkeys{/form field={\mykey}{\myvalue}}
}

\includepdf[
  pages=1,
  picturecommand={%
    \begin{tikzpicture}[remember picture,overlay]
%%% The next lines draw a useful grid - get rid of them (comment them out) on the final version
    \draw[gray] (current page.south west) grid (current page.north east);
\foreach \k in {1,...,28} {
      \path (current page.south east) ++(-2,\k) node {\k};
}
\foreach \k in {1,...,20} {
      \path (current page.south west) ++(\k,2) node {\k};
}
%%% grid code ends here
\tikzset{every node/.append style={fill=Honeydew,font=\large}}
\place[name=ctsfn]{14cm,17cm}{ctsfn}
\place[name=metsp]{11cm,9cm}{metsp}
\draw[ultra thick,blue,->] (ctsfn) to[out=135,in=90] (9cm,17.3cm);
\draw[ultra thick,blue,->] (metsp) to[out=155,in=70] (6cm,9cm);
    \end{tikzpicture}
  }
]{tikzmark_example.pdf}

\end{document}

The original author's result:

This immediately caught my eye because:

It has a coordinate grid, making annotation placement super convenient
It's highly extensible—you can mix text and graphics, insert TikZ diagrams, mathematical formulas, you name it!

However, there were several issues to address:

The professor's Beamer slides are in landscape format, but this code produces portrait output
The macro definitions are somewhat messy, and I don't need fancy connecting lines. Plus, the includepdf call is too verbose and inelegant for repeated use
The coordinate grid looks pretty ugly

So here's how I solved these problems:

Fix the orientation: Use \usepackage[paperwidth=12cm, paperheight=16cm, landscape]{geometry} to make it landscape format.
Create a clean macro to simplify includepdf usage and support multiple annotations:

\newcommand{\includePDFWithAnnotations}[2]{
\includepdf[
  pages=#1,
  picturecommand={%
    \begin{tikzpicture}[remember picture,overlay]
    %%% The next lines draw a useful grid - get rid of them (comment them out) on the final version
    \draw[very thin, lightgray] (current page.south west) grid (current page.north east);
    \foreach \k in {0,...,11} {
      \path (current page.south east) ++(-0.55,\k + 0.2) node[font=\tiny] {\k};
    }
    \foreach \k in {0,...,14} {
      \path (current page.south west) ++(\k,0.2) node[font=\tiny] {\k};
    }
    %%% grid code ends here
    \tikzset{every node/.append style={fill=Honeydew,font=\huge}}
    % Iterate through annotation list and place annotations
    #2
    \end{tikzpicture}
  }
]{YOUR PDF NAME.pdf}
}

Use the macro elegantly to insert multiple annotations:

\includePDFWithAnnotations{1}{
\place{5, 4}{$123avd$}
\place{7, 8}{$456xyz$}
}

\includePDFWithAnnotations{7}{
\place{5, 4}{$123avd$}
\place{7, 8}{$456xyz$}
}

Improve the aesthetics: Move the coordinate grid to the page edges, use tiny font size, make the lines thinner and lighter colored. Much more visually appealing!

Isn't that satisfying?

Here's the complete TeX example:

\documentclass[UTF8]{ctexart}
\usepackage[svgnames]{xcolor}
\usepackage[paperwidth=12cm, paperheight=16cm, landscape]{geometry}
\usepackage{pdfpages}
\usepackage{tikz}
\usepackage{amsmath,amsfonts,amssymb,amsthm}

\tikzset{
  every node/.style={
    anchor=mid west,
  }
}

\makeatletter
\pgfkeys{/form field/.code 2 args={\expandafter\global\expandafter\def\csname field@#1\expandafter\endcsname\expandafter{#2}}}

\newcommand{\place}[2]{\node at (#1) {\large #2};}
\makeatother

\newcommand{\xmark}[1]{\node at (#1) {X};}

\newcommand{\NotePage}[2]{
  \includepdf[
    pages=#1,
    picturecommand={%
      \begin{tikzpicture}[remember picture,overlay]
      %%% The next lines draw a useful grid - get rid of them (comment them out) on the final version
      \draw[very thin, lightgray] (current page.south west) grid (current page.north east);
      \foreach \k in {0,...,11} {
        \path (current page.south east) ++(-0.45,\k + 0.2) node[font=\tiny] {\k};
      }
      \foreach \k in {0,...,14} {
        \path (current page.south west) ++(\k,0.2) node[font=\tiny] {\k};
      }
      \place{0,11.25}{Page #1}
      %%% grid code ends here
      \tikzset{every node/.append style={fill=Honeydew,font=\huge}}
      #2
      \end{tikzpicture}
    }
  ]{LA14.pdf}
}

\begin{document}

\NotePage{1}{
  \place{1,4.5}{That is because $\det{A} = \det{A^\top}$}
}
\NotePage{2}{}

\end{document}

Use lsyncd for Real-Time File Synchronization

Sun, 06 Oct 2024 10:56:00 GMT

Ever since I became an mjj (server hoarder), I’ve accumulated a lot of VPSs—I just can’t resist buying more. But to keep my blog data synchronized across multiple servers, I’ve put in quite a bit of effort. I got tired of using cron to automatically package my blog directory and then manually back it up. Ultimately, it’s laziness—I want a fully automated solution. Since my Typecho blog is deployed via Docker images, I figured I’d go the extra mile and tackle multi-end data synchronization, so that any change on one site is reflected on all sites.

When it comes to synchronizing files between multiple servers, rsync is a commonly used tool. It achieves efficient directory synchronization through incremental transfers, compression, and deletion operations. However, the classic working mode of rsync is “manual or scheduled trigger,” which falls short for scenarios requiring real-time synchronization.

How rsync Works

rsync compares the differences between the source and target directories and only transfers changed files or data blocks, reducing bandwidth usage. This method is ideal for backing up and synchronizing large amounts of data, especially in bandwidth-constrained environments. However, rsync usually needs to be triggered manually or via scheduled tasks (like cron). For applications that require real-time updates, this approach leads to data lag and resource waste.

The Shortcomings of rsync + inotify

To address real-time synchronization, you can use inotify to monitor file system changes and trigger rsync when changes occur. However, this approach has several obvious drawbacks:

inotify requires additional scripts to work with rsync, increasing system complexity.
This solution is usually one-way and cannot achieve multi-source real-time synchronization, which goes against my goals.

The Advantages of lsyncd

To solve the above problems, lsyncd combines inotify’s real-time monitoring with rsync’s efficient transfer capabilities, providing a simple yet powerful solution for real-time synchronization. The advantages of lsyncd include:

lsyncd can handle complex real-time synchronization tasks with a simple configuration file, eliminating the need for extra scripts.
It supports one-way synchronization between multiple servers, ensuring that data on every server is up to date. Note: lsyncd does not natively support true bidirectional or multi-master sync with conflict resolution. But it doesn't matter in my case because I only need one-way sync.

Step-by-Step Guide to Configuring lsyncd

Here’s how to use lsyncd for real-time synchronization:

Install lsyncd and rsync:

On all servers involved in synchronization, run the following command to install the necessary tools:

sudo apt-get install lsyncd rsync

Configure lsyncd:

On each server, create the configuration file /etc/lsyncd.conf with the following content:

settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status",
    inotifyMode  = "CloseWrite or Modify",
    maxProcesses = 1,
    -- nodaemon = true,
}

sync {
    default.rsyncssh,
    source = "/var/www",
    targetdir = "/var/www",
    host = "45.*.*.*",
    delete = true,
    rsync = {
        binary = "/usr/bin/rsync",
        archive = true,
        compress = true,
        verbose = true,
    },
    delay = 1,
}

Explanation:

source: The local directory to monitor, /var/www/ (replace with your own).
host: The remote target server (excluding itself).
targetdir: The remote target directory, /var/www/ (replace with your own).
delay: Sets the synchronization delay (in seconds) to prevent excessive syncing during frequent changes.
delete: Deletes files on the target server that have been deleted on the source server.

Note: When using rsyncssh, maxProcesses must be 1. If using rsync, you can set a higher value (e.g., 5).
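With more than two servers, each machine needs one sync block per peer, and typing those out is error-prone. As a rough sketch (assuming every peer uses the same source and target directories as the configuration above), the blocks can be generated. The name `render_sync_blocks` is hypothetical, and the addresses in the usage line are documentation-range examples rather than real hosts.

```python
SYNC_TEMPLATE = """sync {{
    default.rsyncssh,
    source = "{source}",
    targetdir = "{targetdir}",
    host = "{host}",
    delete = true,
    rsync = {{
        binary = "/usr/bin/rsync",
        archive = true,
        compress = true,
        verbose = true,
    }},
    delay = 1,
}}
"""


def render_sync_blocks(hosts, source="/var/www", targetdir="/var/www"):
    """Return one lsyncd sync block per remote host, separated by blank lines."""
    return "\n".join(
        SYNC_TEMPLATE.format(source=source, targetdir=targetdir, host=host)
        for host in hosts
    )
```

For example, `print(render_sync_blocks(["203.0.113.10", "203.0.113.11"]))` prints two blocks that can be pasted below the settings section. Remember that the host list on each server must leave out that server's own address.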

Tip: When troubleshooting, first run lsyncd /etc/lsyncd.conf by hand and check the log for errors (uncommenting nodaemon = true in the settings block keeps it in the foreground). Also, make sure to create the log directory first: mkdir -p /var/log/lsyncd.

One more thing: lsyncd can only sync unattended if the source server can log in to the target server over SSH without a password, so you need to set up SSH key-based authentication.

On the source server, generate an SSH key:

ssh-keygen -t ed25519

Follow the prompts; usually, you don’t set a passphrase.

Copy the public key to the target server:

ssh-copy-id user@target-server

This copies the generated public key to the target server, enabling passwordless login. Note: You need to configure this on both servers if you want mutual access.

Once configured, start verification.

Start lsyncd:

lsyncd /etc/lsyncd.conf

Verify the Configuration:

Perform file operations in the /var/www/ directory on any server and check synchronization on the others.

Handling Conflicts

When multiple servers modify the same file at the same time, conflicts may occur. However, my use case probably won’t encounter conflicts, so I’m leaving it as is :D

Disabling Adobe Acrobat's OCR Feature

Wed, 11 Sep 2024 19:11:00 GMT

Acrobat’s OCR really annoys me. Every time I edit a PDF, it freezes for a moment—I have to wait for the current page’s OCR to finish before I can turn off automatic text recognition. So now, I’m just going to disable it once and for all. Go to this directory:

C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\plug_ins

Do you see "PaperCapture" there? Just rename it to "PaperCapture_disabled" and you’re done.
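If you want to flip this back and forth (for the rare case where you do need OCR), the rename can be wrapped in a tiny helper. This is only a sketch: `set_paper_capture` is a made-up name, it assumes the two names used above, and on a real system it needs administrator rights and Acrobat closed.

```python
from pathlib import Path


def set_paper_capture(plug_ins_dir: str, enabled: bool) -> Path:
    """Rename PaperCapture <-> PaperCapture_disabled inside plug_ins_dir.

    Returns the path the plugin should have after the call. Does nothing
    if the plugin is already in the requested state.
    """
    base = Path(plug_ins_dir)
    on = base / "PaperCapture"
    off = base / "PaperCapture_disabled"
    src, dst = (off, on) if enabled else (on, off)
    if src.exists():
        src.rename(dst)
    return dst
```

Call `set_paper_capture(r"C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\plug_ins", enabled=False)` to disable it, and pass `enabled=True` to restore it.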

Founder BookMaker (方正书版) 10.0: From Installation to Getting Started

Sat, 16 Mar 2024 21:05:00 GMT

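I spent the whole afternoon today and finally got Founder BookMaker 10.0 working.

1. Install PDF Creator

Founder PDFCreator 3.0

Important: be sure to install the font libraries and PDF Creator right after the operating system is set up (I was skeptical of this myself, but it really does spare you interference from all sorts of stray fonts, or from font libraries you installed from somewhere else).

The installation order is as follows:

Install PDFCreator3108;
Copy the crack files over the installation directory, C:\ProgramFiles\Founder\PDFCreator\Bin
Import the registry file 〖edit the drive letter and path to match where you installed〗; (Import this registry file only if you have no RIP software installed. Its purpose is to "trick" the system into thinking RIP software is present.)
First install the CID5.01 (748_GB) font library, "Founder CID V5.00 〖full set〗". Installation serial number: 000000000; installation password: 42C2D35B4735036B; font password: 5918347506891A57 (the same for GBK and GB/748!). Then install the CID5.0 (GBK) font library. Installation serial number: 000000000; installation password: ce9d84241294e529; font password: 2e4965af7e74ad68. When installing the font libraries, choose "Founder Century RIP" (方正世纪RIP) (I was never prompted with this option, but it didn't matter; the installation still went through);
The font library path is C:\ProgramFiles\Founder\PDFCreator\Font; at this point a fonts directory, C:\Program Files\Founder\PDFCreator\Fonts, is created; (Or it may not be created, in which case another file has to help out! Installing the two font libraries usually creates a FONTS folder, but on many machines it simply doesn't appear and some simple back-end fonts are not recognized, so you borrow the fonts from PSPNT's FONTS folder to fill the gap.)
Open PDFCreator 3108;
Reset the fonts: the font path is C:\ProgramFiles\Founder\PDFCreator\Resource\CIDfont; the TrueType font path is C:\WINDOWS\Fonts
While PDFCreator is resetting the font library, do not click or touch it at all; otherwise it will hang and you will have to reboot.
On my machine it reported more than 1,100 fonts installed successfully, and in the end PDF output worked normally.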

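2. Install BookMaker 10.0

Not much to say here. After installation, copy the modified files into the installation directory.

Note: at this step, install only BookMaker 10.0 and the Nüwa character-supplement tool (女娲补字). Don't install anything else, including any of the bundled fonts, to avoid problems later.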

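3. PS Output Settings

When exporting PS/EPS from an FBD file, open "Options" in the lower-left corner. I took the blunt approach and checked "All installed" for both the back-end 748 font library and the back-end GBK font library. In my testing, as long as PDFCreator is configured properly, PDF output then works normally. Before exporting, it's worth grabbing any known-good PS file from the internet and trying it, so you don't blame PDFCreator for a problem that is really in your own PS file.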

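4. Some Issues When Converting Word Files to FBD Sample Files (小样)

The doc conversion built into BookMaker 10.0 is known to be broken; don't use it.
When you use the converter developed by an expert online, the generated sample file has a few problems, summarized here:

If you are using Word-to-FBD version 6.0, leave all MathType formulas in the final doc file exactly as they are; don't follow the online tutorials that tell you to convert them. If you are using version 5.6, you do need to follow those tutorials.

After MathType conversion, math symbols such as sin, cos, π, and ln become italic and must be corrected by hand. I wrote a regular expression for VSCode that you can use as a reference (Ctrl+F find and replace, with regular expressions enabled). (The replacement may look garbled here, but once pasted in it is the circled z.)

Find field: (cos|sin|tan|π|lim|ln|i)
Replace field: $1

Likewise, where a radical sign is stuck to the following character, add a circled 1/2 before 〖KF(〗; this can also be done with find and replace.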
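If you would rather run the first replacement outside the editor, the same regular expression works in Python. This is only a sketch: the real upright marker is BookMaker's circled z, which does not survive copy and paste here, so `UPRIGHT_MARK` below is a visible stand-in that you must replace, and putting the marker before the match is my assumption. The name `mark_upright` is hypothetical.

```python
import re

# Stand-in for BookMaker's circled-z marker; replace with the real character
UPRIGHT_MARK = "<Z>"

# Same alternation as the VSCode search pattern above
FUNCTION_NAMES = re.compile(r"(cos|sin|tan|π|lim|ln|i)")


def mark_upright(text: str, mark: str = UPRIGHT_MARK) -> str:
    """Prefix every matched function name or symbol with the upright marker."""
    # A callable replacement avoids escape issues if mark contains backslashes
    return FUNCTION_NAMES.sub(lambda m: mark + m.group(1), text)
```

Note that the bare `i` alternative matches every letter i that is not part of sin or lim, so review the output before typesetting.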

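For laying out multiple-choice options, I wrote a small Python program. You only need to set up the WB for the first multiple-choice question, and it does the following:

Adds 〖ZK(〗 and 〖ZK)〗 at the beginning and end (ZK plus a line break symbol)
Replaces the paragraph-break symbols (except the first one) with 〖DW1〗 through 〖DW3〗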

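import pyperclip

# The FBD paragraph-break symbol. It may not display correctly here,
# so paste the actual character from your sample file between the quotes.
PARAGRAPH_MARK = ''

def transform_text(input_text, mark=PARAGRAPH_MARK):
    if not mark:
        raise ValueError('Set PARAGRAPH_MARK to the FBD paragraph-break symbol first')

    # Wrap the text in the opening and closing ZK markers
    transformed_text = '〖ZK(〗' + input_text + '〖ZK)〗'

    # Replace the marks with 〖DW1〗, 〖DW2〗, ..., skipping the first one
    parts = transformed_text.split(mark)
    transformed_text = mark.join(parts[:2])  # keep the first mark
    for i, part in enumerate(parts[2:], 1):  # number the marks from the second one on
        transformed_text += '〖DW' + str(i) + '〗' + part

    return transformed_text

# Read the original text from the clipboard
original_text = pyperclip.paste()

transformed_text = transform_text(original_text)

# Print the result and copy it back to the clipboard
print(transformed_text)
pyperclip.copy(transformed_text)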

With that, enjoy the start of your typesetting career!

Starting to Write Again

Sat, 09 Mar 2024 22:44:00 GMT

It probably began with a sudden inspiration during the winter break, preparing to take care of my blog again.

The last time I seriously ran a blog was perhaps in middle school. Back then, I thought being a blogger was cool, and having your own space for thoughts on the internet was great and trendy. I guess it's not like that anymore?

CNBlogs urges everyone to turn off ad-blocking plugins, the atmosphere on CSDN in China keeps getting worse, and Bloggers have turned into Vloggers—why am I starting to take care of my blog again at this time?

I didn't expect to encounter so many difficulties deploying Typecho... My previous setup was Nginx+MySQL+BaoTa, so you can understand that BaoTa had taken care of everything; all I needed to do was click the deployment button and it was ready to go.

So maybe the first step toward becoming a True Blogger is to set everything up myself.

I dropped Nginx for Caddy2 (isn't this just asking for trouble? I couldn't find any working URL rewrite configurations online after searching for days!) Of course, I thought about giving up (php-fpm configuration was fine, mysql configuration was fine, Caddy rewrite was fine and the homepage was accessible, but articles were unreadable? Login page worked but couldn't log in?) In the end, the almighty Docker solved the problem, and I think it's worth pasting the setup here for backup:

docker run -d \
--name=typecho-blog \
--restart always \
--mount type=tmpfs,destination=/tmp \
-v /root/Typecho-Files:/data \
-e PHP_TZ=Asia/Shanghai \
-e PHP_MAX_EXECUTION_TIME=600 \
-p 127.0.0.1:9080:80 \
80x86/typecho:latest

Here I published the container port only on 127.0.0.1 instead of exposing it publicly, because I planned to use Caddy as a reverse proxy. When using Caddy as a reverse proxy, be aware of these two pitfalls I encountered. Initially I could only access the homepage, and clicking on any content would redirect me to localhost:9080, which was inaccessible. It turns out I hadn't properly set X-Forwarded-Proto and X-Forwarded-Port:

Check Typecho's config.inc.php file to ensure that TYPECHO_SITE_URL is set to your public domain.
In the Caddy configuration, make sure to set the correct X-Forwarded-For and X-Forwarded-Proto headers so Typecho knows the actual request protocol and client IP.

Your Caddy configuration should look like this:

YOUR_DOMAIN_GOES_HERE {
  reverse_proxy http://localhost:9080 {
    header_up Host {host}
    header_up X-Forwarded-Host {host}
    header_up X-Forwarded-For {remote_host}
    header_up X-Forwarded-Proto {scheme}
  }
  tls YOUR_EMAIL_GOES_HERE
}

Great, I finally have my own blog again. Hope I can write more in the future.
