Chip Multithreading Systems Need a New Operating System Scheduler
Authors: Alexandra Fedorova, Christopher Small, Daniel Nussbaum, and Margo Seltzer (Harvard University, Sun Microsystems)
Published in: USENIX 2005
Question–Answer Form
1. What is your take-away message from this paper?
At the time the paper was written, OS schedulers did not take advantage of hardware multithreading, which cost performance, especially for OLTP workloads whose threads contend for CPU resources. The paper's metric, instruction delay latency, gives useful insight for benchmarking CPU utilization, but it does not take into account other shared CPU resources such as caches.
2. What is the motivation for this work?
- What is the people problem and the technical problem?
- Modern server applications (web services, online transaction processing systems) exhibit poor utilization of the CPU pipeline.
- How is it distilled into a research question?
- CMP (chip multiprocessing) and hardware multithreading (MT) were designed to improve processor utilization for OLTP-like workloads, but scheduler policies did not take advantage of the new CPU architectures.
- Why doesn't the people problem have a trivial solution?
- An OLTP workload may involve hundreds of threads; on a CMT system with 16 hardware contexts there are on the order of 10^27 co-schedule combinations to evaluate, hence the need for different designs
- Modeling resource contention is a hard problem; good prediction is difficult to achieve
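The scale of that search space is easy to sanity-check. A minimal sketch under one assumed counting model (choose 16 of 100 runnable threads, then split them into four unordered groups of four, one per core; the constants and the model are illustrative, and the paper's own 10^27 figure may count differently):

```python
from math import comb, factorial

def coschedule_count(threads=100, cores=4, contexts_per_core=4):
    """Count ways to choose and place threads on hardware contexts.

    Assumed model: pick which threads run, then partition them into
    `cores` unlabeled groups of `contexts_per_core` threads each.
    """
    total = cores * contexts_per_core
    choose = comb(threads, total)  # which 16 threads get a context
    # ways to split the chosen threads into equal-sized unlabeled groups
    partition = factorial(total) // (factorial(contexts_per_core) ** cores
                                     * factorial(cores))
    return choose * partition

print(f"{coschedule_count():.2e}")  # on the order of 10^24: far too many to enumerate
```

Whatever the exact combinatorics, the count is astronomical, so any scheduler that exhaustively evaluates combinations cannot scale.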
- What are the previous solutions, and why are they inadequate?
- Previous solutions targeted single-processor MT systems and yielded up to a 17% improvement, but they exhaustively evaluated thread combinations and were not designed for CMP, so they do not scale to CMT systems.
3. What is the proposed solution (hypothesis, idea, design)?
- Why is it believed this solution will work?
- CPI (cycles per instruction) is used as a heuristic to characterize a workload
- The experiments show that a mix of long-latency and short-latency instructions yields the best CPU utilization, because the processor can interleave execution from different threads
- How does it represent an improvement?
- A specialized scheduler for CMT systems has the potential for a much greater gain: it can improve application performance by as much as a factor of two over a naïve scheduler
- A naïve scheduler can severely hurt performance, potentially making a multithreaded processor perform worse than a single-threaded one; by preventing this pathology, the new design delivers significant throughput improvements
- How is the solution achieved?
- The paper shows that the current scheduler designs are not good enough and identifies several key factors for future work to consider in new scheduler designs
- Considerations for scheduler designs:
- resource contention
- metrics such as CPI that can help benchmark and evaluate the effectiveness of a design
- co-scheduling heuristics appear to be a good policy to take into account
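A concrete (hypothetical) rendering of that co-scheduling heuristic: pair the most memory-bound thread (highest CPI) with the most compute-bound one (lowest CPI), so one thread's stall cycles can absorb the other's demand for functional units. This is an illustrative sketch, not the paper's algorithm; the thread names and CPI values are invented.

```python
def pair_by_cpi(threads):
    """Greedy co-schedule: repeatedly pair the lowest-CPI (compute-bound)
    thread with the highest-CPI (memory-bound) one.

    `threads` is a list of (name, cpi) tuples; returns the pairs plus any
    leftover thread when the count is odd.
    """
    ordered = sorted(threads, key=lambda t: t[1])  # ascending CPI
    pairs = []
    while len(ordered) >= 2:
        cpu_bound = ordered.pop(0)    # lowest CPI remaining
        mem_bound = ordered.pop(-1)   # highest CPI remaining
        pairs.append((cpu_bound, mem_bound))
    return pairs, ordered

pairs, leftover = pair_by_cpi([("A", 1.1), ("B", 5.0), ("C", 1.4), ("D", 3.2)])
# pairs A (CPI 1.1) with B (CPI 5.0), then C with D; nothing left over
```

A real scheduler would also have to cope with CPI changing over a thread's lifetime, which is one of the open questions the authors raise.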
4. What is the author's evaluation of the solution?
- Logic:
- focus on the processor pipeline, because it is the source of contention
- pipeline utilization depends on the latencies of the instructions; instructions with long delay latencies (e.g., memory loads) leave functional units unused
- mixing long- and short-latency instructions can keep the pipeline busy at all times
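This logic can be checked with a toy cycle simulation of a single-issue multithreaded core. A sketch under assumed parameters (delay latency 0 for ALU instructions and 4 for loads that hit L1, matching the paper's examples; one issue slot, issuing from the first ready context each cycle):

```python
def simulate(delays, cycles=1000):
    """Toy single-issue MT pipeline. Each thread is an endless stream of
    identical instructions with the given delay latency; each cycle the
    core issues from the first thread that is not stalled. Returns
    pipeline utilization (fraction of cycles in which something issued).
    """
    ready_at = [0] * len(delays)  # cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        for i, delay in enumerate(delays):
            if ready_at[i] <= cycle:
                issued += 1
                ready_at[i] = cycle + 1 + delay  # stall for `delay` cycles
                break
    return issued / cycles

print(simulate([4]))           # one memory-bound thread alone: 0.2
print(simulate([4, 4, 4, 4]))  # four memory-bound threads: 0.8 (4x better)
print(simulate([4, 0]))        # memory-bound + compute-bound mix: 1.0
```

The mixed schedule reaches full utilization because the ALU thread fills every cycle the memory-bound thread spends stalled.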
- Experiments:
- Scenario: compare performance against a conventional single-threaded core and a traditional multiprocessor
- CMT system simulator toolkit based on the multithreaded architecture proposed by Laudon et al.
- RISC pipeline with one set of functional units; 4 hardware contexts per CPU core; each core has a shared TLB and L1 data and instruction caches; the L2 cache is shared by all CPU cores on the chip
- Alternative: SMT (simultaneous multithreading) systems, where each core has multiple sets of functional units
- Workloads:
- CPU-bound workload: 4 threads that execute only ALU instructions
- Memory-bound workload: instructions with a delay latency of 4 cycles (loads that hit in the L1 cache)
- Configurations:
- A: single-threaded processor
- B: multithreaded processor with four hardware contexts (but only 1 functional unit)
- C: 4-way multiprocessor
- Results:
- CPU-bound workload:
- A ≈ B
- C = 4× A (or B)
- Memory-bound workload:
- B ≈ C
- B and C = 4× A
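These results also fall out of a one-line analytic model. A sketch with assumed parameters (one issue slot per core; delay latency 0 for ALU instructions and 4 cycles for L1 load hits, as in the paper's examples):

```python
def throughput(delay, contexts, cores=1):
    """Instructions per cycle for a toy machine: each core has one issue
    slot; a thread stalls `delay` cycles after each instruction, so it can
    use at most 1/(1+delay) of the slot. Extra hardware contexts fill the
    idle cycles, capped at one instruction per core per cycle.
    """
    return cores * min(1.0, contexts / (1 + delay))

# CPU-bound workload (ALU only, delay 0): A ~ B, C = 4x
assert throughput(0, contexts=1) == throughput(0, contexts=4) == 1.0
assert throughput(0, contexts=1, cores=4) == 4.0

# Memory-bound workload (L1 load hits, delay 4): B ~ C = 4x A
assert throughput(4, contexts=1) == 0.2
assert throughput(4, contexts=4) == throughput(4, contexts=1, cores=4) == 0.8
```

In other words: extra contexts are worthless when every thread already saturates the issue slot (CPU-bound), and worth as much as extra cores when threads stall most of the time (memory-bound).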
5. What is your analysis of the identified problem, idea, and evaluation?
- Today's schedulers have been improved to take advantage of CMP and MT architectures:
- Linux has made its CFS scheduler SMT-aware, but it is still lacking
- Windows 10/11 and macOS have topology-aware scheduling
6. What are the paper's contributions?
- Author's view:
- Addressing performance loss
- Demonstrating current scalability failures
- Suggesting metrics for future experiments
- Proving significant potential performance gains
- Heuristic strategies for scheduler design
7. What are future directions for this research?
- Author's suggestions:
- Techniques for inferring single-threaded CPI, given CMT CPI
- Determining the effects of cache contention on the throughput of co-scheduled threads.
- Investigating other workload characteristics, e.g., static instruction mix, to improve scheduling decisions.
- Studying the nature and dynamics of CPIs exhibited by real workloads to understand whether this is a viable metric to be used for scheduling real applications
- Investigating ways to integrate these ideas with other scheduling policies
- Testing our scheduling ideas on real workloads.
8. What questions are you left with?
- Q1: How do current schedulers take advantage of new hardware?
- Q2: How does Windows use processor groups with core-topology information?
- Q3: How is Apple Silicon (ARM-based, without SMT support) still so performant?
NOTES
frequent branches and control transfers, can result in processor pipeline utilization as low as 19%
common workload for web services result in poor CPU utilization
MT-savvy operating system scheduler could improve application performance by a factor of two
a scheduler that utilizes multithreading can improve performance
application servers, web services, and on-line transaction processing systems, are notorious for poor utilization of CPU pipeline
example for common workloads
short stretches of integer operations, with frequent dynamic branches. This negatively affects cache locality and branch prediction and causes frequent processor stalls
explanation for poor CPU performance
do little for transaction processing-like workloads.
CPU not designed for common workload (works best for scientific applications)
improve processor utilization for transaction-processing-like workloads by offering better support for thread-level parallelism (TLP)
How MT improves performance
A CMP processor includes multiple processor cores on a single chip, which allows more than one thread to be active at a time and improves utilization of chip resource
CMP processor
An MT processor has multiple sets of registers and other thread state and interleaves execution of instructions from different threads
MT processor
Hardware vendors are proposing architectures that combine CMP and MT. We will refer to such systems as chip multithreading (CMT) systems
combine both MT and CMP called CMT
Our experiments have shown that the potential for performance gain from a specialized scheduler on CMT systems is even greater, and can be as large as a factor of two.
a new scheduler is needed to utilize the new hardware
Scheduling on single-processor MT systems has been studied before [10-12]. The scheduling algorithms for single-processor MT systems discussed in the literature worked as follows: they ran all combinations of threads that could be co-scheduled, determined which combination(s) yielded the best performance,
related works solution
An OLTP workload may involve a hundred threads; on a CMT system with 16 hardware contexts, there are 10^27 combinations to evaluate.
not a trivial problem
minimizes resource contention and maximizes system throughput,
ideas for new CMT scheduler
The scheduler must understand how its scheduling decisions will affect resource contention, because resource contention ultimately determines performance
design choices for the new scheduler, what the scheduler must be aware of
Our proposal involves building a scheduler that would model relative resource contention resulting from different potential schedules
design proposal for the new scheduler
Our toolkit models systems with multiple multithreaded CPU cores
experiments are designed around the new hardware architecture, where each CPU core has multiple hardware threads
An MT core has multiple hardware contexts (usually one, two, four or eight), where each context consists of a set of registers and other thread state.
switching between contexts on each cycle. A thread may become blocked when it encounters a long latency operation, such as servicing a cache miss
Hide cache latency by switching to other threads, while the blocked threads are waiting for data from memory
This latency-hiding property is at the heart of hardware multithreading
SMT systems are more complex and require more chip real estate. We have instead taken the approach of leveraging a simple, classical RISC core in order to allow space for more cores on each chip. This allows for a higher degree of multithreading in the system, resulting in higher throughput for multithreaded and multiprogrammed workloads.
why the authors chose simple RISC cores over the SMT alternative
When assigning threads to hardware contexts, the scheduler has to decide which threads should be run on the same processor, and which threads should be run separately
OS scheduler policy for CPU optimization. Optimal thread assignment results in high utilization of the CPU
If we are to design a scheduler that can find good thread assignments, we must understand the causes and effects of contention among the threads that share a processor.
Contention seems to be the heart of the problem
The key to understanding why this is the case is the concept of instruction delay latency
When a thread performs a long-latency operation, it is blocked; subsequent instructions to be issued by that thread are delayed until the operation completes. We term the duration of this delay the instruction delay latency. ALU instructions have 0 delay latency. A load that hits in the L1 cache has a latency of four cycles. A branch delays the subsequent instruction by two cycles
instruction delay latency is an important unit to understand the impact of CPU instructions
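Those per-instruction numbers can be rolled up into a single workload metric. A minimal sketch (the latency table follows the paper's examples; the instruction mixes are hypothetical):

```python
# Delay latencies from the paper's examples: the number of cycles the
# thread's *next* instruction is held up.
DELAY_LATENCY = {
    "alu": 0,      # ALU instructions: no delay
    "l1_load": 4,  # load that hits in the L1 cache
    "branch": 2,   # branch delays the subsequent instruction by two cycles
}

def avg_delay_latency(mix):
    """Average instruction delay latency of a workload, given its
    instruction mix as {kind: fraction} with fractions summing to 1."""
    return sum(frac * DELAY_LATENCY[kind] for kind, frac in mix.items())

compute_bound = {"alu": 1.0}                             # average delay 0
oltp_like = {"alu": 0.6, "l1_load": 0.2, "branch": 0.2}  # hypothetical mix

# A high average delay latency marks a thread that will leave the pipeline
# idle -- a good candidate to co-schedule with a compute-bound thread.
```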
Processor pipeline contention depends on the latencies of the instructions that the workload executes
memory loads -> high latency. ALU operations -> no latency
A, a single-threaded processor; B, a multithreaded processor with four hardware contexts; and C, a four-way multiprocessor
comparison profiles
A and B will perform comparably; C will have a throughput four times greater than the other two systems, because it has four times as many functional units.
B has four hardware contexts but only a single functional unit, which leads to resource contention under the CPU-bound workload -> A and B perform the same
When running the memory-bound workload, System B and System C will perform comparably, outperforming System A by a factor of four.
B and C perform comparably because multiple hardware contexts hide memory latency
This simple experiment demonstrates that the instruction mix, and, more precisely, the average instruction delay latency, can be used as a heuristic for approximating the processor pipeline requirements for a workload.
the proposed metric provides useful information for the scheduler to make scheduling decisions.
A thread with an instruction mix dominated by long-latency instructions can leave functional units underutilized. Therefore, it is logical to co-schedule it with a thread that is running a lot of short-latency instructions and has high demand for functional units
scheduling strategy that optimizes instruction delay latency
this technique does not take into account effects of cache contention that surface when threads with large working sets are running on the same processor.
limitations of the proposed model
envision some limitations of this approach. CPI does not give precise information on the types of instructions that the workload is executing.
limitations
CMT systems need new schedulers: a naïve scheduler may squander up to half of available application performance, and existing SMT scheduling algorithms do not scale to dozens of threads.