A User-space Resource Scheduler
To achieve higher scalability and efficiency, most existing OS schedulers use a distributed run-queue model, in which the scheduler maintains one run queue per core. The scheduler applies a thread-independent scheduling policy: threads are scheduled independently, regardless of application type or inter-thread dependencies. Periodically, the scheduler balances load across cores; on NUMA systems, this balancing spans all NUMA nodes. When data transfer applications run on multicore systems, dynamic load balancing may cause frequent thread migration or high-latency inter-node communication, both of which can significantly degrade overall data transfer performance.

Furthermore, I/O devices (e.g., NICs and storage) on NUMA systems are attached to processor sockets in a NUMA manner. This creates NUMA effects both for transfers between I/O devices and memory banks and for CPU access to I/O devices. Investigations show that I/O throughput can be significantly improved if applications are placed on cores near the I/O devices they use (i.e., I/O locality). However, existing OSes provide very limited support for such I/O locality: processes and threads may end up scheduled on cores distant from the I/O devices they use, leading to high-latency inter-node I/O operations and extra communication overhead. Bulk data transfers involve heavy network and disk I/O; under default OS scheduling, they can suffer significant inter-node I/O operations that severely degrade overall data transfer performance.
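The I/O locality problem can be made concrete with a small sketch: on Linux, the NUMA node of a PCI device is exposed through sysfs, so an application can discover which node its NIC sits on. This is a minimal illustration of the underlying mechanism, not MDTM's actual code; the interface name is a placeholder supplied by the caller.

```python
def parse_numa_node(text):
    """Parse the contents of a sysfs numa_node file; -1 means no NUMA affinity."""
    node = int(text.strip())
    return node if node >= 0 else None

def device_numa_node(ifname):
    """Return the NUMA node of a network interface, or None if unknown.

    Reads /sys/class/net/<ifname>/device/numa_node (Linux sysfs layout).
    """
    try:
        with open(f"/sys/class/net/{ifname}/device/numa_node") as f:
            return parse_numa_node(f.read())
    except OSError:
        return None
```

A scheduler that knows the device's node can then restrict a transfer thread's CPU mask to that node's cores, avoiding inter-node I/O paths.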
MDTM middleware has been developed to address these problems. It is a user-space resource scheduler that harnesses multicore parallelism to scale data movement toolkits on multicore systems.
MDTM middleware is implemented as a system daemon. Periodically, the daemon collects, monitors, and caches information about the multicore system's physical layout (e.g., NUMA topology), configuration, and load. Using this information, MDTM middleware provides query and scheduling services to the data transfer tool, mdtmFTP.
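As a rough sketch of the kind of per-core load sampling such a daemon might perform, the snippet below reads the standard Linux `/proc/stat` per-CPU counters and derives a busy fraction between two samples. The function names and the sampling policy are illustrative assumptions, not MDTM's implementation.

```python
def parse_cpu_line(line):
    """Parse one 'cpuN ...' line from /proc/stat into (name, busy, total) jiffies."""
    fields = line.split()
    name, values = fields[0], [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq ...
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    total = sum(values)
    return name, total - idle, total

def busy_fraction(prev, curr):
    """Busy fraction of a core between two (busy, total) samples."""
    d_total = curr[1] - prev[1]
    return (curr[0] - prev[0]) / d_total if d_total else 0.0
```

A daemon would sample each `cpuN` line on a timer and cache the resulting per-core busy fractions for scheduling queries.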
- NUMA system topology profiling
- Online system status monitoring
  - CPU load per core
  - memory load latency per NUMA node

  This feature allows applications to use system memory intelligently and avoid memory hotspots.
- NUMA topology-based resource scheduling
- Support for core affinity on network and disk I/O
mdtmFTP generates three types of threads: network I/O threads, disk I/O threads, and management threads. The MDTM middleware schedules disk and network I/O threads on cores close to the I/O devices they use (I/O locality).
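A minimal sketch of this placement policy, using the Linux affinity API exposed through Python's `os` module. The node-to-core mapping here is a hypothetical example; a real scheduler would build it from the NUMA topology it has profiled.

```python
import os

# Hypothetical topology: cores per NUMA node, as the middleware might have profiled it.
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def cores_near(node):
    """Cores local to a NUMA node; falls back to all known cores."""
    all_cpus = set().union(*NODE_CPUS.values())
    return NODE_CPUS.get(node, all_cpus)

def pin_to_node(node):
    """Pin the calling process/thread to the cores of one NUMA node (Linux only)."""
    os.sched_setaffinity(0, cores_near(node))  # pid 0 = the caller
```

An I/O thread serving a NIC on node 1 would call `pin_to_node(1)`, keeping its network traffic within that node.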
- System zoning
The MDTM middleware partitions system cores into two zones -- an MDTM-zone and a Non-MDTM-zone. A data transfer application runs in the MDTM-zone, while other applications are confined to the Non-MDTM-zone. This strategy minimizes other applications' interference with the data transfer application, resulting in optimal data transfer performance.
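A sketch of how such a two-zone partition might be expressed. The split fraction is an illustrative parameter, not MDTM's actual policy; enforcement on Linux would typically go through CPU affinity masks or cgroup cpusets.

```python
def partition_zones(cpus, mdtm_share=0.75):
    """Split core IDs into an MDTM-zone and a Non-MDTM-zone.

    mdtm_share is the fraction of cores reserved for data transfer;
    when there are at least two cores, each zone keeps at least one.
    """
    cpus = sorted(cpus)
    cut = max(1, min(len(cpus) - 1, round(len(cpus) * mdtm_share)))
    return cpus[:cut], cpus[cut:]
```

On an 8-core machine with the default share, six cores would be reserved for the transfer application and two left for everything else.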
MDTM middleware was designed to support mdtmFTP, but it can be readily extended to support other types of applications, or used to study advanced scheduling algorithms and policies on NUMA systems.