Using multirail networks in high-performance clusters (2001)

Abstract
Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault tolerance of current high-performance parallel computers. In this paper we present and analyze various algorithms to allocate multiple communication rails, including static and dynamic allocation schemes. An analytical lower bound on the number of rails required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various algorithms in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load, and allocation scheme. The compared methods include a static rail allocation, a basic round-robin rail allocation, a local-dynamic allocation based on local knowledge, and a dynamic rail allocation that reserves both communication endpoints of a message before sending it. The last method is shown to perform better than the others at higher loads: up to 49% better than the local-knowledge allocation and 37% better than the round-robin allocation. This allocation scheme also shows lower latency, and it saturates at higher loads (for sufficiently long messages). Most importantly, the proposed allocation scheme scales well with the number of rails and with message size. In addition, we propose a hybrid algorithm that combines the benefits of the local-dynamic algorithm for short messages with those of the dynamic algorithm for large messages.
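
Illustrative sketch (not from the paper): the four allocation schemes named in the abstract can be contrasted with a few lines of Python. The abstract does not give the algorithms' details, so everything here is an assumption made for illustration: the class and method names are hypothetical, the static mapping (sender id modulo rail count) is an arbitrary choice, "local knowledge" is modelled as bytes the sender itself has queued on each rail, and the endpoint reservation is done by a single in-memory table rather than by any network protocol.

```python
class StaticAllocator:
    """Static: each sender is bound to one rail for the whole run.
    The mapping (src modulo rail count) is an assumption, not the paper's."""
    def __init__(self, num_rails):
        self.num_rails = num_rails

    def pick(self, src, dst):
        return src % self.num_rails


class RoundRobinAllocator:
    """Round-robin: each sender cycles through the rails, one per message."""
    def __init__(self, num_rails):
        self.num_rails = num_rails
        self.next_rail = {}  # src -> rail to use for its next message

    def pick(self, src, dst):
        rail = self.next_rail.get(src, 0)
        self.next_rail[src] = (rail + 1) % self.num_rails
        return rail


class LocalDynamicAllocator:
    """Local-dynamic: choose the rail with the least traffic pending at the
    sender, ignoring the state of the receiver (assumed load metric: bytes)."""
    def __init__(self, num_rails):
        self.num_rails = num_rails
        self.pending = {}  # src -> pending bytes per rail

    def pick(self, src, dst, size=1):
        load = self.pending.setdefault(src, [0] * self.num_rails)
        rail = min(range(self.num_rails), key=load.__getitem__)
        load[rail] += size
        return rail

    def complete(self, src, rail, size=1):
        self.pending[src][rail] -= size


class DynamicAllocator:
    """Dynamic: reserve a rail that is free at BOTH endpoints before sending.
    A central table stands in for whatever reservation mechanism is used."""
    def __init__(self, num_rails):
        self.num_rails = num_rails
        self.busy = set()  # (node, rail) pairs currently reserved

    def reserve(self, src, dst):
        for rail in range(self.num_rails):
            if (src, rail) not in self.busy and (dst, rail) not in self.busy:
                self.busy.update({(src, rail), (dst, rail)})
                return rail
        return None  # no rail is free at both ends; the caller must retry

    def release(self, src, dst, rail):
        self.busy.discard((src, rail))
        self.busy.discard((dst, rail))
```

In this toy model the dynamic scheme is the only one that can refuse a request: it returns None when no rail is free at both endpoints, whereas the other three always return a rail and leave any contention to the network.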

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.3338
Source http://www.cs.huji.ac.il/~etcs/papers/practice02.pdf
Publisher IEEE Computer Society
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Keywords Communication Protocols, High-Performance Interconnection Networks, Performance Evaluation, Routing, Communication Libraries, Parallel Architectures
Type text
Language English
Relation 10.1.1.25.5413, 10.1.1.101.5043, 10.1.1.57.9273, 10.1.1.131.2650, 10.1.1.114.5773, 10.1.1.108.7727, 10.1.1.67.7822, 10.1.1.74.7815, 10.1.1.79.1652, 10.1.1.86.816, 10.1.1.89.2018, 10.1.1.94.27, 10.1.1.126.6962