Parallel Computing Using Transputers

Parallel Computing Using Transputers

Chris Nevison

Colgate University

Hamilton, NY 13346

chris@cs.colgate.edu

Abstract

The transputer is a microprocessor designed specifically for parallel computing. It provides an economical way for putting together a laboratory for teaching parallel computing. We describe how the transputer can be used for teaching parallel computing and some materials that are available for teaching. We describe a demonstration program using the transputer, a parallel prime number sieve.

1. Introduction

It has become imperative that we include parallel computing in the undergraduate curriculum. (Nevison, 1995) discusses how this can be done. Here we describe one means of creating a laboratory for teaching parallel computing, a network of transputers, and some materials which can be used with such a laboratory. We use a parallel prime number sieve to demonstrate how several principles of parallel computing can be taught using a transputer network.

The material in this paper is based on our experience teaching parallel computing to undergraduates at Colgate University and to faculty from many colleges and universities at a series of three summer courses, "Parallel Computing for Undergraduate Faculty," held at Colgate with NSF sponsorship (NSF grants USE-8950067, USE-9054194, USE-9154145). In addition, two NSF sponsored workshops held at Colgate resulted in the production of Laboratories for Parallel Computing (Nevison, et.al, 1994), a collection of laboratories for teaching parallel computing on any message-passing architecture, including networks of workstations. (NSF grants USE-9150871, USE-9156031). I am indebted to the participants in these workshops, many Colgate students, and my colleagues who helped organize and teach the workshops, Tom Myers, Nan Schaller, Paul Tymann, Dan Hyde, and Michael Schneider.

A group of participants in these summer courses organized an interest group which we call the Undergraduate Parallel Computing Consortium (UParCC) to exchange ideas on teaching parallel computing and to organize meetings such as the ones held at Colgate and this one. UParCC is now accessible via a ListServ: to subscribe send a message to majordomo@cs.oswego.edu containing the line subscribe uparcc in the body of the message. In addition, we have established an archive at Colgate for materials relating to teaching parallel computing. It contains the proceedings from a conference "Parallel Computing for Undergraduates" held at Colgate in June, 1994. We hope to add more materials to this site in the

future, and participants at this meeting are invited to submit materials. To access these materials, you can use anonymous ftp to ftpcs.colgate.edu (149.43.80.10) and change to the directory /pub/parallel.

2. The Transputer

The transputer is a family of microprocessors developed by Inmos, Ltd., a member of the SGS-Thomson Microelectronics Group (transputer is a registered trademark of Inmos). It is designed to incorporate hardware assistance for interprocessor communications and intrachip concurrency. The T4 and T8 series transputers were developed in the mid-1980's and are 32 bit processors comparable in speed to Intel 386 and 486 processors. They differ in that the T8 series include a floating point processor. Each chip includes 4K on-chip memory and hardware for sending messages on four bidirectional I/O channels to other transputers. Inmos has recently produced the T9000 series processor, with about ten times the speed for both computation and communication as the T4/T8 series.

The four bidirectional I/O links on each transputer can be directly connected to other transputers with simple connectors. This makes it possible to construct a variety of networks. Several manufacturers produce boards for either MS-DOS PCs or Sun workstations which have one or several transputers on them. Usually one transputer communicates to the host machine via one of the four links and an adapter to connect to the system bus. The other transputers are connected to each other via the links. We have experimented with several network architectures, including the following: pipeline, ring, mesh, hypercube, cube-connected cycles.

In addition to supporting many possible network architectures through the use of message-passing on the links, the transputer also supports concurrency within the chip. There is hardware support for keeping track of job queues and sharing processing time among parallel processes using time-slicing. This makes running parallel processes within one processor very efficient. Furthermore, the message-passing model for on-chip parallel processes and for parallel processes distributed across a network is the same. This makes the process of programming and testing much easier.

The transputer was designed hand-in-hand with the language Occam, a language for message-passing parallel processes based on Hoare's CSP (Communicating Sequential Processes). In addition to Occam, there are parallel versions of C which use the same message-passing model of parallelism. These give two good choices for a language for teaching parallel computing using transputers.

3. Resources for Teaching Parallel Computing

In recent years several introductory textbooks an parallel computing which are suitable for an advanced undergraduate course have appeared. In addition there have been some efforts to integrate parallel computing throughout the undergraduate curriculum materials for doing this under development. See for example, (John, 1992 and 1994) , (Johnson, Kotz and Makedon, 1994) , (Paprzycki and Zalewski, 1994) and (Nevison, 1995). In addition, several presentations at this conference including those by Kotz, Zalewski and Savage, concern the integration of parallel computing throughout the curriculum.

The book Laboratories for Parallel Computing (Nevison, et. al., 1994) presents a collection of laboratories which are suitable for any message-passing architecture including networks of transputers, networks of workstations using PVM or P4, an InteL iPSC or Paragon, or a CM-5. The laboratories in this collection cover a range of applications. They can be used in the context of an introductory course on parallel computing and some of them might be included in another C.S. course with a unit on parallel computing, such as Programming Languages, Graphics, or Algorithms. The example discussed in the next section is taken from this collection.

4. A Parallel Prime Number Sieve: A Case Study

A prime number sieve is a classic benchmark for performance of integer computations. In (Nevison, et. al, 1994) we develop a parallel prime number sieve for a pipeline network of message-passing computers. (It has been implemented on at least three types of machine, transputers, nCube, and networked workstations using PVM). This example has been useful for demonstrating several important aspects of parallel computing including parallel program development, speedup and efficiency, load-balancing, message synchronization issues, and Amdahl's law. Here we will describe the development of the program, load-balancing and Amdahl's law, to give a sense of how this example can be used.

4.1 Program Development

The problem which is to be solved is simply to generate all the prime numbers less than or equal to a target number. This can be done by first generating all the primes less than the square root of the target number, then testing all other odd numbers less than the target by checking whether the remainder on division by one of the test primes is zero. Any zero remainder indicates that the candidate number is composite; only if all remainders are non-zero is the candidate number a prime.

It is helpful when developing a parallel algorithm to visualize the extent of parallelism in the problem. In this example, every test of a candidate by dividing by a test prime can, in principle, be done in parallel. This is illustrated in figure 4. Here, each of the remainders of the candidate, which labels the row, when divided by the test prime, which labels the column, can be computed simultaneously. The final column would have been initialized to "Yes", but any processor in a row which found a zero remainder would write a "No" to this column. Consequently, in principle we could determine all the primes less than the target in O(1) time, if only we had enough processors.

For a realistic number of processors we can harness this potential parallelism in several ways. To set up a pipeline architecture, we view each column as being assigned to a single processor. The candidate numbers are fed through the processors in a stream. When any zero is encountered, that candidate is dropped from the stream. Figure 5 shows what this algorithm looks like.

For a realistic problem, we will have far fewer processors than we have test primes. Consequently we must assign several test primes to each processor. We just group the processes in the pipeline shown in figure 5 and assign one group to each processor.

Rather than run many parallel processes on each processor, it is much more efficient to store the test primes in an array and use one process on each processor to do the checking. In addition, one or two buffer processes keep time discrepancies between neighboring processors from causing bottlenecks. We end up with a layout of the algorithm as shown in figure 6.

4.2 Load-Balancing

A naive approach to mapping the algorithm to the architecture would put a roughly equal number of filters for test primes on each processor. Timing this version of the program against the best sequential version results in rather poor speedup times. This is not surprising given a little thought about what each processor is doing. The first test prime filter will eliminate 1/3 of the candidates, the next will eliminate about 1/5 of those that remain, and so on. Consequently, the filters early in the pipeline "see" many more candidate than those later in the pipeline. With equal numbers of filters on each processors, the processors further down the pipeline will be doing much less work than the first few processors. Some means of balancing the load is needed.

An investigation of some prime number theory shows that the number of candidate numbers that remain in the stream by the time the kth prime, pk, is reached is approximately proportional to 1 / log pk. If we assign a load factor proportional to this to each filter, then assign filters to the processors so as to make the distribution of the total load as even as possible, then we get much better speedup results. The actual speedups depend, of course, on the target number and the number of processors used.

For example, on four processors generating all primes less than 1,000,000, the speedup without load-balancing is about 1.6 and efficiency 40%. However, with load-balancing, the speedup is 3.2 and efficiency 80%, (The particular figures depend on the language, compiler and systems being used. These figures are for T414 processors running Occam under the Inmos TDS system.)

4.3 Amdahl's Law

Several years ago Eugene Amdahl (1967) suggested that there are limits on how much parallel computing can gain for us in computing power. This is because any problem has an inherently serial part. Consequently, the computation time can never exceed the time needed for that serial part, thus putting a ceiling on the speedup which can be attained. For example, in our sieve algorithm the test primes and the load-balancing are computed before the parallel algorithm kicks in. Amdahl's law can be demonstrated by running the same problem on more and more processors. For example, the time needed to find all primes less than 1,000,000 for more and more processors is shown in the graph in figure 7. We see that we get good efficiency up to about 17 processors, but then the speedup curve levels off. Amdahl's law has taken over.

Gustafson, Montry, and Benner (1988) developed a response to Amdahl: The purpose of doing parallel computing is to do larger and larger problems. If the inherently sequential part of the problem does not grow in size as fast as the part which can be done in parallel, then larger parallel computers can be effective in solving these larger problems. In our example, the sequential part of the computation involves time proportional to the square root of the overall problem size, so this holds true. In figure 8 we see speedup curves for computing all primes less than 1,000,000, less than 9,000,000, and less than 100,000,000, for up to 64 processors. As the problem size grows, the advantage of parallel computing for more and more processors grows as well. However, the problem size is really much larger: for computing all primes less than 1,000,000 we reduce a single processor time of just under a minute to less than 20 seconds; but for all primes less than 100,000,000 we reduce about eight hours computation time to about fifteen minutes.

5. Conclusions

A laboratory suitable for teaching parallel computing can be constructed using a transputer network. This is a cost-effective way to enable students to do real parallel computing. There are instructional materials including laboratories available for use with this type of equipment, as well as with several other systems. The parallel prime number sieve is an example of how a manageable parallel program run on a transputer network, or some other message-passing system, can be used to demonstrate many of the principles of parallel computing.

6. References

Amdahl, Eugene, 1967. "Validity of the Sin-gle Processor Approach to Achieving Large-Scale Computer Capabilities," AFIPS Conference Proceedings 30, 483-485.

Gustafson, Montry, and Benner, 1988. "Development of Parallel Methods for a 1024-Processor Hypercube," SIAM Journal on Scientific and Statistical Computing 9.

John, David, 1992. "Integration of Parallel Computation into Introductory Computer Science," SIGCSE Bulletin, v 24, n. 1. pp. 281-285.

John, David, 1994. "NSF Supported Projects: Parallel Computation as an Integrated Component in the Undergraduate Curriculum in Computer Science," SIGCSE Bulletin, v 26, n. 1. pp. 357-361.

Johnson, Donald, David Kotz and Fillia Makedon, 1994. "Teaching Parallel Computation to Freshmen," Parallel Computing for Undergraduates, conference proceedings. Hamilton, NY: Colgate University. (Also available in electronic form via ftp.)

Nevison, Christopher H., 1995. "Parallel Computing in the Undergraduate Curriculum," IEEE Computer, to appear.

Nevison, Christopher H., Daniel C. Hyde, G. Michael Schneider, and Paul T Tymann, eds., 1994. Laboratories for Parallel Computing. Boston: Jones and Bartlett.

Nevison, Christopher, ed. 1994. Parallel Computing for Undergraduates. Hamilton, NY: Colgate University.

Paprzycki, Marcin and Janusz Zalewski, 1994. "Teaching Parallel Computing without a Separate Course," Parallel Computing for Undergraduates, conference proceedings. Hamilton, NY: Colgate University. (Also to be available in electronic form via ftp.)