Thursday, 19 June 2014

Parallel Computation

Parallelism Concept

Two Fundamental Hardware Techniques Used to Increase Performance
1. Parallelism
2. Pipelining
Parallelism
  • Multiple copies of a hardware unit are used
  • All copies can operate simultaneously
  • Parallelism occurs at many levels of the architecture
  • The term parallel computer is applied when parallelism dominates the entire architecture

Characterizations Of Parallelism
Microscopic vs. macroscopic
Symmetric vs. asymmetric
Fine-grain vs. coarse-grain
Explicit vs. implicit
Types Of Parallel Architectures
Name        Meaning
SISD        Single Instruction, Single Data stream
SIMD        Single Instruction, Multiple Data streams
MIMD        Multiple Instruction, Multiple Data streams
  
Distributed Processing

Distributed processing is processing in which work is carried out on more than one processor in order for a transaction to be completed. In other words, the processing is distributed across two or more machines, and the individual processes are most likely not running at the same time.
The word distributed in terms such as "distributed system", "distributed programming", and "distributed algorithm" originally referred to computer networks where individual computers were physically distributed within some geographical area. The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing. While there is no single definition of a distributed system,[6] the following defining properties are commonly used:
·         There are several autonomous computational entities, each of which has its own local memory
·         The entities communicate with each other by message passing (a minimal sketch of this appears at the end of this section).
In this article, the computational entities are called computers or nodes.
A distributed system may have a common goal, such as solving a large computational problem. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users.
Other typical properties of distributed systems include the following:
·         The system has to tolerate failures in individual computers.
·         The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program.
·         Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.
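The sources above stay at the conceptual level, so here is a minimal sketch (my own illustration, not taken from them) of two autonomous processes on one machine that share no memory and interact only by message passing. It assumes a POSIX system; a pipe created with pipe() stands in for the network link a real distributed system would use.

/* Two autonomous processes communicating only by message passing. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];                            /* fd[0] = read end, fd[1] = write end */
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {                    /* child process: the "sender" node */
        close(fd[0]);
        const char *msg = "result=42";
        write(fd[1], msg, strlen(msg) + 1);   /* send a message */
        close(fd[1]);
        return 0;
    }

    /* parent process: the "receiver" node, with its own local memory */
    close(fd[1]);
    char buf[64] = "";
    read(fd[0], buf, sizeof buf - 1);     /* block until a message arrives */
    printf("received: %s\n", buf);
    close(fd[0]);
    wait(NULL);
    return 0;
}

Each process keeps its own local variables; the only shared state is whatever the processes choose to send as messages, which is exactly the property described in the bullet points above.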

 What is Parallel Architecture?


           A parallel computer is a collection of processing elements that cooperate to solve large problems fast. A parallel architecture extends “computer architecture” to support communication and cooperation among those elements.
• OLD: Instruction Set Architecture
• NEW: Communication Architecture
The communication architecture defines:
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement those interfaces (hardware or software)
Compilers, libraries, and the OS are important bridges.

Communication Architecture = User/System Interface + Implementation
User/System Interface:
• Communication primitives exposed to user level by hardware and system-level software
Implementation:
• Organizational structures that implement the primitives: hardware or OS
• How optimized are they? How well integrated into the processing node?
• Structure of the network
Goals:
• Performance
• Broad applicability
• Programmability
• Scalability
• Low Cost
Introduction to Thread Programming
An Introduction to Programming with Threads
              Many experimental operating systems, and some commercial ones, have recently included support for concurrent programming. The most popular mechanism for this is some provision for allowing multiple lightweight “threads” within a single address space, used from within a single program. Programming with threads introduces new difficulties even for experienced programmers. Concurrent programming has techniques and pitfalls that do not occur in sequential programming. Many of the techniques are obvious, but some are obvious only with hindsight. Some of the pitfalls are comfortable (for example, deadlock is a pleasant sort of bug—your program stops with all the evidence intact), but some take the form of insidious performance penalties.
             
              The purpose of this paper is to give you an introduction to the programming techniques that work well with threads, and to warn you about techniques or interactions that work out badly. It should provide the experienced sequential programmer with enough hints to be able to build a substantial multi-threaded program that works—correctly, efficiently, and with a minimum of surprises.
             
              Having “multiple threads” in a program means that at any instant the program has multiple points of execution, one in each of its threads. The programmer can mostly view the threads as executing simultaneously, as if the computer were endowed with as many processors as there are threads. The programmer is required to decide when and where to create multiple threads, or to accept such decisions made for him by implementers of existing library packages or runtime systems. Additionally, the programmer must occasionally be aware that the computer might not in fact execute all his threads simultaneously.
              Having the threads execute within a “single address space” means that the computer’s addressing hardware is configured so as to permit the threads to read and write the same memory locations. In a high-level language, this usually corresponds to the fact that the off-stack (global) variables are shared among all the threads of the program. Each thread executes on a separate call stack with its own separate local variables. The programmer is responsible for using the synchronization mechanisms of the thread facility to ensure that the shared memory is accessed in a manner that will give the correct answer.
             Thread facilities are always advertised as being “lightweight”. This means that thread creation, existence, destruction and synchronization primitives are cheap enough that the programmer will use them for all his concurrency needs.
              Please be aware that I am presenting you with a selective, biased and idiosyncratic collection of techniques. Selective, because an exhaustive survey would be premature, and would be too exhausting to serve as an introduction—I will be discussing only the most important thread primitives, omitting features such as per-thread context information. Biased, because I present examples, problems and solutions in the context of one particular thread facility.
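Birrell's paper presents its examples in the primitives of one particular threads facility; as a rough analogue, the sketch below uses POSIX threads (my assumption, not the facility used in the paper). Two threads run in a single address space, share the off-stack global counter, and use a mutex so that the shared memory is accessed in a way that gives the correct answer.

/* Minimal POSIX-threads sketch: shared global state protected by a mutex. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* off-stack (global) state shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* synchronize access to shared memory */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);   /* two points of execution */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);                    /* wait for both threads to finish */
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* 200000; without the mutex this is a race */
    return 0;
}

Removing the lock/unlock pair turns the increment into exactly the kind of insidious bug the paper warns about: the program still runs, but the final count is usually wrong.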
Introduction to Programming with CUDA

What is CUDA?
* CUDA Architecture
— Expose general-purpose GPU computing as a first-class capability
— Retain traditional DirectX/OpenGL graphics performance
* CUDA C
— Based on industry-standard C
— A handful of language extensions to allow heterogeneous programs
— Straightforward APIs to manage devices, memory, etc.
* This talk will introduce you to CUDA C

Introduction to CUDA C
* What will you learn today?
— Start from “Hello, World!”
— Write and launch CUDA C kernels
— Manage GPU memory
— Run parallel kernels in CUDA C
— Parallel communication and synchronization
— Race conditions and atomic operations
CUDA C Prerequisites
·          You (probably) need experience with C or C++
·          You do not need any GPU experience
·          You do not need any graphics experience
·          You do not need any parallel programming experience
CUDA C: The Basics
[Figure: the host (CPU and its memory) and the device (GPU and its memory); not to scale]
·         Terminology
·         Host – The CPU and its memory (host memory)
·         Device – The GPU and its memory (device memory)
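As a first taste of that terminology, here is a minimal sketch in the spirit of the talk's "Hello, World!" (the kernel name and launch configuration are illustrative): an empty kernel is launched on the device, while the printing itself is ordinary host code running on the CPU. Compile with nvcc.

// hello.cu: host code runs on the CPU, the kernel runs on the GPU.
#include <stdio.h>

__global__ void mykernel(void) {
    // Device code: runs on the GPU. Empty here; real kernels do the work.
}

int main(void) {
    mykernel<<<1, 1>>>();          // launch 1 block of 1 thread on the device
    cudaDeviceSynchronize();       // wait for the device to finish
    printf("Hello, World!\n");     // host code
    return 0;
}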
Memory Management
·         Host and device memory are distinct entities
      Device pointers point to GPU memory
            — May be passed to and from host code
            — May not be dereferenced in host code
      Host pointers point to CPU memory
            — May be passed to and from device code
            — May not be dereferenced in device code
·         Basic CUDA API for dealing with device memory
      cudaMalloc(), cudaFree(), cudaMemcpy()
      Similar to their C equivalents: malloc(), free(), memcpy()
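To make the memory API concrete, here is a short, illustrative sketch (not taken from the slides) that allocates a device buffer with cudaMalloc(), copies data host-to-device and back with cudaMemcpy(), and releases it with cudaFree(). Error checking is omitted to keep it brief.

// Round-trip a small array through device memory.
#include <stdio.h>

int main(void) {
    const int N = 16;
    int h_in[N], h_out[N];                 // host (CPU) memory
    for (int i = 0; i < N; i++) h_in[i] = i;

    int *d_buf = NULL;                     // device pointer: points to GPU memory,
                                           // must not be dereferenced in host code
    cudaMalloc((void **)&d_buf, N * sizeof(int));

    cudaMemcpy(d_buf, h_in, N * sizeof(int), cudaMemcpyHostToDevice);
    // ... a kernel would normally operate on d_buf here ...
    cudaMemcpy(h_out, d_buf, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    printf("h_out[5] = %d\n", h_out[5]);   // prints 5: the data made the round trip
    return 0;
}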

Sources:
http://www.eecs.wsu.edu/~hauser/teaching/Arch-F07/handouts/Chapter17.pdf
http://www.ask.com/question/what-is-distributed-processing
http://www.cis.upenn.edu/~lee/07cis505/Lec/lec-ch1-DistSys-v4.pdf
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/lect08.4up.pdf
https://birrell.org/andrew/papers/035-Threads.pdf
http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf
http://www.csse.monash.edu.au/~rdp/research/Papers/Parallelism_in_a_computer_architecture_to_support_orientation_changes_in_virtual_reality.pdf
