 
  
  
  
  
 
We suggest the following approach to obtain high performance with ScaLAPACK codes:
The standard data distribution will typically achieve 25-50% 
of the peak performance possible (depending 
in part on how many processors are ignored, i.e., the difference
between the number of processors available and the number actually
used in the process grid).  We do not
recommend experimenting with different data distributions until
acceptable (or nearly acceptable) performance has been achieved.
If each individual node requires a block size larger than 64 to 
achieve near-peak performance on local matrix-matrix multiply,
the block size may have to be increased.  This step is unlikely to be
necessary, however, unless each node of the computer is a shared-memory
multiprocessor with more than four processors.
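
To make the notion of "ignored" processors concrete, the following Python sketch counts how many processors are left out of a process grid. It assumes, purely for illustration, that a square grid is requested and that any processors beyond the largest square that fits are left idle. The function names square_grid and grid_options are hypothetical and are not part of ScaLAPACK or the BLACS.

```python
import math


def square_grid(total_procs):
    """Return (rows, cols, ignored) for the largest square grid that fits.

    Assumption for illustration: processors that do not fit into the
    square grid are simply left unused.
    """
    if total_procs < 1:
        raise ValueError("need at least one processor")
    side = math.isqrt(total_procs)  # floor of the square root
    return side, side, total_procs - side * side


def grid_options(total_procs):
    """List (rows, cols, ignored) for every row count up to the square case.

    For each number of rows, the widest grid that fits is chosen, so
    'ignored' is the remainder of total_procs divided by rows.
    """
    if total_procs < 1:
        raise ValueError("need at least one processor")
    options = []
    for rows in range(1, math.isqrt(total_procs) + 1):
        cols = total_procs // rows
        options.append((rows, cols, total_procs - rows * cols))
    return options


if __name__ == "__main__":
    for p in (16, 30, 31):
        rows, cols, ignored = square_grid(p)
        print(f"P={p}: square grid {rows} x {cols}, {ignored} ignored")
        for option in grid_options(p):
            print("   option (rows, cols, ignored):", option)
```

With 30 processors, for example, the square grid is 5 x 5 and leaves 5 processors idle, whereas a 5 x 6 grid uses all 30. Whether the nearly square grid actually performs better depends on the routine and the machine, so a count like this is only a starting point for the tuning described above.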