Basic Linear Algebra Subprograms
View on Wikipedia
| BLAS | |
|---|---|
| Stable release | 3.11.0 / 11 November 2022 |
| Written in | depends on implementation |
| Platform | Cross-platform |
| Type | Library |
| Website | www |
Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C ("CBLAS interface") and Fortran ("BLAS interface"). Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.
It originated as a Fortran library in 1979[1] and its interface was standardized by the BLAS Technical (BLAST) Forum, whose latest BLAS report can be found on the netlib website.[2] This Fortran library is known as the reference implementation (sometimes confusingly referred to as the BLAS library) and is not optimized for speed but is in the public domain.[3][4]
Most libraries that offer linear algebra routines conform to the BLAS interface, allowing library users to develop programs that are indifferent to the BLAS library being used.
Many BLAS libraries have been developed, targeting various hardware platforms. Examples include cuBLAS (NVIDIA GPU, GPGPU), rocBLAS (AMD GPU), and OpenBLAS. Examples of CPU-based BLAS library branches include OpenBLAS, BLIS (BLAS-like Library Instantiation Software), Arm Performance Libraries,[5] ATLAS, and Intel Math Kernel Library (iMKL). AMD maintains a fork of BLIS that is optimized for the AMD platform.[6] ATLAS is a portable library that automatically optimizes itself for an arbitrary architecture. iMKL is a freeware[7] and proprietary[8] vendor library optimized for x86 and x86-64 with a performance emphasis on Intel processors.[9] OpenBLAS is an open-source library that is hand-optimized for many popular architectures. The LINPACK benchmarks rely heavily on the BLAS routine gemm for their performance measurements.
Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including LAPACK, LINPACK, Armadillo, GNU Octave, Mathematica,[10] MATLAB,[11] NumPy,[12] R, Julia and Lisp-Stat.
The C++ std::linalg library, introduced in C++26, is based on BLAS.
Background
With the advent of numerical programming, sophisticated subroutine libraries became useful. These libraries would contain subroutines for common high-level mathematical operations such as root finding, matrix inversion, and solving systems of equations. The language of choice was FORTRAN. The most prominent numerical programming library was IBM's Scientific Subroutine Package (SSP).[13] These subroutine libraries allowed programmers to concentrate on their specific problems and avoid re-implementing well-known algorithms. The library routines would also be better than average implementations; matrix algorithms, for example, might use full pivoting to get better numerical accuracy. The library routines would also be more efficient; for example, a library might include a specialized routine to solve a linear system with an upper triangular matrix. The libraries would include single-precision and double-precision versions of some algorithms.
Initially, these subroutines used hard-coded loops for their low-level operations. For example, if a subroutine needed to perform a matrix multiplication, then the subroutine would have three nested loops. Linear algebra programs have many common low-level operations (the so-called "kernel" operations, not related to operating systems).[14] Between 1973 and 1977, several of these kernel operations were identified.[15] These kernel operations became defined subroutines that math libraries could call. The kernel calls had advantages over hard-coded loops: the library routine would be more readable, there were fewer chances for bugs, and the kernel implementation could be optimized for speed. A specification for these kernel operations using scalars and vectors, the level-1 Basic Linear Algebra Subroutines (BLAS), was published in 1979.[16] BLAS was used to implement the linear algebra subroutine library LINPACK.
The BLAS abstraction allows customization for high performance. For example, LINPACK is a general purpose library that can be used on many different machines without modification. LINPACK could use a generic version of BLAS. To gain performance, different machines might use tailored versions of BLAS. As computer architectures became more sophisticated, vector machines appeared. BLAS for a vector machine could use the machine's fast vector operations. (While vector processors eventually fell out of favor, vector instructions in modern CPUs are essential for optimal performance in BLAS routines.)
Other machine features became available and could also be exploited. Consequently, BLAS was augmented from 1984 to 1986 with level-2 kernel operations that concerned vector-matrix operations. Memory hierarchy was also recognized as something to exploit. Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. In 1987 and 1988, the level 3 BLAS were identified to do matrix-matrix operations. The level 3 BLAS encouraged block-partitioned algorithms. The LAPACK library uses level 3 BLAS.[17]
The original BLAS concerned only densely stored vectors and matrices. Further extensions to BLAS, such as for sparse matrices, have been addressed.[18]
Functionality
BLAS functionality is categorized into three sets of routines called "levels", which correspond both to the chronological order of definition and publication and to the degree of the polynomial in the complexity of the algorithms: Level 1 BLAS operations typically take linear time, O(n), Level 2 operations quadratic time, and Level 3 operations cubic time.[19] Modern BLAS implementations typically provide all three levels.
Level 1
This level consists of all the routines described in the original presentation of BLAS (1979),[1] which defined only vector operations on strided arrays: dot products, vector norms, a generalized vector addition of the form

y ← αx + y

(called "axpy", "a x plus y") and several other operations.
Level 2
This level contains matrix-vector operations including, among other things, a generalized matrix-vector multiplication (gemv):

y ← αAx + βy

as well as a solver for x in the linear equation

Tx = y

with T being triangular. Design of the Level 2 BLAS started in 1984, with results published in 1988.[20] The Level 2 subroutines are especially intended to improve performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."[20]
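A sketch of the two Level 2 operations just described, through the CBLAS bindings; the header name <cblas.h> and the tiny 2 × 2 example data are assumptions for illustration only.

```c
/* gemv followed by a triangular solve (trsv), column-major storage. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1.0, 2.0, 3.0, 4.0};   /* A = [1 3; 2 4], lda = 2 */
    double x[2] = {1.0, 1.0};
    double y[2] = {0.0, 0.0};

    /* y := 1.0*A*x + 0.0*y */
    cblas_dgemv(CblasColMajor, CblasNoTrans, 2, 2, 1.0, A, 2, x, 1, 0.0, y, 1);

    /* Solve T*z = y in place, with T the upper triangle of A. */
    cblas_dtrsv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit,
                2, A, 2, y, 1);

    printf("solution = [%g %g]\n", y[0], y[1]);
    return 0;
}
```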
Level 3
This level, formally published in 1990,[19] contains matrix-matrix operations, including a "general matrix multiplication" (gemm), of the form

C ← αAB + βC

where A and B can optionally be transposed or Hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication AB can be performed by setting α to one and C to an all-zeros matrix of the appropriate size.
Also included in Level 3 are routines for computing

B ← αT⁻¹B

where T is a triangular matrix, among other functionality.
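The same pattern at Level 3, sketched through CBLAS (header name <cblas.h> and the 2 × 2 data are assumptions): a gemm call followed by a triangular solve with multiple right-hand sides (trsm), the routine that implements the B ← αT⁻¹B operation above.

```c
/* gemm and trsm on small column-major matrices. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1.0, 2.0, 3.0, 4.0};   /* A = [1 3; 2 4] */
    double B[4] = {1.0, 0.0, 0.0, 1.0};   /* B = I          */
    double C[4] = {0.0, 0.0, 0.0, 0.0};

    /* C := 1.0*A*B + 0.0*C  (so C becomes a copy of A) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    /* B := 1.0 * inv(T) * B, with T the upper triangle of A. */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans,
                CblasNonUnit, 2, 2, 1.0, A, 2, B, 2);

    printf("C[0] = %g, B[0] = %g\n", C[0], B[0]);
    return 0;
}
```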
Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,[21] and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, gemm is a prime target of optimization for BLAS implementers. For example, by decomposing one or both of A, B into block matrices, gemm can be implemented recursively. This is one of the motivations for including the β parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case β = 1, which many implementations optimize for, thereby eliminating one multiplication for each value of C. This decomposition allows for better locality of reference both in space and time of the data used in the product. This, in turn, takes advantage of the cache on the system.[22] For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortizing of copying to contiguous memory to reduce TLB misses, is superior to ATLAS.[23] A highly tuned implementation based on these ideas is part of GotoBLAS, OpenBLAS, and BLIS.
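A minimal sketch of the block decomposition described above, with a fixed block size NB chosen as an assumption; this is not the ATLAS or GotoBLAS kernel, only an illustration of how block products are accumulated into C (the β = 1 special case) after a single initial scaling by β.

```c
/* Cache-blocked C := alpha*A*B + beta*C, column-major, for illustration. */
#include <stddef.h>

#define NB 64   /* block size: small enough that three NB x NB blocks fit in cache */

static void blocked_dgemm(int m, int n, int k, double alpha,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double beta, double *C, int ldc)
{
    /* Scale C by beta once; every block product below is then accumulated. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[i + (size_t)j * ldc] *= beta;

    for (int jj = 0; jj < n; jj += NB)
        for (int pp = 0; pp < k; pp += NB)
            for (int ii = 0; ii < m; ii += NB) {
                int nb = (jj + NB < n) ? NB : n - jj;
                int kb = (pp + NB < k) ? NB : k - pp;
                int mb = (ii + NB < m) ? NB : m - ii;
                /* "Kernel": C(ii:,jj:) += alpha * A(ii:,pp:) * B(pp:,jj:) */
                for (int j = 0; j < nb; ++j)
                    for (int p = 0; p < kb; ++p) {
                        double t = alpha * B[(pp + p) + (size_t)(jj + j) * ldb];
                        for (int i = 0; i < mb; ++i)
                            C[(ii + i) + (size_t)(jj + j) * ldc]
                                += t * A[(ii + i) + (size_t)(pp + p) * lda];
                    }
            }
}
```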
A common variation of gemm is the gemm3m, which calculates a complex product using "three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions", an algorithm similar to the Strassen algorithm first described by Peter Ungar.[24]
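One common way to reach that operation count is the standard "3M" identity, shown here as a sketch (the exact ordering of the additions varies between implementations). For A = Aᵣ + i·Aᵢ and B = Bᵣ + i·Bᵢ:

T₁ = Aᵣ·Bᵣ,  T₂ = Aᵢ·Bᵢ,  T₃ = (Aᵣ + Aᵢ)·(Bᵣ + Bᵢ)

Re(AB) = T₁ − T₂,  Im(AB) = T₃ − T₁ − T₂

which uses three real matrix multiplications (T₁, T₂, T₃) and five real matrix additions, compared with the four multiplications and two additions of the direct formula Re(AB) = AᵣBᵣ − AᵢBᵢ, Im(AB) = AᵣBᵢ + AᵢBᵣ.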
Implementations
- Accelerate
- Apple's framework for macOS and iOS, which includes tuned versions of BLAS and LAPACK.[25][26]
- Arm Performance Libraries
- Arm Performance Libraries, supporting Arm 64-bit AArch64-based processors, available from Arm.[5]
- ATLAS
- Automatically Tuned Linear Algebra Software, an open source implementation of BLAS APIs for C and Fortran 77.[27]
- BLIS
- BLAS-like Library Instantiation Software framework for rapid instantiation. Optimized for most modern CPUs. BLIS is a complete refactoring of the GotoBLAS that reduces the amount of code that must be written for a given platform.[28][29]
- C++ AMP BLAS
- The C++ AMP BLAS Library is an open source implementation of BLAS for Microsoft's AMP language extension for Visual C++.[30]
- cuBLAS
- Optimized BLAS for Nvidia-based GPU cards, requiring few additional library calls.[31]
- NVBLAS
- Optimized BLAS for Nvidia-based GPU cards, providing only Level 3 functions, but as direct drop-in replacement for other BLAS libraries.[32]
- clBLAS
- An OpenCL implementation of BLAS by AMD. Part of the AMD Compute Libraries.[33]
- clBLAST
- A tuned OpenCL implementation of most of the BLAS API.[34]
- Eigen BLAS
- A Fortran 77 and C BLAS library implemented on top of the MPL-licensed Eigen library, supporting x86, x86-64, ARM (NEON), and PowerPC architectures.
- ESSL
- IBM's Engineering and Scientific Subroutine Library, supporting the PowerPC architecture under AIX and Linux.[35]
- GotoBLAS
- Kazushige Goto's BSD-licensed implementation of BLAS, tuned in particular for Intel Nehalem/Atom, VIA Nano, and AMD Opteron.[36]
- GNU Scientific Library
- Multi-platform implementation of many numerical routines. Contains a CBLAS interface.
- HP MLIB
- HP's Math library supporting IA-64, PA-RISC, x86 and Opteron architecture under HP-UX and Linux.
- Intel MKL
- The Intel Math Kernel Library, supporting x86 32-bits and 64-bits, available free from Intel.[7] Includes optimizations for Intel Pentium, Core and Intel Xeon CPUs and Intel Xeon Phi; support for Linux, Windows and macOS.[37]
- MathKeisan
- NEC's math library, supporting NEC SX architecture under SUPER-UX, and Itanium under Linux.[38]
- Netlib BLAS
- The official reference implementation on Netlib, written in Fortran 77.[39]
- Netlib CBLAS
- Reference C interface to the BLAS. It is also possible (and popular) to call the Fortran BLAS from C.[40]
- OpenBLAS
- Optimized BLAS based on GotoBLAS, supporting x86, x86-64, MIPS and ARM processors.[41]
- PDLIB/SX
- NEC's Public Domain Mathematical Library for the NEC SX-4 system.[42]
- rocBLAS
- Implementation that runs on AMD GPUs via ROCm.[43]
- SCSL
- SGI's Scientific Computing Software Library contains BLAS and LAPACK implementations for SGI's Irix workstations.[44]
- Sun Performance Library
- Optimized BLAS and LAPACK for SPARC, Core and AMD64 architectures under Solaris 8, 9, and 10 as well as Linux.[45]
- uBLAS
- A generic C++ template class library providing BLAS functionality. Part of the Boost library. It provides bindings to many hardware-accelerated libraries in a unifying notation. Moreover, uBLAS focuses on correctness of the algorithms using advanced C++ features.[46]
Libraries using BLAS
- Armadillo
- Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. It employs template classes, and has optional links to BLAS/ATLAS and LAPACK. It is sponsored by NICTA (in Australia) and is licensed under a free license.[47]
- LAPACK
- LAPACK is a higher-level linear algebra library built upon BLAS. Like BLAS, it has a reference implementation, alongside many alternatives such as libFlame and MKL.
- Mir
- An LLVM-accelerated generic numerical library for science and machine learning written in D. It provides generic linear algebra subprograms (GLAS). It can be built on a CBLAS implementation.[48]
Similar libraries (not compatible with BLAS)
- Elemental
- Elemental is an open source software for distributed-memory dense and sparse-direct linear algebra and optimization.[49]
- HASEM
- HASEM is a C++ template library able to solve linear equations and compute eigenvalues. It is licensed under the BSD License.[50]
- LAMA
- The Library for Accelerated Math Applications (LAMA) is a C++ template library for writing numerical solvers targeting various kinds of hardware (e.g. GPUs through CUDA or OpenCL) on distributed memory systems, hiding the hardware-specific programming from the program developer.
- MTL4
- The Matrix Template Library version 4 is a generic C++ template library providing sparse and dense BLAS functionality. MTL4 establishes an intuitive interface (similar to MATLAB) and broad applicability thanks to generic programming.
Sparse BLAS
Several extensions to BLAS for handling sparse matrices have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.[51]
Batched BLAS
The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism, such as GPUs. Here, the traditional BLAS functions typically provide good performance for large matrices. However, when computing, for example, matrix-matrix products of many small matrices with the GEMM routine, these architectures show significant performance losses. To address this issue, a batched version of the BLAS functions was specified in 2017.[52]
Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:

C[k] ← αA[k]B[k] + βC[k]

The index [k] in square brackets indicates that the operation is performed for all matrices k in a stack. Often, this operation is implemented for a strided batched memory layout where all matrices are stored concatenated in the arrays A, B, and C.
Batched BLAS functions can be a versatile tool and allow, for example, a fast implementation of exponential integrators and Magnus integrators that handle long integration periods with many time steps.[53] Here, the matrix exponentiation, the computationally expensive part of the integration, can be implemented in parallel for all time steps by using Batched BLAS functions.
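A sketch of the strided batched layout described above, written as a plain loop over CBLAS gemm calls; vendor libraries expose this as a single fused routine instead, so the function name, the assumption of contiguous packing, and the header <cblas.h> are illustrative only.

```c
/* C[i] := alpha*A[i]*B[i] + beta*C[i] for i = 0..batch-1, with the
 * matrices stored back to back: A[i] starts at A + i*m*k, and so on. */
#include <cblas.h>

void batched_dgemm_loop(int batch, int m, int n, int k,
                        double alpha, const double *A, const double *B,
                        double beta, double *C)
{
    const long strideA = (long)m * k;
    const long strideB = (long)k * n;
    const long strideC = (long)m * n;

    for (int i = 0; i < batch; ++i)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha,
                    A + i * strideA, m,
                    B + i * strideB, k,
                    beta,
                    C + i * strideC, m);
}
```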
See also
- List of numerical libraries
- Math Kernel Library, math library optimized for the Intel architecture; includes BLAS, LAPACK
- Numerical linear algebra, the type of problem BLAS solves
References
- ^ a b Lawson, C. L.; Hanson, R. J.; Kincaid, D.; Krogh, F. T. (1979). "Basic Linear Algebra Subprograms for FORTRAN usage". ACM Trans. Math. Softw. 5 (3): 308–323. doi:10.1145/355841.355847. hdl:2060/19780018835. S2CID 6585321. Algorithm 539.
- ^ "BLAS Technical Forum". netlib.org. Retrieved 2017-07-07.
- ^ blaseman Archived 2016-10-12 at the Wayback Machine "The products are the implementations of the public domain BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), which have been developed by groups of people such as Prof. Jack Dongarra, University of Tennessee, USA and all published on the WWW (URL: https://www.netlib.org/)."[permanent dead link]
- ^ Jack Dongarra; Gene Golub; Eric Grosse; Cleve Moler; Keith Moore. "Netlib and NA-Net: building a scientific computing community" (PDF). netlib.org. Retrieved 2016-02-13.
The Netlib software repository was created in 1984 to facilitate quick distribution of public domain software routines for use in scientific computation.
- ^ a b "Arm Performance Libraries". Arm. 2020. Retrieved 2020-12-16.
- ^ "BLAS Library".
- ^ a b "No Cost Options for Intel Math Kernel Library (MKL), Support yourself, Royalty-Free". Intel. 2015. Retrieved 2015-08-31.
- ^ "Intel Math Kernel Library (Intel MKL)". Intel. 2015. Retrieved 2015-08-25.
- ^ "Optimization Notice". Intel. 2012. Retrieved 2013-04-10.
- ^ Douglas Quinney (2003). "So what's new in Mathematica 5.0?" (PDF). MSOR Connections. 3 (4). The Higher Education Academy. Archived from the original (PDF) on 2013-10-29.
- ^ Cleve Moler (2000). "MATLAB Incorporates LAPACK". MathWorks. Retrieved 2013-10-26.
- ^ Stéfan van der Walt; S. Chris Colbert & Gaël Varoquaux (2011). "The NumPy array: a structure for efficient numerical computation". Computing in Science and Engineering. 13 (2): 22–30. arXiv:1102.1523. Bibcode:2011CSE....13b..22V. doi:10.1109/MCSE.2011.37. S2CID 16907816.
- ^ Boisvert, Ronald F. (2000). "Mathematical software: past, present, and future". Mathematics and Computers in Simulation. 54 (4–5): 227–241. arXiv:cs/0004004. Bibcode:2000cs........4004B. doi:10.1016/S0378-4754(00)00185-3. S2CID 15157725.
- ^ Even the SSP (which appeared around 1966) had some basic routines such as RADD (add rows), CADD (add columns), SRMA (scale row and add to another row), and RINT (row interchange). These routines apparently were not used as kernel operations to implement other routines such as matrix inversion. See IBM (1970), System/360 Scientific Subroutine Package, Version III, Programmer's Manual (5th ed.), International Business Machines, GH20-0205-4.
- ^ BLAST Forum 2001, p. 1.
- ^ Lawson et al. 1979.
- ^ BLAST Forum 2001, pp. 1–2.
- ^ BLAST Forum 2001, p. 2.
- ^ a b Dongarra, Jack J.; Du Croz, Jeremy; Hammarling, Sven; Duff, Iain S. (1990). "A set of level 3 basic linear algebra subprograms". ACM Transactions on Mathematical Software. 16 (1): 1–17. doi:10.1145/77626.79170. ISSN 0098-3500. S2CID 52873593.
- ^ a b Dongarra, Jack J.; Du Croz, Jeremy; Hammarling, Sven; Hanson, Richard J. (1988). "An extended set of FORTRAN Basic Linear Algebra Subprograms". ACM Trans. Math. Softw. 14: 1–17. CiteSeerX 10.1.1.17.5421. doi:10.1145/42288.42291. S2CID 3579623.
- ^ Goto, Kazushige; van de Geijn, Robert A. (2008). "High-performance implementation of the level-3 BLAS" (PDF). ACM Transactions on Mathematical Software. 35 (1): 1–14. doi:10.1145/1377603.1377607. S2CID 14722514. Archived from the original (PDF) on 2017-07-06.
- ^ Golub, Gene H.; Van Loan, Charles F. (1996), Matrix Computations (3rd ed.), Johns Hopkins, ISBN 978-0-8018-5414-9
- ^ Goto, Kazushige; van de Geijn, Robert A. (2008). "Anatomy of High-Performance Matrix Multiplication". ACM Transactions on Mathematical Software. 34 (3): 12:1–12:25. CiteSeerX 10.1.1.111.3873. doi:10.1145/1356052.1356053. ISSN 0098-3500. S2CID 9359223. (25 pages) [1]
- ^ Van Zee, Field G.; Smith, Tyler M. (2017-07-24). "Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods". ACM Transactions on Mathematical Software. 44 (1): 1–36. doi:10.1145/3086466. S2CID 25580883.
- ^ "Guides and Sample Code". developer.apple.com. Retrieved 2017-07-07.
- ^ "Guides and Sample Code". developer.apple.com. Retrieved 2017-07-07.
- ^ "Automatically Tuned Linear Algebra Software (ATLAS)". math-atlas.sourceforge.net. Retrieved 2017-07-07.
- ^ blis: BLAS-like Library Instantiation Software Framework, flame, 2017-06-30, retrieved 2017-07-07
- ^ BLIS GitHub Repository, 2021-10-15
- ^ "C++ AMP BLAS Library". CodePlex. Archived from the original on 2017-07-08. Retrieved 2017-07-07.
- ^ "cuBLAS". NVIDIA Developer. 2013-07-29. Retrieved 2017-07-07.
- ^ "NVBLAS". NVIDIA Developer. 2018-05-15. Retrieved 2018-05-15.[permanent dead link]
- ^ clBLAS: a software library containing BLAS functions written in OpenCL, clMathLibraries, 2017-07-03, retrieved 2017-07-07
- ^ Nugteren, Cedric (2017-07-05), CLBlast: Tuned OpenCL BLAS, retrieved 2017-07-07
- ^ IBM Knowledge Centre: Engineering and Scientific Subroutine Library
- ^ Milfeld, Kent. "GotoBLAS2". Texas Advanced Computing Center. Archived from the original on 2020-03-23. Retrieved 2024-03-17.
- ^ "Intel Math Kernel Library (Intel MKL) | Intel Software". software.intel.com. Retrieved 2017-07-07.
- ^ Mathkeisan, NEC. "MathKeisan". www.mathkeisan.com. Retrieved 2017-07-07.
- ^ "BLAS (Basic Linear Algebra Subprograms)". www.netlib.org. Retrieved 2017-07-07.
- ^ "BLAS (Basic Linear Algebra Subprograms)". www.netlib.org. Retrieved 2017-07-07.
- ^ "OpenBLAS: An optimized BLAS library". www.openblas.net. Retrieved 2017-07-07.
- ^ "PDLIB/SX: Business Solution | NEC". Archived from the original on 2007-02-22. Retrieved 2007-05-20.
- ^ "rocBLAS". rocmdocs.amd.com. Archived from the original on 2021-05-22. Retrieved 2021-05-21.
- ^ "SGI - SCSL Scientific Library: Home Page". Archived from the original on 2007-05-13. Retrieved 2007-05-20.
- ^ "Oracle Developer Studio". www.oracle.com. Retrieved 2017-07-07.
- ^ "Boost Basic Linear Algebra - 1.60.0". www.boost.org. Retrieved 2017-07-07.
- ^ "Armadillo: C++ linear algebra library". arma.sourceforge.net. Retrieved 2017-07-07.
- ^ "Dlang Numerical and System Libraries". GitHub.
- ^ "Elemental: distributed-memory dense and sparse-direct linear algebra and optimization — Elemental". libelemental.org. Archived from the original on 2013-08-01. Retrieved 2017-07-07.
- ^ "HASEM". SourceForge. 2015-08-17. Retrieved 2017-07-07.
- ^ Duff, Iain S.; Heroux, Michael A.; Pozo, Roldan (2002). "An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum". ACM Transactions on Mathematical Software. 28 (2): 239–267. doi:10.1145/567806.567810. S2CID 9411006.
- ^ Dongarra, Jack; Hammarling, Sven; Higham, Nicholas J.; Relton, Samuel D.; Valero-Lara, Pedro; Zounon, Mawussi (2017). "The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems". Procedia Computer Science. 108: 495–504. doi:10.1016/j.procs.2017.05.138. hdl:2117/106913.
- ^ Herb, Konstantin; Welter, Pol (2022). "Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines". Computer Physics Communications. 270 108181. arXiv:2108.07126. Bibcode:2022CoPhC.27008181H. doi:10.1016/j.cpc.2021.108181. S2CID 237091802.
Further reading
- BLAST Forum (2001-08-21), Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard, Knoxville, TN: University of Tennessee
- Dodson, D. S.; Grimes, R. G. (1982), "Remark on algorithm 539: Basic Linear Algebra Subprograms for Fortran usage", ACM Trans. Math. Softw., 8 (4): 403–404, doi:10.1145/356012.356020, S2CID 43081631
- Dodson, D. S. (1983), "Corrigendum: Remark on "Algorithm 539: Basic Linear Algebra Subroutines for FORTRAN usage"", ACM Trans. Math. Softw., 9: 140, doi:10.1145/356022.356032, S2CID 22163977
- J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm 656: An extended set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Softw., 14 (1988), pp. 18–32.
- J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Softw., 16 (1990), pp. 1–17.
- J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, Algorithm 679: A set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Softw., 16 (1990), pp. 18–28.
- New BLAS
- L. S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, An Updated Set of Basic Linear Algebra Subprograms (BLAS), ACM Trans. Math. Softw., 28-2 (2002), pp. 135–151.
- J. Dongarra, Basic Linear Algebra Subprograms Technical Forum Standard, International Journal of High Performance Applications and Supercomputing, 16(1) (2002), pp. 1–111, and International Journal of High Performance Applications and Supercomputing, 16(2) (2002), pp. 115–199.
External links
- BLAS homepage on Netlib.org
- BLAS FAQ
- BLAS Quick Reference Guide from LAPACK Users' Guide
- Lawson Oral History One of the original authors of the BLAS discusses its creation in an oral history interview. Charles L. Lawson Oral history interview by Thomas Haigh, 6 and 7 November 2004, San Clemente, California. Society for Industrial and Applied Mathematics, Philadelphia, PA.
- Dongarra Oral History In an oral history interview, Jack Dongarra explores the early relationship of BLAS to LINPACK, the creation of higher level BLAS versions for new architectures, and his later work on the ATLAS system to automatically optimize BLAS for particular machines. Jack Dongarra, Oral history interview by Thomas Haigh, 26 April 2005, University of Tennessee, Knoxville TN. Society for Industrial and Applied Mathematics, Philadelphia, PA
- How does BLAS get such extreme performance? Ten naive 1000×1000 matrix multiplications (10¹⁰ floating point multiply-adds) take 15.77 seconds on a 2.6 GHz processor; a BLAS implementation takes 1.32 seconds.
- An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum [2]
Basic Linear Algebra Subprograms
View on Grokipedia
History and Development
Origins in Numerical Computing
The development of the Basic Linear Algebra Subprograms (BLAS) originated in the early 1970s at the Jet Propulsion Laboratory (JPL), driven by the growing need for portable and efficient computational kernels in scientific and engineering applications, particularly for solving large-scale linear systems and eigenvalue problems on emerging high-performance computers. Researchers recognized that numerical software libraries, such as those used in physics simulations and optimization, required standardized building blocks to ensure reliability and performance across diverse hardware architectures, avoiding the inefficiencies of ad hoc implementations.[6] This initiative was led by C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh at JPL, with foundational work later integrated at Argonne National Laboratory for projects like LINPACK, where the focus was on creating modular routines that could abstract low-level vector mathematics, allowing higher-level algorithms to remain hardware-agnostic while benefiting from vendor-optimized code.[7]

Jack Dongarra at Argonne National Laboratory contributed to integrating BLAS into LINPACK, a comprehensive package for linear equation solving developed from 1976 to 1979 with collaborators including Jim Bunch, Cleve Moler, and others, that used these subprograms as its computational core.[6] The University of Tennessee later became involved as Dongarra transitioned there, facilitating ongoing refinements. The emphasis was on vector operations, such as dot products, saxpy (scalar-vector multiplication and addition), and norms, essential for iterative methods in linear algebra, ensuring these kernels could be reused across multiple algorithms without redundancy.[6]

The first proposal emerged in 1973 as a collection of Fortran subroutines, implementing basic vector mathematics with a simple, callable interface that prioritized modularity to enable machine-specific tuning without altering user-facing code. This design allowed implementers to replace generic versions with optimized ones tailored to particular processors, fostering portability while maximizing computational efficiency in resource-constrained environments. A final report was issued in 1977, with publication in 1979.[7][6]

Early challenges centered on achieving portability across supercomputers like the Cray-1 vector machines and IBM systems, which featured disparate instruction sets, memory models, and floating-point precisions that complicated uniform performance. At the time, numerical libraries suffered from a lack of standardized interfaces, leading to fragmented codebases that hindered collaboration and maintenance among research institutions; BLAS addressed this by proposing a de facto standard through its integration into widely adopted tools like LINPACK.[6] These efforts set the stage for subsequent expansions into higher-level operations, though the initial focus remained on foundational vector kernels.

Evolution of BLAS Levels and Standards
The Basic Linear Algebra Subprograms (BLAS) originated with Level 1 in 1979, providing a standardized set of vector-vector operations designed to serve as efficient building blocks for numerical software, particularly in response to the need for portable and optimized linear algebra routines. This initial specification, comprising 38 Fortran subprograms for operations such as dot products and vector scaling, was developed by C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh to address inefficiencies in ad hoc implementations within packages like LINPACK and EISPACK, enabling developers to tune performance for specific hardware while maintaining interface consistency.[7]

As computational demands grew with the advent of vector processors in the mid-1980s, the BLAS evolved to include Level 2 in 1985, introducing matrix-vector operations to better exploit these architectures' capabilities for higher throughput in linear algebra tasks. This extension built on Level 1 by adding routines for operations like matrix-vector multiplication and rank-one updates, formalized in a 1988 standardization proposal that emphasized portability across emerging vector and early parallel machines.[8] Level 3 followed in 1990, focusing on matrix-matrix operations to further optimize block-based algorithms on multiprocessor systems, driven by the integration needs of evolving numerical libraries like LAPACK for dense linear systems.[9]

A pivotal milestone in BLAS standardization came in 1988 with a comprehensive proposal by Dongarra et al., which outlined the unified framework for Levels 1 through 3, promoting widespread adoption by establishing a de facto interface for high-performance computing.[8] The BLAS Technical Forum, with meetings beginning in 1995, led to discussions in the late 1990s and addressed limitations in precision and data types, resulting in an updated standard published in 2002 that incorporated support for extended and double-precision complex numbers to enhance accuracy in advanced scientific applications.[10][11] This evolution reflected ongoing adaptations to hardware advancements, ensuring BLAS remained a foundational standard for linear algebra through the early 2000s.

Core Functionality
Level 1: Vector Operations
Level 1 BLAS routines provide the foundational operations for vector-vector computations in numerical linear algebra, focusing exclusively on manipulations between one or two vectors of length n without involving matrices. These operations exhibit O(n) time complexity, making them efficient for sequential data access and essential as kernels in more complex algorithms. Originally specified as a set of 38 Fortran subprograms, the core Level 1 routines encompass basic manipulations like scaling, copying, and inner products, designed to be portable across computing platforms while allowing vendor-specific optimizations.[12] The core routines include: vector scaling (SCAL), vector copy (COPY), vector swap (SWAP), vector addition with scaling (AXPY), dot product (DOT), Euclidean norm (NRM2), sum of absolute values (ASUM), and index of maximum magnitude (IAMAX). These routines support both real and complex data types, with interfaces following Fortran conventions where vectors are passed by reference and parameters specify the vector length n, the scalar α where applicable, the vectors x and y, and increments incx and incy for non-contiguous storage (defaulting to 1 for contiguous). If n ≤ 0, the routines return immediately without computation. All operations handle general increments to access vector elements at arbitrary strides, enhancing flexibility for strided data structures.[12]

SCAL multiplies each element of a vector x by a scalar α, updating x in place: x ← αx. The interface is CALL SCAL(n, α, x, incx), where x is modified. This routine is crucial for normalizing vectors or applying scaling factors in iterative processes.[12]

COPY copies the elements of vector x into vector y: y ← x. The interface is CALL COPY(n, x, incx, y, incy), where y receives the copy and x remains unchanged. It facilitates initialization or duplication of vectors for subsequent operations.[12]

SWAP interchanges the elements of two vectors: x ↔ y. The interface is CALL SWAP(n, x, incx, y, incy), exchanging contents between x and y. This supports efficient vector reordering without temporary storage.[12]

AXPY computes a linear combination of two vectors, adding a scaled x to y: y ← αx + y. The interface is CALL AXPY(n, α, x, incx, y, incy), where y is updated in place. As a fundamental update operation, it forms the basis for accumulation steps in many algorithms.[12]

DOT computes the inner product of two vectors: for real vectors, p = Σᵢ xᵢyᵢ; for complex vectors, variants provide the unconjugated product or the Hermitian (conjugated) form. The interface is p = DOT(n, x, incx, y, incy), returning the scalar result p. This routine is key for computing residuals or orthogonality measures.[12]

NRM2 calculates the Euclidean (L2) norm of a vector: r = √(Σᵢ |xᵢ|²). The interface is r = NRM2(n, x, incx), returning the scalar norm r. It avoids overflow by scaling during summation, providing a stable measure of vector magnitude.[12]

ASUM computes the sum of the absolute values of the vector elements: for real vectors, s = Σᵢ |xᵢ|; for complex vectors, it sums the absolute values of the real and imaginary parts separately. The interface is s = ASUM(n, x, incx), returning the scalar sum s. This L1-like measure is useful for error estimation or convergence checks.[12]

IAMAX finds the index of the element with the largest absolute value in the vector: the smallest index i such that |xᵢ| is maximal (for real vectors), or using the sum of the absolute real and imaginary parts (for complex vectors). The interface is i = IAMAX(n, x, incx), returning the integer index i (1-based). It aids in pivoting or identifying dominant components.[12]
These Level 1 routines serve as building blocks for higher-level numerical methods, particularly iterative solvers such as the conjugate gradient method, where operations like DOT for inner products, AXPY for updates, and NRM2 for residual norms enable efficient convergence checks and vector manipulations. Higher BLAS levels build upon these for matrix-involved computations, but Level 1 remains vital for their vector kernel implementations.[13]
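A short sketch of how these kernels combine inside one step of such an iterative method (conjugate-gradient flavoured); the helper name, the vector names, and the assumption that q already holds A·p are illustrative, not part of any BLAS interface.

```c
/* One solution/residual update built entirely from Level 1 calls. */
#include <cblas.h>

/* x := x + a*p and r := r - a*q, where q = A*p was computed elsewhere;
 * returns the new residual norm. */
double cg_style_update(int n, double *x, double *r,
                       const double *p, const double *q)
{
    double rr = cblas_ddot(n, r, 1, r, 1);   /* DOT:  r.r          */
    double pq = cblas_ddot(n, p, 1, q, 1);   /* DOT:  p.(A p)      */
    double a  = rr / pq;                     /* step length        */

    cblas_daxpy(n,  a, p, 1, x, 1);          /* AXPY: x := x + a p */
    cblas_daxpy(n, -a, q, 1, r, 1);          /* AXPY: r := r - a q */

    return cblas_dnrm2(n, r, 1);             /* NRM2: ||r||        */
}
```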
Level 2: Matrix-Vector Operations
Level 2 of the Basic Linear Algebra Subprograms (BLAS) encompasses routines designed for efficient matrix-vector operations, primarily involving dense and structured matrices interacting with single vectors. These operations build upon the vector manipulations of Level 1 by incorporating matrix structures, enabling computations such as multiplications and updates that are fundamental to algorithms in numerical linear algebra, including iterative solvers and preconditioners. The routines are optimized for memory access patterns, particularly in column-major storage order, where matrices are stored with a leading dimension (lda) that specifies the stride between columns to accommodate non-contiguous allocations.[14]

The general matrix-vector multiplication routine, GEMV, computes the linear combination y ← α·op(A)·x + β·y, where A is an m × n dense matrix, x is an input vector, y is an output vector, and α and β are scalars. This operation supports transposition of A via the parameter transA, which can be 'N' for no transpose, 'T' for transpose (op(A) = Aᵀ), or 'C' for conjugate transpose (op(A) = Aᴴ in complex cases), allowing flexible handling of row- or column-wise computations without explicit matrix reformatting. The computational complexity of GEMV is O(mn) for dense matrices, reflecting the need to access each element of A once per vector multiplication, which influences cache efficiency in implementations.[14]

For triangular matrices, the TRMV routine performs the multiplication x ← T·x (or its transpose/conjugate transpose variants), where T is an n × n unit or non-unit upper or lower triangular matrix specified by the uplo parameter ('U' or 'L') and diag flag ('U' for unit diagonal or 'N' otherwise). The related TRSV routine solves the triangular system T·x = b (again with transpose options), overwriting the right-hand side vector with the solution x, and is particularly useful in the forward or backward substitution phases of LU decomposition-based solvers. Optimizations in TRSV exploit the triangular structure to avoid divisions by unity on the diagonal when diag='U', keeping the operation count at O(n²) while improving numerical stability for well-conditioned systems. Both routines adhere to column-major storage with lda ≥ n.[14]

Rank-one update operations in Level 2 extend matrices using outer products of vectors. The GER routine updates a general matrix as A ← A + α·x·yᵀ, where x is an m-vector and y is an n-vector, providing a building block for low-rank adjustments in methods like Gram-Schmidt orthogonalization. For symmetric matrices, SYR computes A ← A + α·x·xᵀ on the upper or lower triangle (uplo-specified), storing the result symmetrically to halve memory access. The Hermitian variant, HER, handles complex Hermitian matrices with A ← A + α·x·xᴴ, where α is real and ᴴ denotes the conjugate transpose, essential for operations on positive-definite forms in quantum chemistry and signal processing. These updates have O(mn) complexity, matching GEMV, and support incx/incy strides for non-unit vector increments.[14]
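As a small illustration of the rank-one update interface through the C bindings, the sketch below applies GER to a 2 × 3 column-major matrix; the header name <cblas.h>, the matrix size, and the values are assumptions for the example only.

```c
/* Rank-one update via CBLAS: A := A + alpha * x * y^T,
 * with A stored column-major as a 2x3 matrix (lda = 2). */
#include <cblas.h>

void rank_one_update(double *A /* 2x3, column-major, lda = 2 */)
{
    double x[2] = {1.0, 2.0};
    double y[3] = {1.0, 0.0, -1.0};

    /* GER: A := A + 1.0 * x * y^T */
    cblas_dger(CblasColMajor, 2, 3, 1.0, x, 1, y, 1, A, 2);
}
```

Level 3: Matrix-Matrix Operations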
Level 3 BLAS routines provide high-performance interfaces for matrix-matrix operations, enabling efficient computation of products and updates between dense matrices, which form the foundation for many numerical algorithms in linear algebra.[1] These operations are designed to exploit modern computer architectures by minimizing data movement through block-based algorithms that optimize cache usage, contrasting with the lower levels' focus on vector or single-vector interactions.[15] The routines assume column-major storage for matrices, where the leading dimension parameter (e.g., LDA) specifies the distance in memory between consecutive columns, allowing for non-contiguous or rectangular submatrices without copying data.[16]

The cornerstone routine is the general matrix multiply (GEMM), which computes the product of two matrices with optional scaling and addition to a third. Specifically, for matrices A of size m × k, B of size k × n, and C of size m × n, GEMM performs C := α · op(A) · op(B) + β · C, where α and β are scalars, and op(X) is either X (no transpose), X^T (transpose), or X^H (conjugate transpose for complex data), controlled by parameters TRANSA and TRANSB.[17] This operation has a computational complexity of O(m n k) floating-point operations, making it dominant in dense linear algebra solvers where repeated calls can achieve near-peak hardware performance through blocked implementations that partition matrices into cache-friendly blocks.[15] Leading dimensions LDA, LDB, and LDC ensure flexibility for submatrix operations, with LDA ≥ m if TRANSA = 'N', or LDA ≥ k otherwise, to handle arbitrary matrix shapes.[16]

Triangular matrix multiply (TRMM) extends GEMM-like functionality to cases involving triangular matrices, computing either B := α · op(A) · B (left side) or B := α · B · op(A) (right side), where A is upper or lower triangular as specified by the UPLO parameter.[17] The SIDE parameter ('L' or 'R') determines the multiplication order, and op(A) follows the same transpose options as GEMM, with A typically square (n × n) and B rectangular (m × n).[16] Like GEMM, TRMM benefits from block algorithms that pack triangular blocks to reduce memory traffic, achieving high efficiency in factorized matrix operations.[15]

For symmetric and Hermitian matrices, which store only the upper or lower triangle to save space and exploit structure, dedicated routines avoid unnecessary computations on redundant elements.
The symmetric matrix-matrix multiply (SYMM) and Hermitian counterpart (HEMM) compute C := α · A · B + β · C, where A is symmetric (real) or Hermitian (complex), again with SIDE options for left ('L', A m × m, B and C m × n) or right ('R', A n × n, B and C m × n) multiplication.[17] Rank-k updates for symmetric/Hermitian matrices, SYRK and HERK, perform C := α · op(A) · op(A)^T + β · C (or conjugate transpose for HERK), updating the n × n result C from an n × k input A (k × n when transposed), with the transpose controlled by TRANS.[16] These have O(n²k) complexity and are optimized via blocked packing to reuse data in cache.[15] Rank-2k updates extend this further: SYR2K and HER2K compute C := α · op(A) · op(B)^T + α · op(B) · op(A)^T + β · C for two input matrices A and B (both n × k), producing the symmetric/Hermitian n × n C, with UPLO specifying the triangle updated and TRANS the operation on A and B.[17] Leading dimensions follow analogous rules to GEMM, ensuring compatibility with subarrays.[16] Block-based implementations for these structured operations, as pioneered in high-performance libraries, minimize data movement by packing panels and micro-kernels tuned to hardware caches, enabling GEMM-derived efficiencies across the suite.[15]
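A corresponding sketch of a structured Level 3 call through the C bindings: forming the lower triangle of a Gram matrix C := AᵀA with SYRK. The header name <cblas.h> and the argument choices are assumptions for illustration.

```c
/* SYRK: C := 1.0 * A^T * A + 0.0 * C, writing only the lower triangle.
 * A is k x n (column-major, lda = k); C is n x n (ldc = n). */
#include <cblas.h>

void gram_lower(int n, int k, const double *A, double *C)
{
    cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans,
                n, k, 1.0, A, k, 0.0, C, n);
}
```

Interface Specifications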
Data Types and Precisions
The Basic Linear Algebra Subprograms (BLAS) standard supports four fundamental data types, corresponding to single and double precisions for both real and complex numbers. These are denoted by the prefixes S for single-precision real, D for double-precision real, C for single-precision complex, and Z for double-precision complex in routine names.[10]

| Type Prefix | Precision | Representation | Bit Width |
|---|---|---|---|
| S | Single real | 32-bit floating-point | 32 bits |
| D | Double real | 64-bit floating-point | 64 bits |
| C | Single complex | Two 32-bit floats (real/imag) | 64 bits |
| Z | Double complex | Two 64-bit floats (real/imag) | 128 bits |
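As a sketch of how the four prefixes map onto one operation in the C bindings (the header name <cblas.h> is an assumption; complex scalars are passed by address in CBLAS):

```c
/* The axpy operation y := alpha*x + y in all four standard precisions. */
#include <complex.h>
#include <cblas.h>

void axpy_all_types(int n,
                    const float *sx, float *sy,
                    const double *dx, double *dy,
                    const float complex *cx, float complex *cy,
                    const double complex *zx, double complex *zy)
{
    float          sa = 2.0f;
    double         da = 2.0;
    float complex  ca = 2.0f;
    double complex za = 2.0;

    cblas_saxpy(n, sa, sx, 1, sy, 1);    /* S: single-precision real    */
    cblas_daxpy(n, da, dx, 1, dy, 1);    /* D: double-precision real    */
    cblas_caxpy(n, &ca, cx, 1, cy, 1);   /* C: single-precision complex */
    cblas_zaxpy(n, &za, zx, 1, zy, 1);   /* Z: double-precision complex */
}
```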
Naming Conventions and Calling Sequences
The naming conventions for BLAS routines follow a standardized scheme that encodes the data type and operation within a compact 6- to 7-character identifier, ensuring portability across implementations. The routine name begins with a single-character prefix indicating the precision and type: 'S' for single-precision real, 'D' for double-precision real, 'C' for single-precision complex, and 'Z' for double-precision complex. This is followed by a 3- to 6-character operation code describing the specific computation, such as 'GEMM' for general matrix-matrix multiplication or 'AXPY' for vector scaling and addition. For example, SGEMM denotes single-precision general matrix-matrix multiplication.[10] Calling sequences adhere to Fortran 77 conventions, using subroutine calls with parameters passed by reference in column-major order. A typical template for a Level 1 routine like vector operations is SUBROUTINE <name>(N, ALPHA, X, INCX, Y, INCY), where N specifies the vector dimension, ALPHA is a scalar multiplier, X and Y are input/output arrays, and INCX/INCY denote strides for non-contiguous storage, allowing flexible access to array elements. For Level 3 routines, such as matrix-matrix operations, the sequence expands to include matrix dimensions and leading dimensions, as in SUBROUTINE SGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC), computing C ← α A B + β C with TRANSA/TRANSB indicating transpose options, M/N/K as dimensions, and LDA/LDB/LDC as leading dimensions for storage efficiency.[10][19]
BLAS interfaces maintain Fortran 77 compatibility with fixed-format source and no modules, facilitating legacy integration, while C bindings are provided through wrappers like the CBLAS standard, which prepend 'cblas_' to names (e.g., cblas_sgemm) and pass dimensions as integers to align with C's conventions, including an Order parameter to specify row- or column-major storage order. Error handling lacks built-in exceptions; in the reference implementation, invalid arguments are reported through the XERBLA error handler, and routines such as the triangular solvers (e.g., STRSV) leave it to the caller to ensure that inputs are valid, for example that the triangular matrix is nonsingular.[10][19]
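A minimal sketch of the CBLAS calling sequence for SGEMM, showing the extra Order argument added by the C binding; the matrix sizes and the row-major layout are assumptions for the example.

```c
/* C (2x2) := 1.0 * A (2x3) * B (3x2) + 0.0 * C, row-major storage,
 * so the leading dimensions are the row lengths 3, 2, and 2. */
#include <cblas.h>

void small_sgemm(const float *A, const float *B, float *C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3, 1.0f, A, 3, B, 2, 0.0f, C, 2);
}
```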
