| AWK | |
|---|---|
| Paradigm | Scripting, procedural, data-driven[1] |
| Designed by | Alfred Aho, Peter Weinberger, and Brian Kernighan |
| First appeared | 1977 |
| Stable release | IEEE Std 1003.1-2008 (POSIX) / 1985 |
| Typing discipline | None; can handle strings, integers and floating-point numbers; regular expressions |
| OS | Cross-platform |
| Major implementations | awk, GNU Awk, mawk, nawk, MKS AWK, Thompson AWK (compiler), Awka (compiler) |
| Dialects | old awk (oawk) 1977, new awk (nawk) 1985, GNU Awk (gawk) |
| Influenced by | C, sed, SNOBOL[2][3] |
| Influenced | Tcl, AMPL, Perl, Korn Shell (ksh93, dtksh, tksh), Lua |

[Image: Usage of AWK in shell to check matching fields in two files]
AWK (/ɔːk/[4]) is a scripting language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter,[4] and it is a standard feature of most Unix-like operating systems.
The AWK language is a data-driven scripting language consisting of a set of actions to be taken against streams of textual data – either run directly on files or used as part of a pipeline – for purposes of extracting or transforming text, such as producing formatted reports. The language extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. While AWK has a limited intended application domain and was especially designed to support one-liner programs, the language is Turing-complete, and even the early Bell Labs users of AWK often wrote well-structured large AWK programs.[5]
AWK was created at Bell Labs in the 1970s,[6] and its name is derived from the surnames of its authors: Alfred Aho (author of egrep), Peter Weinberger (who worked on tiny relational databases), and Brian Kernighan. The acronym is pronounced the same as the name of the bird species auk, which is illustrated on the cover of The AWK Programming Language.[7] When written in all lowercase letters, as awk, it refers to the Unix or Plan 9 program that runs scripts written in the AWK programming language.
History
According to Brian Kernighan, one of the goals of AWK was to have a tool that would easily manipulate both numbers and strings. AWK was also inspired by Marc Rochkind's programming language that was used to search for patterns in input data, and was implemented using yacc.[8]
As one of the early tools to appear in Version 7 Unix, AWK added computational features to a Unix pipeline besides the Bourne shell, the only scripting language available in a standard Unix environment. It is one of the mandatory utilities of the Single UNIX Specification,[9] and is required by the Linux Standard Base specification.[10]
In 1983, AWK was one of several UNIX tools available for Charles River Data Systems' UNOS operating system under Bell Laboratories license.[11]
AWK was significantly revised and expanded in 1985–88, resulting in the GNU AWK implementation written by Paul Rubin, Jay Fenlason, and Richard Stallman, released in 1988.[12] GNU AWK may be the most widely deployed version[13] because it is included with GNU-based Linux packages. GNU AWK has been maintained solely by Arnold Robbins since 1994.[12] Brian Kernighan's nawk (New AWK) source was first released in 1993, without publicity, and has been publicly available since the late 1990s; many BSD systems use it to avoid the GPL license.[12]
AWK was preceded by sed (1974). Both were designed for text processing. They share the line-oriented, data-driven paradigm, and are particularly suited to writing one-liner programs, due to the implicit main loop and current line variables. The power and terseness of early AWK programs – notably the powerful regular expression handling and conciseness due to implicit variables, which facilitate one-liners – together with the limitations of AWK at the time, were important inspirations for the Perl language (1987). In the 1990s, Perl became very popular, competing with AWK in the niche of Unix text-processing languages.
Structure of AWK programs
AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
— Alfred V. Aho[14]
An AWK program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition is typically an expression and action is a series of commands. The input is split into records, where by default records are separated by newline characters so that the input is split into lines. The program tests each record against each of the conditions in turn, and executes the action for each expression that is true. Either the condition or the action may be omitted. The condition defaults to matching every record. The default action is to print the record. This is the same pattern-action structure as sed.
In addition to a simple AWK expression, such as foo == 1 or /^foo/, the condition can be BEGIN or END causing the action to be executed before or after all records have been read, or pattern1, pattern2 which matches the range of records starting with a record that matches pattern1 up to and including the record that matches pattern2 before again trying to match against pattern1 on subsequent lines.
In addition to normal arithmetic and logical operators, AWK expressions include the tilde operator, ~, which matches a regular expression against a string. As handy syntactic sugar, /regexp/ without using the tilde operator matches against the current record; this syntax derives from sed, which in turn inherited it from the ed editor, where / is used for searching. This syntax of using slashes as delimiters for regular expressions was subsequently adopted by Perl and ECMAScript, and is now common. The tilde operator was also adopted by Perl.
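The pattern-action structure, the BEGIN and END patterns, and the implicit record loop can be seen together in a short sketch (the input data here is hypothetical):

```shell
# Print a header, then the first field of every record whose second
# field exceeds 10, then a count of the matching records.
printf 'alpha 5\nbeta 12\ngamma 30\n' |
awk 'BEGIN   { print "over 10:" }
     $2 > 10 { print $1; n++ }
     END     { print n, "matches" }'
# Output:
# over 10:
# beta
# gamma
# 2 matches
```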
Commands
AWK commands are the statements that are substituted for action in the examples above. AWK commands can include function calls, variable assignments, calculations, or any combination thereof. AWK contains built-in support for many functions; many more are provided by the various flavors of AWK. Also, some flavors support the inclusion of dynamically linked libraries, which can also provide more functions.
The print command
The print command is used to output text. The output text is always terminated with a predefined string called the output record separator (ORS) whose default value is a newline. The simplest form of this command is:
print – This displays the contents of the current record. In AWK, records are broken down into fields, and these can be displayed separately:
print $1 – Displays the first field of the current record
print $1, $3 – Displays the first and third fields of the current record, separated by a predefined string called the output field separator (OFS), whose default value is a single space character
Although these fields ($X) may bear resemblance to variables (the $ symbol indicates variables in the usual Unix shells and in Perl), they actually refer to the fields of the current record. A special case, $0, refers to the entire record. In fact, the commands "print" and "print $0" are identical in functionality.
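For instance, given a hypothetical two-column-plus input, selected fields can be printed with the default OFS:

```shell
# Print the first and third fields of each record, separated by the
# default output field separator (a single space).
printf 'Jan 31 1000\nFeb 28 900\n' | awk '{ print $1, $3 }'
# Output:
# Jan 1000
# Feb 900
```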
The print command can also display the results of calculations and/or function calls:
/regex_pattern/ {
# Actions to perform in the event the record (line) matches the above regex_pattern
print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)
}
Output may be sent to a file:
/regex_pattern/ {
# Actions to perform in the event the record (line) matches the above regex_pattern
print "expression" > "file name"
}
or through a pipe:
/regex_pattern/ {
# Actions to perform in the event the record (line) matches the above regex_pattern
print "expression" | "command"
}
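The pipe form can be sketched as follows; sort here is just an example destination command:

```shell
# Every print is sent to the same sort process, which emits its
# output once awk finishes and closes the pipe.
printf 'pear\napple\nfig\n' | awk '{ print $0 | "sort" }'
# Output:
# apple
# fig
# pear
```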
Built-in variables
AWK's built-in variables include the field variables: $1, $2, $3, and so on ($0 represents the entire record). They hold the text or values in the individual text-fields in a record.
Other variables include:
- NR: Number of Records. Keeps a current count of the number of input records read so far from all data files. It starts at zero, but is never automatically reset to zero.[15]
- FNR: File Number of Records. Keeps a current count of the number of input records read so far in the current file. This variable is automatically reset to zero each time a new file is started.[15]
- NF: Number of Fields. Contains the number of fields in the current input record. The last field in the input record can be designated by $NF, the 2nd-to-last field by $(NF-1), the 3rd-to-last field by $(NF-2), etc.
- FILENAME: Contains the name of the current input file.
- FS: Field Separator. Contains the "field separator" used to divide fields in the input record. The default, "white space", allows any sequence of space and tab characters. FS can be reassigned to another character or character sequence to change the field separator.
- RS: Record Separator. Stores the current "record separator" character. Since, by default, an input line is the input record, the default record separator character is a "newline".
- OFS: Output Field Separator. Stores the "output field separator", which separates the fields when awk prints them. The default is a "space" character.
- ORS: Output Record Separator. Stores the "output record separator", which separates the output records when awk prints them. The default is a "newline" character.
- OFMT: Output Format. Stores the format for numeric output. The default format is "%.6g".
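A short sketch exercising several of these variables on hypothetical colon-separated input:

```shell
# -F: sets FS to ":". NR is the record number, NF the field count,
# and $NF the last field of each record.
printf 'root:x:0\nalice:x:1000\n' | awk -F: '{ print NR, NF, $1, $NF }'
# Output:
# 1 3 root 0
# 2 3 alice 1000
```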
Variables and syntax
Variable names can use any of the characters [A-Za-z0-9_], with the exception of language keywords, and cannot begin with a numeric digit. The operators + - * / represent addition, subtraction, multiplication, and division, respectively. For string concatenation, simply place two variables (or string constants) next to each other. It is optional to use a space in between if string constants are involved, but two variable names placed adjacent to each other require a space in between. Double quotes delimit string constants. Statements need not end with semicolons. Finally, comments can be added to programs by using # as the first character on a line, or behind a command or sequence of commands.
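Concatenation by adjacency can be sketched as:

```shell
# Adjacent strings and variables concatenate; there is no explicit
# concatenation operator.
awk 'BEGIN { a = "foo"; b = "bar"; print a b; print a "-" b }'
# Output:
# foobar
# foo-bar
```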
User-defined functions
In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example of a function.
function add_three(number) {
return number + 3
}
This statement can be invoked as follows:
(pattern) {
print add_three(36) # Outputs '''39'''
}
Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, to indicate where the parameters end and the local variables begin.
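A sketch of this convention, using a hypothetical sum function whose loop counter and accumulator are locals:

```shell
# i and s are local variables: they follow the real parameter n after
# extra spaces and are simply omitted at the call site.
awk 'function sum(n,   i, s) {
         for (i = 1; i <= n; i++)
             s += i
         return s
     }
     BEGIN { print sum(4) }'
# Output: 10
```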
Examples
Hello, World!
Here is the customary "Hello, World!" program written in AWK:
BEGIN {
print "Hello, world!"
exit
}
Print lines longer than 80 characters
Print all lines longer than 80 characters. The default action is to print the current line.
length($0) > 80
Count words
Count words in the input and print the number of lines, words, and characters (like wc):
{
words += NF
chars += length + 1 # add one to account for the newline character at the end of each record (line)
}
END { print NR, words, chars }
As there is no pattern for the first line of the program, every line of input matches by default, so the increment actions are executed for every line. words += NF is shorthand for words = words + NF.
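Run on a small hypothetical input, the program reports the same counts as wc:

```shell
printf 'one two\nthree\n' |
awk '{ words += NF; chars += length + 1 }
     END { print NR, words, chars }'
# Output: 2 3 14
```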
Sum last word
{ s += $NF }
END { print s + 0 }
s is incremented by the numeric value of $NF, which is the last word on the line as defined by AWK's field separator (by default, white-space). NF is the number of fields in the current line, e.g. 4. Since $4 is the value of the fourth field, $NF is the value of the last field in the line regardless of how many fields this line has, or whether it has more or fewer fields than surrounding lines. $ is actually a unary operator with the highest operator precedence. (If the line has no fields, then NF is 0, $0 is the whole line, which in this case is empty apart from possible white-space, and so has the numeric value 0.)
At the end of the input, the END pattern matches, so s is printed. However, since there may have been no lines of input at all, in which case no value has ever been assigned to s, s will be an empty string by default. Adding zero to a variable is an AWK idiom for coercing it from a string to a numeric value. This results from AWK's arithmetic operators, like addition, implicitly casting their operands to numbers before computation as required. (Similarly, concatenating a variable with an empty string coerces from a number to a string, e.g., s "". Note, there is no operator to concatenate strings, they are just placed adjacently.) On an empty input, the coercion in { print s + 0 } causes the program to print 0, whereas with just the action { print s }, an empty line would be printed.
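The effect of the coercion on empty input can be sketched as:

```shell
# With no records, s is never assigned; adding 0 coerces the empty
# string to the number 0 instead of printing an empty line.
printf '' | awk '{ s += $NF } END { print s + 0 }'
# Output: 0
```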
Match a range of input lines
NR % 4 == 1, NR % 4 == 3 { printf "%6d %s\n", NR, $0 }
The action statement prints each line numbered. The printf function emulates the standard C printf and works similarly to the print command described above. The pattern to match, however, works as follows: NR is the number of records, typically lines of input, AWK has so far read, i.e. the current line number, starting at 1 for the first line of input. % is the modulo operator. NR % 4 == 1 is true for the 1st, 5th, 9th, etc., lines of input. Likewise, NR % 4 == 3 is true for the 3rd, 7th, 11th, etc., lines of input. The range pattern is false until the first part matches, on line 1, and then remains true up to and including when the second part matches, on line 3. It then stays false until the first part matches again on line 5.
Thus, the program prints lines 1,2,3, skips line 4, and then 5,6,7, and so on. For each line, it prints the line number (on a 6 character-wide field) and then the line contents. For example, when executed on this input:
Rome Florence Milan Naples Turin Venice
The previous program prints:
1 Rome
2 Florence
3 Milan
5 Turin
6 Venice
Printing the initial or the final part of a file
As a special case, when the first part of a range pattern is constantly true, e.g. 1, the range will start at the beginning of the input. Similarly, if the second part is constantly false, e.g. 0, the range will continue until the end of input. For example,
/^--cut here--$/, 0
prints lines of input from the first line matching the regular expression ^--cut here--$, that is, a line containing only the phrase "--cut here--", to the end.
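For example, on a hypothetical four-line input:

```shell
# The range opens at the first matching line and, since 0 is never
# true, stays open until end of input.
printf 'a\n--cut here--\nb\nc\n' | awk '/^--cut here--$/, 0'
# Output:
# --cut here--
# b
# c
```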
Calculate word frequencies
Word frequency using associative arrays:
BEGIN {
FS="[^a-zA-Z]+"
}
{
for (i=1; i<=NF; i++)
words[tolower($i)]++
}
END {
for (i in words)
print i, words[i]
}
The BEGIN block sets the field separator to any sequence of non-alphabetic characters. Separators can be regular expressions. After that, we get to a bare action, which performs the action on every input line. In this case, for every field on the line, we add one to the number of times that word, first converted to lowercase, appears. Finally, in the END block, we print the words with their frequencies. The line
for (i in words)
creates a loop that goes through the array words, setting i to each subscript of the array. This is different from most languages, where such a loop goes through each value in the array. The loop thus prints out each word followed by its frequency count. tolower was an addition to the One True awk (see below) made after the book was published.
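Because for (i in words) visits subscripts in an unspecified order, a run of the program usually pipes the result through sort; here on a hypothetical one-line input:

```shell
printf 'the cat and the hat\n' |
awk 'BEGIN { FS = "[^a-zA-Z]+" }
     { for (i = 1; i <= NF; i++) words[tolower($i)]++ }
     END { for (i in words) print i, words[i] }' | sort
# Output:
# and 1
# cat 1
# hat 1
# the 2
```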
Match pattern from command line
This program can be represented in several ways. The first one uses the Bourne shell to make a shell script that does everything. It is the shortest of these methods:
#!/bin/sh
pattern="$1"
shift
awk '/'"$pattern"'/ { print FILENAME ":" $0 }' "$@"
The $pattern in the awk command is not protected by single quotes, so the shell expands the variable, but it needs to be put in double quotes to properly handle patterns containing spaces. A pattern by itself in the usual way checks to see if the whole line ($0) matches. FILENAME contains the current filename. awk has no explicit concatenation operator; two adjacent strings are concatenated. $0 expands to the original unchanged input line.
There are alternate ways of writing this. This shell script accesses the environment directly from within awk:
#!/bin/sh
export pattern="$1"
shift
awk '$0 ~ ENVIRON["pattern"] { print FILENAME ":" $0 }' "$@"
This is a shell script that uses ENVIRON, an array introduced in a newer version of the One True awk after the book was published. The subscript of ENVIRON is the name of an environment variable; its result is the variable's value. This is like the getenv function in various standard libraries and POSIX. The shell script makes an environment variable pattern containing the first argument, then drops that argument and has awk look for the pattern in each file.
~ checks to see if its left operand matches its right operand; !~ is its inverse. A regular expression is just a string and can be stored in variables.
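A sketch of matching against a regular expression held in a variable (here passed with the -v option rather than via the environment):

```shell
# pat is an ordinary string variable used as a dynamic regular expression.
printf 'foo\nbar\nfoobar\n' | awk -v pat='^foo' '$0 ~ pat'
# Output:
# foo
# foobar
```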
The next way uses command-line variable assignment, in which an argument to awk can be seen as an assignment to a variable:
#!/bin/sh
pattern="$1"
shift
awk '$0 ~ pattern { print FILENAME ":" $0 }' pattern="$pattern" "$@"
Alternatively, the -v var=value command-line option can be used (e.g., awk -v pattern="$pattern" ...).
Finally, this is written in pure awk, without help from a shell and without the need to know too much about the implementation of the awk script (as the command-line variable assignment version does), but it is a bit lengthy:
BEGIN {
pattern = ARGV[1]
for (i = 1; i < ARGC; i++) # remove first argument
ARGV[i] = ARGV[i + 1]
ARGC--
if (ARGC == 1) { # the pattern was the only thing, so force read from standard input (used by book)
ARGC = 2
ARGV[1] = "-"
}
}
$0 ~ pattern { print FILENAME ":" $0 }
The BEGIN is necessary not only to extract the first argument, but also to prevent it from being interpreted as a filename after the BEGIN block ends. ARGC, the number of arguments, is always guaranteed to be ≥1, as ARGV[0] is the name of the command that executed the script, most often the string "awk". ARGV[ARGC] is the empty string, "". # initiates a comment that expands to the end of the line.
Note the if block. awk checks whether it should read from standard input only before it runs the program. This means that
awk 'prog'
works only because the absence of filenames is checked before prog runs. If ARGC is explicitly set to 1, so that there are no arguments, awk simply quits because it concludes there are no more input files. In that case, reading from standard input must be requested explicitly, using the special filename -.
Self-contained AWK scripts
On Unix-like operating systems self-contained AWK scripts can be constructed using the shebang syntax.
For example, a script that sends the content of a given file to standard output may be built by creating a file named print.awk with the following content:
#!/usr/bin/awk -f
{ print $0 }
It can be invoked with: ./print.awk <filename>
The -f tells awk that the argument that follows is the file to read the AWK program from, which is the same flag that is used in sed. Since they are often used for one-liners, both these programs default to executing a program given as a command-line argument, rather than a separate file.
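The -f flag can also be used without a shebang; a sketch with a hypothetical program file name:

```shell
# Store a one-line program in a file, then tell awk to read it with -f.
printf '{ print NF }\n' > count_fields.awk
printf 'a b c\n' | awk -f count_fields.awk
# Output: 3
rm count_fields.awk
```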
Versions and implementations
AWK was originally written in 1977 and distributed with Version 7 Unix.
In 1985 its authors started expanding the language, most significantly by adding user-defined functions. The language is described in the book The AWK Programming Language, published 1988, and its implementation was made available in releases of UNIX System V. To avoid confusion with the incompatible older version, this version was sometimes called "new awk" or nawk. This implementation was released under a free software license in 1996 and is still maintained by Brian Kernighan (see external links below).[citation needed]
Old versions of Unix, such as UNIX/32V, included awkcc, which converted AWK to C. Kernighan wrote a program to turn awk into C++; its state is not known.[16]
- BWK awk, also known as nawk, refers to the version by Brian Kernighan. It has been dubbed the "One True AWK" because of the use of the term in association with the book that originally described the language and the fact that Kernighan was one of the original authors of AWK.[7] FreeBSD refers to this version as one-true-awk.[17] This version also has features not in the book, such as tolower and ENVIRON, which are explained above; see the FIXES file in the source archive for details. This version is used by, for example, Android, FreeBSD, NetBSD, OpenBSD, macOS, and illumos. Brian Kernighan and Arnold Robbins are the main contributors to a source repository for nawk: github.com/onetrueawk/awk
- gawk (GNU awk) is another free-software implementation and the only implementation that makes serious progress implementing internationalization and localization and TCP/IP networking. It was written before the original implementation became freely available. It includes its own debugger, and its profiler enables the user to make measured performance enhancements to a script. It also enables the user to extend functionality with shared libraries. Some Linux distributions include gawk as their default AWK implementation.[citation needed] As of version 5.2 (September 2022), gawk includes a persistent memory feature that can remember script-defined variables and functions from one invocation of a script to the next and pass data between unrelated scripts, as described in the Persistent-Memory gawk User Manual: www.gnu.org/software/gawk/manual/pm-gawk/
- mawk is a very fast AWK implementation by Mike Brennan based on a bytecode interpreter.
- libmawk is a fork of mawk, allowing applications to embed multiple parallel instances of awk interpreters.
- awka (whose front end is written atop the mawk program) is another translator of AWK scripts into C code. When compiled, statically including the author's libawka.a, the resulting executables are considerably sped up and, according to the author's tests, compare very well with other versions of AWK, Perl, or Tcl. Small scripts will turn into programs of 160–170 kB.
- tawk (Thompson AWK) is an AWK compiler for Solaris, MS-DOS, OS/2, and Windows, sold by Thompson Automation Software (defunct).[19]
- Jawk is a project to implement AWK in Java, hosted on SourceForge.[20] Extensions to the language are added to provide access to Java features within AWK scripts (i.e., Java threads, sockets, collections, etc.).
- xgawk is a fork of gawk[21] that extends gawk with dynamically loadable libraries. The XMLgawk extension was integrated into the official GNU Awk release 4.1.0.
- QSEAWK is an embedded AWK interpreter implementation included in the QSE library that provides embedding application programming interface (API) for C and C++.[22]
- libfawk is a very small, function-only, reentrant, embeddable interpreter written in C
- BusyBox includes an AWK implementation written by Dmitry Zakharov. This is a very small implementation suitable for embedded systems.
- CLAWK by Michael Parker provides an AWK implementation in Common Lisp, based upon the regular expression library of the same author.[23]
- goawk is an AWK implementation in Go with a few convenience extensions by Ben Hoyt, hosted on GitHub.
The gawk manual has a list of more AWK implementations.[24]
Books
- Aho, Alfred V.; Kernighan, Brian W.; Weinberger, Peter J. (1988-01-01). The AWK Programming Language. New York, NY: Addison-Wesley. ISBN 0-201-07981-X. Retrieved 2017-01-22.
- Aho, Alfred V.; Kernighan, Brian W.; Weinberger, Peter J. (2023-09-06). The AWK Programming Language, Second Edition. Hoboken, New Jersey: Addison-Wesley Professional. ISBN 978-0-13-826972-2. Archived from the original on 2023-10-27. Retrieved 2023-11-03.
- Dougherty, Dale; Robbins, Arnold (1997-03-01). sed & awk (2nd ed.). Sebastopol, CA: O'Reilly Media. ISBN 1-56592-225-5. Retrieved 2009-04-16.
- Robbins, Arnold (2001-05-15). Effective awk Programming (3rd ed.). Sebastopol, CA: O'Reilly Media. ISBN 0-596-00070-7. Retrieved 2009-04-16.
- Robbins, Arnold (2000). Effective Awk Programming: A User's Guide for Gnu Awk (1.0.3 ed.). Bloomington, IN: iUniverse. ISBN 0-595-10034-1. Archived from the original on 12 April 2009. Retrieved 2009-04-16.
See also
References
- ^ Stutz, Michael (September 19, 2006). "Get started with GAWK: AWK language fundamentals" (PDF). developerWorks. IBM. Archived (PDF) from the original on 2015-04-27. Retrieved 2015-01-29.
[AWK is] often called a data-driven language -- the program statements describe the input data to match and process rather than a sequence of program steps
- ^ Andreas J. Pilavakis (1989). UNIX Workshop. Macmillan International Higher Education. p. 196.
- ^ Arnold Robbins (2015). Effective Awk Programming: Universal Text Processing and Pattern Matching (4th ed.). O'Reilly Media. p. 560.
- ^ a b James W. Livingston (May 2, 1988). "The Great awk Program is No Birdbrain". Digital Review. p. 91.
- ^ Raymond, Eric S. "Applying Minilanguages". The Art of Unix Programming. Case Study: awk. Archived from the original on July 30, 2008. Retrieved May 11, 2010.
The awk action language is Turing-complete, and can read and write files.
- ^ Aho, Alfred V.; Kernighan, Brian W.; Weinberger, Peter J. (September 1, 1978). Awk — A Pattern Scanning and Processing Language (Second Edition) (Technical report). Unix Seventh Edition Manual, Volume 2. Bell Telephone Laboratories, Inc. Retrieved February 1, 2020.
- ^ a b Aho, Alfred V.; Kernighan, Brian W.; Weinberger, Peter J. (1988). The AWK Programming Language. Addison-Wesley Publishing Company. ISBN 9780201079814. Retrieved 16 May 2015.
- ^ "UNIX Special: Profs Kernighan & Brailsford". Computerphile. September 30, 2015. Archived from the original on 2021-11-22.
- ^ "The Single UNIX Specification, Version 3, Utilities Interface Table". Archived from the original on 2018-01-05. Retrieved 2005-12-18.
- ^ "Chapter 15. Commands and Utilities". Linux Standard Base Core Specification 4.0 (Technical report). Linux Foundation. 2008. Archived from the original on 2019-10-16. Retrieved 2020-02-01.
- ^ The Insider's Guide To The Universe (PDF). Charles River Data Systems, Inc. 1983. p. 13.
- ^ a b c Robbins, Arnold (March 2014). "The GNU Project and Me: 27 Years with GNU AWK" (PDF). skeeve.com. Archived (PDF) from the original on October 6, 2014. Retrieved October 4, 2014.
- ^ Dougherty, Dale; Robbins, Arnold (1997). sed & awk (2nd ed.). Sebastopol, CA: O'Reilly. p. 221. ISBN 1-565-92225-5.
- ^ Hamilton, Naomi (May 30, 2008). "The A-Z of Programming Languages: AWK". Computerworld. Archived from the original on 2020-02-01. Retrieved 2008-12-12.
- ^ a b "Records". GAWK: Effective AWK Programming: A User's Guide for GNU Awk (5.3 ed.). September 2024. Retrieved 2025-01-24.
- ^ Kernighan, Brian W. (April 24–25, 1991). An AWK to C++ Translator (PDF). Usenix C++ Conference. Washington, DC. pp. 217–228. Archived (PDF) from the original on 2020-06-22. Retrieved 2020-02-01.
- ^ "FreeBSD's work log for importing BWK awk into FreeBSD's core". May 16, 2005. Archived from the original on September 8, 2013. Retrieved September 20, 2006.
- ^ "CSV Processing With gawk (using the gawk-csv extension)". gawkextlib. 2018. Archived from the original on 2020-03-25.
- ^ James K. Lawless (May 1, 1997). "Examining the TAWK Compiler". Dr. Dobb's Journal. Archived from the original on February 21, 2020. Retrieved February 21, 2020.
- ^ "Jawk at SourceForge". Archived from the original on 2007-05-27. Retrieved 2006-08-23.
- ^ "xgawk Home Page". Archived from the original on 2013-04-18. Retrieved 2013-05-07.
- ^ "QSEAWK at GitHub". GitHub. Archived from the original on 2018-06-11. Retrieved 2017-09-06.
- ^ "CLAWK at GitHub". GitHub. Archived from the original on 2021-08-25. Retrieved 2021-06-01.
- ^ "B.5 Other Freely Available awk Implementations". GAWK: Effective AWK Programming: A User's Guide for GNU Awk (5.3 ed.). September 2024. Retrieved 2025-01-24.
Further reading
- Andy Oram (May 19, 2021). "Awk: The Power and Promise of a 40-Year-Old Language". Fosslife. Retrieved June 9, 2021.
- Hamilton, Naomi (May 30, 2008). "The A-Z of Programming Languages: AWK". Computerworld. Retrieved 2008-12-12. – Interview with Alfred V. Aho on AWK
- Robbins, Daniel (2000-12-01). "Awk by example, Part 1: An intro to the great language with the strange name". Common threads. IBM DeveloperWorks. Retrieved 2009-04-16.
- Robbins, Daniel (2001-01-01). "Awk by example, Part 2: Records, loops, and arrays". Common threads. IBM DeveloperWorks. Retrieved 2009-04-16.
- Robbins, Daniel (2001-04-01). "Awk by example, Part 3: String functions and ... checkbooks?". Common threads. IBM DeveloperWorks. Archived from the original on 19 May 2009. Retrieved 2009-04-16.
- AWK – Become an expert in 60 minutes
- awk: pattern scanning and processing language – Shell and Utilities Reference, The Single UNIX Specification, Version 5 from The Open Group
- awk – Linux User Manual – User Commands from Manned.org
External links
- The Amazing Awk Assembler by Henry Spencer.
- "AWK (formerly) at Curlie". Curlie. Archived from the original on 2022-03-18.
- awklang.org The site for things related to the awk language
- Awk Community Portal at the Wayback Machine (archived 2016-04-03)
printf function, making it concise for one-liners or short scripts.[3] AWK's design emphasizes simplicity and efficiency for data-driven tasks, evolving from early UNIX tools like sed and grep to become a standard utility for textual data manipulation.[2]
Standardized in POSIX as a utility for executing programs specialized in textual data manipulation, AWK remains widely available across Unix-like systems.[3] Implementations include the original Bell Labs AWK, the enhanced "new awk" (nawk), and the feature-rich GNU AWK (gawk), which adds extensions like TCP/IP networking while maintaining backward compatibility.[4] Its enduring utility lies in rapid prototyping for tasks such as log analysis, column-based data processing, and generating summaries from large datasets, often invoked from the command line with options for file input and variable assignment.[4]
History
Origins at Bell Labs
AWK was developed in 1977 at AT&T Bell Laboratories by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger as a quick tool for data manipulation tasks in the Unix environment. The language emerged from the need for a simple, efficient way to process text files, allowing users to write short programs—often just one or two lines—for common operations like scanning and extracting information. This initial effort was driven by the authors' desire to create a utility that could handle both textual patterns and numerical computations seamlessly, addressing limitations in existing tools. The early implementation of AWK functioned primarily as a filter in Unix pipelines, enabling it to process streams of data line by line and perform actions based on specified patterns. It drew inspiration from ad hoc tools such as SNOBOL for advanced string processing and pattern matching, including features like associative arrays, as well as the text manipulation utilities sed and grep for their pattern-based editing and searching capabilities. AWK's design emphasized a pattern-action paradigm, where users could define conditions (patterns) and corresponding operations (actions), making it particularly suited for rapid prototyping of text-processing scripts without the overhead of full programming languages. AWK's first public release occurred in 1978 as part of Unix Version 7, marking its availability to a broader community of Unix developers and users.[5] This version included core features like regular expression matching, relational operators on fields and variables, and built-in arithmetic and string functions, allowing for straightforward information retrieval from files.[5] In its early days, AWK found immediate application in tasks such as generating reports from structured data and extracting specific information from log files or datasets, proving invaluable for everyday data-processing needs at Bell Labs. 
These use cases highlighted its strength in handling field-oriented text analysis, where input lines could be automatically split into variables for conditional processing and output formatting.
Evolution and Standardization
In 1985, Peter J. Weinberger and Brian W. Kernighan released an enhanced version of AWK, commonly known as "new AWK" or nawk, which significantly expanded the language's capabilities to address user demands for more advanced programming features.[2] This update introduced user-defined functions, support for multiple input streams via the getline function, computed regular expressions, and a suite of new built-in functions including atan2, cos, exp, log, sin, rand, srand, and string manipulation tools like gsub, index, match, split, sprintf, sub, substr, tolower, and toupper.[2] Additionally, new control structures such as do-while loops and the delete statement for arrays were added, along with keywords like function and return, enhancing AWK's expressiveness for complex data processing tasks.[6] The publication of The AWK Programming Language in 1988 by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger further solidified the language's design and served as its definitive reference. Authored by AWK's original creators and published by Addison-Wesley, the book detailed the nawk dialect, providing comprehensive explanations of its syntax, semantics, and practical applications, which helped establish a consistent understanding and widespread adoption among programmers. By documenting the evolved features from the 1985 update, it bridged the gap between the original 1977 implementation and modern usage, influencing subsequent implementations and educational resources. 
AWK achieved formal standardization through the POSIX Command Language and Utilities specification in 1992 (IEEE Std 1003.2-1992), which defined a portable subset of the language based primarily on the 1985 nawk version to ensure interoperability across Unix-like systems.[7] This standard clarified ambiguities in earlier implementations, such as field splitting behavior with FS=" ", ARGC/ARGV handling, and the use of /dev/stdin for standard input, while mandating core features like the pattern-action paradigm and built-in functions for basic text processing.[7] Subsequent revisions, including POSIX.1-2001 and POSIX.1-2008 (which incorporated utilities into the base specifications), introduced minor refinements such as improved numeric-string conversions and additional command-line options for strict compliance, but preserved the core nawk foundation without major syntactic changes. Further updates in POSIX.1-2017 and POSIX.1-2024 continued this trend, with the latter clarifying the rules for conversions between numbers and strings, enhancing portability by aligning AWK with evolving system interfaces and ensuring its reliability in diverse environments.[3][8]
Fundamentals
Program Structure
An AWK program consists of a sequence of pattern-action pairs, where each pair specifies a condition under which a set of actions is performed, along with optional special blocks such as BEGIN and END for initialization and finalization tasks.[9] These components form the basic syntax, with rules typically separated by newlines and actions enclosed in curly braces.[9] Programs may also define user-defined functions to encapsulate reusable code, enhancing modularity.[10] The core of the program revolves around these pattern-action pairs, which drive the data processing logic.[9] AWK is invoked from the command line using the syntax awk [options] 'program' [file ...], where the program is provided as a string enclosed in single quotes to prevent shell interpretation, and optional input files are specified afterward.[9] Key options include -F fs to define the field separator (overriding the default FS variable) and -f progfile to read the program from an external file instead of the command line.[9] For standalone scripts, a shebang line at the beginning of the file, such as #!/usr/bin/awk -f, enables direct execution as a script by specifying the interpreter and the -f option to load the program from the script itself.[10]
During execution, AWK reads input sequentially as records—by default, one per line—splits each record into fields using the field separator FS (which defaults to a single space, treating consecutive whitespace as one separator), and processes the fields through the program's rules.[9] The flow begins with any BEGIN blocks executed once for setup, followed by evaluation of each record against the patterns in the order they appear, executing matching actions, and concludes with END blocks run once after all input is consumed.[9] This line-by-line, field-oriented approach ensures efficient handling of structured text data.[9]
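The execution order described above can be seen in a minimal sketch; the two input lines are invented sample data:

```shell
# BEGIN runs once before input, the middle rule runs per record,
# and END runs once after the last record is read.
printf 'alpha beta\ngamma delta epsilon\n' | awk '
BEGIN { print "start" }       # setup, before any record
      { print NR, NF }        # record number and field count per line
END   { print "done:", NR }   # NR retains the final record count
'
```

This prints start, then 1 2 and 2 3 for the two records, and finally done: 2.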
Pattern-Action Paradigm
The pattern-action paradigm forms the core of AWK programming, where each rule consists of an optional pattern followed by an action block, enabling selective processing of input records from files or standard input.[11] Developed by Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan at Bell Labs in the late 1970s, this mechanism allows AWK to scan input line by line, testing each record against patterns to determine when to execute corresponding actions. Patterns define conditions for matching, while actions specify operations on matched data, providing a concise way to filter and transform text streams.[12] Patterns in AWK can take several forms to match input records flexibly. Regular expression patterns, delimited by forward slashes, match lines containing substrings that conform to the specified regular expression syntax.[11] Relational expressions use comparison operators to evaluate conditions, such as inequalities or equalities involving fields (derived from splitting the input record by the field separator), variables, or constants.[12] More complex conditions arise from combining patterns using logical operators, including conjunction (&&), disjunction (||), and negation (!), allowing compound matching rules. Action blocks, enclosed in curly braces, contain one or more AWK statements executed only when the associated pattern matches the current input record; these statements may include assignments to variables, function calls, or output operations like printing.[12] If a pattern is omitted from a rule, the action applies to every input record processed.[11] Conversely, if an action is omitted, AWK defaults to printing the matching record unchanged. AWK also supports special patterns for initialization and cleanup outside the main input loop. 
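The pattern forms described here — regular expression, relational, and compound — can be combined in one short program; the log-style input below is hypothetical:

```shell
# Each rule's pattern is tested against every record; all matching rules run.
printf 'error 5\nok 20\nerror 30\n' | awk '
/error/            { errs++ }   # regular expression pattern
$2 > 10            { big++ }    # relational expression on a field
/error/ && $2 > 10 { both++ }   # compound pattern with &&
END                { print errs, big, both }
'
```

For this input the counts come out as 2 2 1.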
The BEGIN pattern triggers its action block before any input records are read, ideal for setting initial variables or field separators.[12] The END pattern executes its action after all input has been processed, suitable for summarizing results or final output.[11] These blocks ensure predictable execution order in AWK programs.
Language Elements
Variables and Expressions
In AWK, variables are implicitly declared upon first assignment and do not require explicit type specification, allowing them to hold either numeric or string values depending on context.[11][13] Numeric values are treated as floating-point numbers, with automatic conversion between integers and floats as needed, while strings are sequences of characters. Uninitialized variables default to the empty string, which is numerically equivalent to 0.[13] For example, the assignment x = 5 sets x to the numeric value 5, and x = "hello" sets it to the string "hello".[14]
String conversion from numbers occurs implicitly in certain contexts, such as concatenation, or explicitly using the sprintf function with a format specifier like %g for general numeric output.[13] AWK performs automatic type coercion: a numeric string like "3.14" converts to a number when used in arithmetic, and numbers convert to strings via the output format OFMT (default "%g") or CONVFMT (default "%.6g") for precision control.[13] String concatenation is achieved by juxtaposing expressions, as in prefix = "file" suffix; filename = prefix "name" suffix, resulting in "filename".[11][14]
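A small sketch of these conversion and concatenation rules:

```shell
awk 'BEGIN {
    n = "3.14"                    # a numeric string
    print n + 1                   # arithmetic context: converted to a number
    prefix = "file"; suffix = ".txt"
    name = prefix "name" suffix   # concatenation by juxtaposition
    print name
}'
```

This prints 4.14 followed by filename.txt.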
AWK provides special field variables for accessing input data: $0 represents the entire input record (line), $1 through $NF denote individual fields separated by the field separator FS (default whitespace), and NF holds the number of fields in the current record.[13] Assigning to $n (e.g., $2 = "newvalue") modifies the field and updates $0 accordingly; assigning beyond NF extends the record and increases NF.[11] For instance, { print $1, $NF } outputs the first and last fields of each line.[14]
Expressions in AWK support arithmetic operations including addition (+), subtraction (-), multiplication (*), division (/), modulus (%), and exponentiation (^), following standard precedence with left-to-right associativity for equal precedence.[13] Compound assignments like +=, -=, etc., are also available. An example is { total += $3 * $4 }, which accumulates the product of the third and fourth fields.[11]
Relational operators (<, <=, ==, !=, >, >=) compare numbers numerically or strings lexicographically, with automatic type conversion where possible, while pattern-matching operators (~, !~) test strings against regular expressions.[13] Logical operators include negation (!), conjunction (&&), and disjunction (||), with short-circuit evaluation. For example, if ($1 > 10 && $2 != "error") print $0 prints lines where the first field exceeds 10 and the second is not "error".[14]
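The combined test from this paragraph, run against three invented records:

```shell
# Prints only records whose first field exceeds 10 and whose second
# field is not the string "error"; && short-circuits left to right.
printf '12 ok\n5 ok\n15 error\n' | awk '{ if ($1 > 10 && $2 != "error") print $0 }'
```

Only 12 ok survives both conditions.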
Built-in Functions and Variables
AWK provides a set of predefined variables that store information about the input data, processing state, and command-line arguments, enabling scripts to access metadata without explicit user code. These built-in variables are automatically maintained by the AWK interpreter and can be read or modified as needed.[3] Key built-in variables include NR, which holds the total number of input records processed so far, starting from 1 and incrementing with each record read. FNR tracks the record number within the current input file, resetting to 1 when a new file is opened. FILENAME contains the name of the current input file being processed. Field separators are managed by FS (input field separator, defaulting to any sequence of whitespace) and OFS (output field separator, defaulting to a single space). Command-line handling is supported by ARGC, the number of arguments passed to the AWK program, and ARGV, an array of those arguments indexed from 0 to ARGC-1, where ARGV[0] is the program name and subsequent elements are input files or options. Other notable variables are NF (number of fields in the current record), ORS (output record separator, defaulting to newline), and RS (input record separator, defaulting to newline).[3]
For string manipulation, AWK includes several built-in functions that operate on text data. The length() function returns the number of characters in its argument string (or $0 if none provided). substr(s, m, n) extracts a substring from string s starting at position m (1-based index) with length n, or to the end if n is omitted. index(s, t) finds the first occurrence of substring t in s and returns its 1-based position, or 0 if not found. Case conversion is handled by tolower(s), which returns s with all uppercase letters changed to lowercase, and toupper(s), which does the opposite. These functions facilitate common text processing tasks, such as extracting portions of fields or normalizing case.
For example, to print the length of each line: { print length($0) }.[3]
Mathematical operations are supported through arithmetic built-in functions that perform standard computations. int(x) truncates its numeric argument x toward zero to yield an integer value. sqrt(x) computes the square root of x. exp(x) returns e raised to the power of x, while log(x) yields the natural logarithm of x. The rand() function generates a pseudo-random number between 0 (inclusive) and 1 (exclusive); it is seeded automatically but can be influenced by the srand() function. These enable numerical analysis within AWK scripts, such as calculating distances or scaling values. For instance, to compute the square root of the first field: { print sqrt($1) }. Note that AWK treats unquoted numbers as numeric and performs automatic type conversion.[3]
Input/output operations are enhanced by dedicated built-in functions for reading, writing, and controlling streams. getline reads the next input record from the standard input (or a specified file or command) into the variable $0, updating NF, NR, and FNR; variants include getline var to store in a user variable or command | getline for piping output from a shell command. close(expression) closes a file or pipe opened via redirections or getline, preventing resource leaks in loops. printf(fmt, expr, ...) formats and outputs values according to a format string fmt, similar to C's printf, supporting specifiers like %s for strings and %d for integers; it does not add a newline by default. These functions allow flexible data ingestion and precise output control beyond the simple print statement. An example using getline from a file: while ((getline < "data.txt") > 0) print $1.[3]
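A sketch of getline reading from a file combined with printf formatting; the file data.txt is created here only for the demonstration and removed afterward:

```shell
printf 'one 1\ntwo 2\n' > data.txt
awk 'BEGIN {
    # getline var < file reads a whole line into var without splitting it
    while ((getline line < "data.txt") > 0)
        printf "line: %s\n", line   # printf supplies no newline itself
    close("data.txt")               # release the file handle explicitly
}'
rm -f data.txt
```

Closing the file with close() matters when the same file is reopened later or when many files are read in a loop.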
Control Flow Statements
AWK provides a set of control flow statements that enable conditional execution and repetition within action blocks, allowing programs to implement complex logic based on data patterns. These constructs are derived from the original implementation by Aho, Weinberger, and Kernighan in 1977, with standardization in POSIX ensuring portability across Unix-like systems.[11][3] The if-else statement evaluates a condition and executes one of two possible statements. Its syntax is if (expression) statement [else statement], where the expression must evaluate to a non-zero or non-null value to be considered true. If the condition is true, the first statement executes; otherwise, if present, the else statement executes. This construct supports nested conditions for multi-way branching when chained. For example:
if ($1 > 0) {
    print "Positive"
} else {
    print "Non-positive"
}
The while loop, while (expression) statement, repeatedly executes the statement as long as the expression is true, checking the condition before each iteration. The do-while loop, do statement while (expression), executes the statement at least once before evaluating the condition, making it suitable for post-check scenarios. Both loops support compound statements enclosed in braces for multiple actions. The while loop appeared in the original AWK, while do-while was added in later standardized versions.[3][11]
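The difference between the two loop forms shows up when the condition is false at entry; a minimal sketch:

```shell
awk 'BEGIN {
    i = 1; s = ""
    while (i <= 3) { s = s i; i++ }   # condition checked before each pass
    print "while:", s
    j = 10
    do { print "do-while ran once, j=" j } while (j < 10)  # body runs first
}'
```

The while loop accumulates 123; the do-while body executes once even though j < 10 is already false.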
The for loop offers two forms for iteration. The traditional form, for (expression1; expression2; expression3) statement, initializes with expression1, checks expression2 before each iteration, and increments with expression3 after. Omitting parts defaults to while-like behavior. The array form, for (variable in array) statement, iterates over array indices, assigning each to the variable sequentially. This is particularly useful for processing associative arrays without knowing their size in advance. Both forms have been core to AWK since the original design, enhancing efficiency in data manipulation tasks.[3][11]
AWK includes statements to alter loop execution: break, continue, and next. The break statement exits the innermost enclosing while, do-while, or for loop immediately. The continue statement skips the rest of the current iteration and proceeds to the next one in the innermost loop. The next statement terminates processing of the current input record, skips any remaining actions or patterns for it, and advances to the next record, effectively restarting the main loop. These were introduced in the original AWK to provide fine-grained control without full program exit.[3][11]
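Of these, next is the most AWK-specific; this sketch uses it to drop comment lines before the counting rule (input invented):

```shell
# next abandons the current record, so the counting rule below never
# sees lines that begin with "#".
printf '# header\ndata 1\n# note\ndata 2\n' | awk '
/^#/ { next }       # skip comment records
     { count++ }    # reached only for data records
END  { print count }
'
```

Two of the four input lines are counted.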
Although not part of the POSIX standard, the switch statement is available in GNU AWK (gawk) as an extension for multi-way selection. Its syntax is switch (expression) { case value: statements; ... [default: statements] }, where cases are checked in order for exact matches, and execution falls through until a break or the block ends. A default case handles unmatched values. This construct, inspired by C, improves readability for multiple discrete conditions but requires gawk-specific invocation for portability.[15]
Input and Output
AWK processes input data by reading from various sources and dividing it into records and fields for manipulation. Input can come from standard input (stdin), explicitly named files provided as command-line arguments, or pipes from preceding commands in a shell pipeline. When files are specified on the command line, they are placed in the ARGV array (excluding the program name and options), and AWK reads them sequentially, treating each non-empty ARGV element as a filename; the special value "-" denotes standard input. Modifications to ARGV during execution can alter the input sources dynamically, allowing flexible control over file processing.[3] Records in AWK are the fundamental units of input, separated by the record separator defined in the built-in variable RS, which defaults to a single newline character, treating each line as a record. The entire record is available in the variable $0, while fields within it are split by the field separator FS (default: whitespace). Setting RS to an empty string ("") changes the behavior to treat sequences of one or more blank lines (newlines followed by empty lines) as the separator, enabling paragraph-mode processing where each block of non-blank lines forms a record; this is useful for handling unstructured text like documents. In POSIX AWK, multi-character values for RS use only the first character as the record separator, while GNU AWK supports the full string or regular expressions for more precise delimiting.[3][16] Output in AWK is produced using the print statement for simple, unformatted emission or printf for controlled formatting akin to C's printf. The print statement outputs its arguments (or $0 if none) separated by the output field separator OFS (default: space) and terminated by the output record separator ORS (default: newline), directing results to standard output by default. 
In contrast, printf uses a format string to specify exact layout, such as alignment and precision, without automatic separators or terminators, making it ideal for tabular or numeric displays. Both statements support redirection to alter destinations.[3][17] AWK employs automatic buffering for output efficiency, where data is accumulated in buffers before writing to destinations like files or pipes; this is particularly noticeable in non-interactive modes or when output is redirected, potentially delaying visibility until the buffer fills or the program ends. The fflush() function forces immediate flushing of buffers for a specified file, pipe, or all outputs (when called without arguments), ensuring timely delivery in scenarios like real-time processing or pipelined commands. Buffering behavior can be controlled in GNU AWK via the PROCINFO array, such as setting PROCINFO["BUFFERPIPE"] to disable line buffering for pipes. Redirection extends output flexibility by allowing print or printf to target files or external commands instead of standard output. The operator > followed by a filename opens (or truncates if existing) the file for writing, with subsequent uses appending to it; >> explicitly appends without truncation, creating the file if absent. For piping, | followed by a command string sends output to that command via a one-way pipe, invoking it through a system call like popen(); the pipe remains open until explicitly closed with close() or the program terminates. These operations evaluate the redirection expression to a string pathname or command, and multiple redirections can be active simultaneously, limited only by system resources in POSIX-compliant implementations.[3][18] GNU AWK extends POSIX with two-way I/O for coprocesses, enabling bidirectional communication via the |& operator. 
This creates a pair of pipes to a subprocess, allowing output with print |& "command" and input via "command" |& getline, facilitating interactive or parallel processing like sorting or external computations. As a non-POSIX extension, it requires GNU AWK and may involve buffering considerations to avoid deadlocks, with dedicated close() modes ("to" for output, "from" for input) to terminate connections properly.[19]
Advanced Topics
User-Defined Functions
AWK allows users to define custom functions to promote code modularity and reusability within programs. These functions can encapsulate specific logic, accept parameters, and return values, enabling more structured scripting similar to procedural programming languages. User-defined functions are invoked like built-in ones and can appear anywhere in the AWK program where statements are allowed, though they are typically placed before their first use for clarity.[8][20] The syntax for defining a user-defined function follows this form:
function name([parameter-list])
{
body-of-function
}
Here, name is the function's identifier, which must begin with a letter or underscore and consist of letters, digits, and underscores. The optional parameter-list includes comma-separated argument names followed by any local variable names, conventionally set off by extra whitespace for readability (e.g., function name(arg1, arg2,    local1, local2)). The body-of-function contains AWK statements that execute when the function is called, and a return statement can optionally specify a value to return; without it, the function returns zero or the empty string depending on context. POSIX AWK requires the keyword function and disallows using predefined variable names (like FS) or other function names as parameters.[8][20]
Variable scope in user-defined functions ensures locality for parameters and declared locals, which shadow any global variables of the same name during execution but do not affect the globals afterward. Parameters and locals are initialized to the empty string (or zero in numeric contexts) if unassigned and exist only for the function's duration. All other variables in the program remain accessible as globals within the function body. Control flow statements like if, while, and for can be used inside functions to direct execution.[20]
Recursion is supported in AWK, allowing a function to call itself, either directly or indirectly through other functions, with the call stack managing nested invocations up to implementation limits. Function calls can be nested, and recursive calls enable solutions to problems like tree traversals or factorial computations.[8][21]
Arguments to user-defined functions are passed by value for scalars, meaning copies of the values are made and modifications within the function do not affect the caller's variables. However, if an array name is passed as a parameter, it is passed by reference, allowing the function to modify the original array. This distinction facilitates efficient handling of complex data while protecting simple values.[8][21]
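This by-value versus by-reference distinction can be demonstrated directly; mutate, x, a, n, and arr are hypothetical names:

```shell
awk '
function mutate(x, a) {
    x = 99            # scalar parameter: changes a local copy only
    a["k"] = "set"    # array parameter: modifies the original array
}
BEGIN {
    n = 1
    mutate(n, arr)
    print n, arr["k"]
}'
```

The output is 1 set: the scalar is untouched, while the array change is visible to the caller.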
For example, consider a simple function to compute the absolute value of a number, which can be invoked in a pattern-action rule:
function abs(num) {
if (num < 0)
return -num
else
return num
}
{ print abs($1) }
This function takes a single argument, num, performs a conditional calculation, and returns the result. When applied to input lines, it outputs the absolute value of the first field for each record, demonstrating how user-defined functions integrate with AWK's core processing paradigm.[20][22]
Arrays and Data Structures
In AWK, arrays serve as the primary data structure for dynamic storage and manipulation of data, functioning as associative arrays where elements are stored and retrieved using string-based indices. Unlike traditional arrays in other languages that require fixed sizes or numeric indices, AWK arrays are implicitly declared and grow dynamically as elements are added, with no need for explicit initialization or dimension specification.[3][23] Arrays in AWK are inherently associative, meaning indices can be any string expression, including literals, variables, or computed values; numeric indices are automatically converted to their string representations for storage and access. For example, assigning arr[1] = "foo" stores the value under the index "1", allowing flexible key-value pairing suitable for tasks like counting occurrences or mapping data. Iteration over array elements occurs via a for loop construct, such as for (var in arr), which traverses all indices in an unspecified order, enabling processing of associative data without predefined structure.[3][23]
Multidimensional arrays are simulated in AWK by concatenating indices into a single string using the built-in variable SUBSEP (default value implementation-defined, often a comma followed by a non-printable character), so arr[1, "subkey"] is equivalent to arr[1 SUBSEP "subkey"]. This approach allows emulation of higher dimensions without native support for true nested arrays in standard AWK, though GNU Awk (gawk) extends this with capabilities for arrays of arrays. To remove elements, the delete statement is used, as in delete arr[idx], which clears a specific entry or, when applied in a loop over all indices, empties the entire array; omitting the index deletes all elements.[3][23]
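A short sketch of simulated two-dimensional indexing, the (i, j) in array membership test, and delete; grid and its keys are invented:

```shell
awk 'BEGIN {
    grid[1, "a"] = "x"        # stored under "1" SUBSEP "a"
    grid[2, "b"] = "y"
    if ((1, "a") in grid)     # composite-key membership test
        print "found"
    delete grid[2, "b"]       # remove one element
    n = 0
    for (k in grid) n++       # iterate over the remaining indices
    print "remaining:", n
}'
```

This prints found and then remaining: 1.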
GNU Awk introduces additional array functions beyond POSIX standards, including length(arr) to return the number of elements in the array and sorting functions asort(arr) and asorti(arr). The asort function sorts the values of arr into a new array (or in place if specified), returning the number of elements, while asorti sorts the indices themselves, useful for ordered traversal of associative keys. These extensions enhance AWK's utility for data organization tasks requiring enumeration or sequencing.[23]
Regular Expressions and Patterns
In AWK, regular expressions provide a powerful mechanism for pattern matching, drawing from the Extended Regular Expressions (ERE) defined in the POSIX standard. These expressions describe sets of strings and are integral to selecting and manipulating text data. AWK implements ERE with C-style escape conventions, supporting internationalization and features like interval expressions for repetition.[3][24] Regular expressions in AWK are most commonly specified as literals enclosed in forward slashes, denoted as /pattern/, which matches any input record whose text belongs to the set defined by the pattern. For instance, /foo/ matches any record containing the substring "foo". The syntax incorporates standard metacharacters: the period . matches any single character except the null character, ^ anchors the match to the beginning of the string, $ to the end, | enables alternation between alternatives, parentheses () group subexpressions, and square brackets [] define character classes to match any one of a specified set of characters. To treat these metacharacters literally, they are escaped with a backslash, such as \. for a literal period, \^ for a literal caret, or \( for a literal parenthesis.[3][24]
Within bracket expressions, AWK supports POSIX character classes for more portable and locale-aware matching, using the notation [[:class:]]; examples include [[:alpha:]] for alphabetic characters, [[:digit:]] for decimal digits, [[:space:]] for whitespace, and [[:punct:]] for punctuation. These classes enhance readability and adaptability across different character encodings, as defined in the POSIX ERE specification. Bracket expressions also allow negated classes with [^ ] and ranges like [a-z]. AWK's ERE support extends to repetition operators such as * for zero or more, + for one or more, and ? for zero or one, as well as bounded repetition {m,n}.[24][3]
To test whether a string or field matches a regular expression, AWK employs the binary operators ~ (matches) and !~ (does not match), which return 1 for true and 0 for false; for example, $1 ~ /pattern/ evaluates to true if the first field matches the pattern. A regex literal /regex/ serves as a shorthand for a pattern that matches the entire input record, usable directly in conditional contexts. In the pattern-action paradigm, such expressions select records to trigger associated actions.[3]
AWK provides built-in variables to access details of regex matches. The RSTART variable stores the 1-based index of the first character of the matched substring, while RLENGTH holds the length of that substring in characters; if no match occurs, RSTART is set to 0 and RLENGTH to -1. These variables are automatically updated by the match() function and are part of the POSIX standard. In GNU Awk (gawk), an extension variable RT captures the exact text that matched the record separator RS when it is defined as a regular expression, facilitating precise record boundary handling.[3][25]
The match(string, ere) function searches the specified string for the longest leftmost substring matching the extended regular expression ere, returning the 1-based position of the match or 0 if none is found, and it sets RSTART and RLENGTH accordingly. For text replacement, sub(ere, replacement [, target]) substitutes the first non-overlapping occurrence of the regex ere in the optional target string (defaulting to $0) with replacement, where & in the replacement represents the matched text; it returns 1 if a substitution occurred or 0 otherwise. The global variant gsub(ere, replacement [, target]) performs substitutions on all non-overlapping matches, returning the total number performed. Both functions adhere to POSIX ERE semantics.[3][26]
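The three functions in action on an invented string:

```shell
awk 'BEGIN {
    s = "abc123def123"
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH   # position and length of the first "123"
    n = gsub(/123/, "#", s)     # replace every occurrence in s
    print n, s
}'
```

match reports position 4 and length 3; gsub performs 2 substitutions, leaving abc#def#.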
Additionally, the split(string, array [, fieldsep]) function divides string into elements of array using fieldsep as a regular expression delimiter (defaulting to the field separator FS if omitted), returning the number of array elements created; empty fields are included unless the delimiter matches the empty string at the start or end. In gawk, an optional fourth argument seps can store the matched separators in another array, providing finer control over parsing. This regex-based splitting supports complex tokenization tasks within AWK's ERE framework.[3][26]
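A sketch of regex-based splitting that tolerates optional spaces around commas; the color list is invented sample data:

```shell
awk 'BEGIN {
    n = split("red, green ,blue", parts, / *, */)   # ERE delimiter
    print n                        # number of elements created
    print parts[1], parts[2], parts[3]
}'
```

This yields 3 elements — red, green, and blue — with the stray spaces absorbed by the delimiter pattern.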
Practical Examples
Basic Scripts
AWK's basic scripts demonstrate its core pattern-matching and action capabilities through simple, self-contained programs that process text input line by line. These introductory examples highlight how AWK can execute actions unconditionally, based on patterns, or prior to input processing, making it accessible for quick text manipulation tasks.[27] A fundamental "Hello World" program in AWK uses the BEGIN pattern to print a message before any input is read. The script BEGIN { print "Hello, World!" } outputs "Hello, World!" to standard output without requiring input files, illustrating AWK's ability to run initialization code independently of data processing.[28]
To print specific fields from input lines, AWK relies on its default field-splitting behavior, where whitespace separates fields into numbered variables like $1 for the first field and $NF for the last field. For instance, the script `{ print $1, $NF }` processes each line of input and outputs the first and last fields separated by a space, useful for extracting key elements from structured text such as logs or delimited files.[29]
Simple filtering employs regular expression patterns to select lines for processing. The script /pattern/ { print } matches lines containing "pattern" and prints the entire line ($0), providing an efficient way to grep-like search without external tools. For example, /error/ { print } would output only lines with the word "error".[30]
AWK supports command-line patterns for immediate execution as one-liners, such as awk '/error/' filename, which directly applies the pattern and action to the specified file without needing a script file. This contrasts with full scripts saved in files (e.g., using awk -f script.awk), which allow for multi-line programs and reusability across multiple inputs, while one-liners suit ad-hoc queries.[31]
Common Text Processing Tasks
AWK is widely used for common text processing tasks in Unix-like environments, such as filtering lines based on length, counting elements in input, performing simple arithmetic on columns, matching patterns, and selecting portions of files, leveraging its pattern-action paradigm for efficient stream processing.[3] These tasks often rely on built-in variables like NR for the current record number and NF for the number of fields in the current record.[32] One frequent operation is printing lines longer than a specified length, such as 80 characters, which helps identify overly verbose entries in logs or documents. The AWK program `length($0) > 80 { print }` achieves this by evaluating the length of each input line ($0 represents the entire line) and printing those that exceed the threshold.[33] This approach is particularly useful for enforcing formatting standards in text files.[3]
Counting lines and words in a file is another staple task, providing quick summaries of document structure. To count total lines and words (treating whitespace-separated fields as words), the script { words += NF } END { print NR " lines, " words " words" } accumulates the field count per line and reports the totals after processing all input.[29] Here, NR tracks the overall line count, while the summation of NF yields the word total.[3]
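Run against sample input, the counting script behaves as follows:

```shell
# Accumulate NF per line; report line and word totals at end of input.
printf 'one two\nthree four five\n' | awk '
    { words += NF }
    END { print NR " lines, " words " words" }'
# Prints: 2 lines, 5 words
```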
Summing values in a specific column, such as numeric data in reports, demonstrates AWK's arithmetic capabilities for basic aggregation. For instance, { sum += $2 } END { print sum } adds the second field's value ($2) for each line and outputs the result at the end of input processing.[34] This is commonly applied to tasks like totaling sales figures from delimited files.[3]
Pattern-based line selection mimics tools like grep, enabling selective output without external dependencies. The command /pattern/ { print $0 } prints entire lines ($0) that match the regular expression "pattern," such as /error/ to extract error messages from logs.[35] POSIX AWK supports extended regular expressions (EREs) for such matching, ensuring portability across systems.[3]
Simulating head and tail functionality allows extracting the first or last few lines of a file for previews or summaries. For the first 10 lines, use NR <= 10 { print }; for the last 10, a two-pass approach first determines the total line count (for example with wc -l, or with an AWK pass that records NR), then filters in a second invocation with NR > total - 10 { print }.[32] This method keeps processing linear for large files when the total is precomputed.[3]
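A sketch of the two-pass tail simulation described above, using -v to pass the precomputed total into the second pass (the file name /tmp/nums.txt is illustrative):

```shell
# Pass 1: count the lines; pass 2: print only the last three.
seq 1 10 > /tmp/nums.txt
total=$(wc -l < /tmp/nums.txt)
awk -v total="$total" 'NR > total - 3 { print }' /tmp/nums.txt
# Prints: 8, 9, 10 on separate lines
```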
Complex Data Analysis
AWK supports complex data analysis by leveraging associative arrays for aggregation tasks, such as computing word frequencies from textual input. In this approach, each unique word serves as an index in the array, with its value representing the occurrence count, allowing efficient summarization without external storage. For instance, the following script processes input lines, normalizes words by converting to lowercase and removing punctuation, and increments counts in the `freq` array for each field:
{
$0 = tolower($0) # remove case distinctions
gsub(/[^[:alnum:]_[:blank:]]/, "", $0) # remove punctuation
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
The accumulated counts are then reported, one word per line, in the END block, facilitating tasks like text statistics or lexicon building.[36]
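Running the word-frequency script above on sample text illustrates the aggregation; since `for (word in freq)` traverses the array in unspecified order, the output is piped through sort for a stable display:

```shell
printf 'The cat and the dog.\nThe cat.\n' | awk '{
    $0 = tolower($0)                          # normalize case
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)    # strip punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | sort
# Prints (tab-separated, one per line): and 1, cat 2, dog 1, the 3
```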
Processing data from multiple files enables cross-source aggregation in AWK, where the built-in FILENAME variable tracks the current input file, allowing scripts to detect transitions and accumulate results like totals or summaries. For example, to sum a numeric value from a specific field across files while noting the source, a script might use FILENAME in an action to append file-specific identifiers to running totals stored in arrays. GNU AWK extends this with BEGINFILE and ENDFILE patterns for per-file initialization and finalization, such as resetting counters or printing file-specific subtotals before aggregating globally in END; this is particularly useful for distributed log analysis or merging datasets.[37]
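A portable sketch of per-file aggregation using only the POSIX FILENAME variable (the gawk-only BEGINFILE/ENDFILE patterns are not required for this); the file names are illustrative:

```shell
# Sum the first column of each input file, keyed by FILENAME.
printf '1\n2\n' > /tmp/a.txt
printf '10\n20\n' > /tmp/b.txt
awk '{ sum[FILENAME] += $1 }
     END { for (f in sum) print f, sum[f] }' /tmp/a.txt /tmp/b.txt | sort
# Prints: /tmp/a.txt 3
#         /tmp/b.txt 30
```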
Range patterns in AWK facilitate extracting and analyzing contiguous sections of input, matching records from a beginning pattern until an ending one, which is ideal for processing structured documents like reports or logs. The syntax /start/, /end/ { print } selects all lines starting from the first matching /start/ until the first subsequent /end/, including both delimiters, enabling targeted analysis of bounded data segments without manual line tracking. This mechanism turns on upon encountering the start pattern and remains active until the end pattern, supporting scenarios like isolating error blocks in system outputs for further aggregation.[38]
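For instance, a range pattern can isolate a delimited block (the START/END markers here are illustrative):

```shell
# Print everything from the START marker through the END marker, inclusive.
printf 'x\nSTART\na\nb\nEND\ny\n' | awk '/START/, /END/ { print }'
# Prints: START, a, b, END on separate lines
```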
For CSV-like data, AWK handles comma-separated values by setting the field separator to a comma via the -F"," option or BEGIN { FS = "," }, which splits records into fields for numerical or statistical computations. Calculations can then operate on these fields, such as summing values in a column: { total += $2 } END { print total } accumulates the second field's numeric content across records, useful for deriving aggregates like averages or totals from tabular data without loading entire datasets into memory. Associative arrays can store these results keyed by categories in other fields, enhancing multidimensional analysis.[39]
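A minimal sketch for simple (unquoted) CSV, summing and averaging the second column:

```shell
# -F',' splits records on commas; END reports total and mean of column 2.
printf 'jan,10\nfeb,20\nmar,30\n' | awk -F',' '
    { total += $2 }
    END { printf "total=%d avg=%.1f\n", total, total / NR }'
# Prints: total=60 avg=20.0
```

Note that plain FS-based splitting handles only simple CSV; fields containing quoted commas need additional parsing logic.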
GNU AWK's asort() function enhances report generation by sorting associative arrays by value, producing ordered outputs for summarized data. After aggregation, such as populating an array with metrics, asort(source, dest) copies the values of source into dest, sorted in ascending order under sequential integer indices from 1 to the returned element count, allowing traversal from lowest to highest for ranked reports. For example:
BEGIN {
data["jan"] = 15; data["feb"] = 10; data["mar"] = 20
n = asort(data, sorted)
for (i = 1; i <= n; i++)
print i, sorted[i]
}
Running this prints the three values in ascending order (10, 15, 20), each preceded by its new sequential index; with the two-argument form, the source array data is left unchanged.
Implementations
Original and POSIX AWK
The original AWK was developed in 1977 at Bell Laboratories by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger as a text-processing tool for the Unix operating system, first distributed with Version 7 Unix. It featured a simple syntax centered on patterns and actions for scanning input lines, with built-in variables such as $0 for the entire record, $n for fields, NF for the number of fields, and NR for the record number, alongside basic arithmetic operators and control structures like if and loops. The language included a limited set of built-in functions, such as length for string size, index for substring position, substr for extraction, and a basic split function that divided strings on single characters, without support for regular-expression separators or later additions like strftime for time formatting.[2]
POSIX AWK, standardized in IEEE Std 1003.1 (first published in 1988 and updated in subsequent revisions), defines a minimal, portable subset of the language required for compliance, building on the original while mandating specific features for consistency across Unix-like systems. It requires arithmetic functions including atan2(y, x) for arctangent, cos and sin for trigonometry, exp and log for exponentials, sqrt for square root, int for truncation, and rand/srand for random numbers; string functions such as sub and gsub for single and global regular-expression substitution, index and match for searching, length for size, split for field division (now accepting an extended regular expression as the separator), sprintf for formatting, substr for extraction, and tolower/toupper for case conversion; and I/O facilities like close for file handles, system for command execution, and the various forms of getline for input control. Key variables include CONVFMT (defaulting to "%.6g" for converting numbers to strings), FS for the input field separator (defaulting to whitespace and interpretable as a single character or extended regular expression), OFS for the output field separator (default space), ORS for the output record separator (default newline), RS for the input record separator (default newline), and others such as NF, NR, FILENAME, and SUBSEP for multidimensional array indexing. POSIX also constrains certain constructs to ensure predictable behavior; for example, the delete statement is valid only on arrays and array elements, not on scalar variables.[3]
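A few of the POSIX-mandated string functions in one program (a sketch; any POSIX-conforming awk should behave identically):

```shell
echo "Hello, AWK world" | awk '{
    print toupper($1)          # case conversion -> HELLO,
    print index($0, "AWK")     # 1-based substring position -> 8
    print substr($0, 8, 3)     # extraction -> AWK
    n = gsub(/o/, "0")         # global substitution, returns match count
    print n, $0                # -> 2 Hell0, AWK w0rld
}'
```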
AWK implementations adhering to the original or POSIX specifications are widely available on Unix-like systems, promoting portability for scripts that avoid vendor-specific features; however, subtle differences persist in FS handling, such as how multiple consecutive whitespace characters or null fields are treated when FS is set to a regular expression, though POSIX-compliant versions standardize whitespace as one or more spaces/tabs without creating empty fields between them.[3]
Both original and POSIX AWK exhibit key limitations suited to their era and design focus on text processing: they provide no built-in support for networking operations like socket connections or protocol handling, relying instead on external system calls for such needs, and employ fixed-precision floating-point arithmetic (typically IEEE 754 double precision with about 15 decimal digits of accuracy) without options for arbitrary precision or extended numerical range.[3] In 1985, the nawk implementation enhanced the original by adding user-defined functions and support for multiple input streams, influencing the POSIX baseline.[2]
GNU Awk (gawk)
GNU Awk, commonly known as gawk, is the GNU Project's implementation of the AWK programming language, designed to be fully compatible with the POSIX standard while incorporating numerous extensions for enhanced functionality.[2] Development of gawk began in 1986, initiated by Paul Rubin with contributions from Jay Fenlason, who completed the initial implementation, and advice from Richard Stallman; additional code was provided by John Woods.[2] The project has evolved continuously, with version 5.3.2 released on April 6, 2025, introducing refinements to existing features and bug fixes.[41] A notable addition in version 5.2.0, released in September 2022, is the persistent memory feature, which allows storage of variables, arrays, and user-defined functions in a file for reuse across script invocations, simplifying stateful scripting and potentially improving performance in iterative tasks.[42]
Gawk extends the core AWK language with several powerful features not found in the POSIX specification. Time-related functions, such as mktime(), enable conversion of textual date representations into timestamps, supporting operations like date comparisons and adjustments; this function has been available since early versions, with enhancements like an optional second argument added in version 2.13. The @include directive, introduced in version 4.0, facilitates modular programming by allowing inclusion of external AWK source files, equivalent to the -i command-line option for loading libraries.[43] Debugging support, enhanced in version 4.0 with a rewritten debugger accessible via the -D flag, permits stepping through code, setting breakpoints, and inspecting variables during execution.
For numerical precision, gawk integrates the MPFR library starting from version 4.1.0, providing arbitrary-precision floating-point arithmetic with configurable precision and rounding modes, enabled with the -M (--bignum) command-line option and controlled through the PREC and ROUNDMODE variables.[44] Other extensions include the switch statement for structured control flow (enabled by default in 4.0) and multidimensional arrays via arrays of arrays (also from 4.0), which support complex data structures beyond one-dimensional indexing.[45]
Performance in gawk has been optimized for handling large files and datasets, with compiler-like optimizations enabled by default since version 4.2, including instruction scheduling and common subexpression elimination to reduce execution time; these can be disabled with --no-optimize if needed.[45] Its buffered I/O and efficient pattern matching make it suitable for processing gigabyte-scale inputs without excessive memory usage. Gawk is available as a standalone GNU package, portable across Unix-like systems, and supports Windows through environments like Cygwin and MSYS2, ensuring broad cross-platform compatibility.[4]
Alternative Implementations
Several alternative implementations of AWK exist, tailored for specific environments such as performance optimization, portability, or resource-constrained systems, diverging from the dominant GNU Awk in focus and features.[46] Mawk, developed by Mike Brennan, is a lightweight interpreter emphasizing speed for processing large datasets, often outperforming other AWK variants in text manipulation tasks due to its byte-code interpretation approach.[47][48] It adheres strictly to POSIX standards without proprietary extensions, making it suitable for environments requiring standard compliance and minimal overhead.[6] Maintenance shifted to Thomas E. Dickey in 2009, with updates continuing into the 2020s to incorporate fixes from distributions like Debian.[47] Nawk represents the original "new AWK" enhancements introduced in the mid-1980s for BSD Unix and Plan 9, extending the core language with functions like strftime for time formatting while maintaining compatibility with earlier AWK scripts.[49][6] Its source code, derived from Bell Labs developments, is available on platforms like GitHub for ports and builds, supporting modern Unix-like systems.[50][51]
For embedded and resource-limited systems, BusyBox includes a compact AWK implementation optimized for minimal footprint, providing core pattern-matching and text-processing capabilities within a single executable that bundles multiple Unix utilities.[52] This version prioritizes size over completeness, omitting some advanced features to fit constrained environments like IoT devices, while still handling basic POSIX AWK operations efficiently.[53][46]
Java-based implementations like Jawk enhance portability by running AWK scripts within the Java Virtual Machine, allowing seamless integration into cross-platform applications without native dependencies.[54] Originally developed by John D. A. Thompson, Jawk supports standard AWK syntax plus Java extensions for object access, and variants like those from Hoijui maintain active development for embedding in Java projects.[55][56]
These alternatives exhibit compatibility variations, particularly in extensions; for instance, mawk lacks support for coprocesses (two-way pipes), a feature absent in POSIX AWK but present in some enhanced variants, which can affect scripts relying on bidirectional communication.[57][58] Overall, they ensure broad script portability for standard tasks while trading advanced capabilities for efficiency or specialization.[46]
References
- https://wiki.alpinelinux.org/wiki/Awk