C string handling
The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. For character strings, the standard library uses the convention that strings are null-terminated: a string of n characters is represented as an array of n + 1 elements, the last of which is a "NUL character" with numeric value 0.
The only support for strings in the programming language proper is that the compiler translates quoted string constants into null-terminated strings.
Definitions
A string is defined as a contiguous sequence of code units terminated by the first zero code unit (often called the NUL code unit).[1] This means a string cannot contain the zero code unit, as the first one seen marks the end of the string. The length of a string is the number of code units before the zero code unit.[1] The memory occupied by a string is always one more code unit than the length, as space is needed to store the zero terminator.
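The relationship between length and storage can be sketched in a few lines of C. The helper names `string_length` and `string_storage` are hypothetical and exist only for illustration; the first mirrors what the standard `strlen` computes.

```c
#include <stddef.h>

/* Counts the code units before the first zero code unit,
   which is the definition of a string's length. */
size_t string_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Storage occupied by the string: its length plus one code unit
   for the zero terminator. */
size_t string_storage(const char *s)
{
    return string_length(s) + 1;
}
```

For the literal "hello", the length is 5 while the array holding it occupies 6 code units, which is also what `sizeof "hello"` reports.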
Generally, the term string means a string where the code unit is of type char, which is 8 bits on virtually all modern machines. C90 defines wide strings[1] which use a code unit of type wchar_t, which is 16 or 32 bits on modern machines. This was intended for Unicode, but it is increasingly common to use UTF-8 in normal strings for Unicode instead.
Strings are passed to functions by passing a pointer to the first code unit. Since char* and wchar_t* are different types, the functions that process wide strings are different from the ones processing normal strings and have different names.
String literals ("text" in the C source code) are converted to arrays during compilation.[2] The result is an array of code units containing all the characters plus a trailing zero code unit. In C90 L"text" produces a wide string. A string literal can contain the zero code unit (one way is to put \0 into the source), but this will cause the string to end at that point. The rest of the literal is still placed in memory (with another zero code unit added to the end), but functions that process strings never reach those code units, so such a literal does not denote a single string.[3]
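A small sketch of a literal with an embedded zero code unit shows the difference between the array the compiler creates and the string that functions see. The names `embedded`, `embedded_array_size` and `embedded_string_length` are illustrative only.

```c
#include <stddef.h>
#include <string.h>

/* The array holds 8 code units: 'a','b','c',0,'d','e','f',0. */
static const char embedded[] = "abc\0def";

/* Size of the whole array, including both zero code units. */
size_t embedded_array_size(void)
{
    return sizeof embedded;
}

/* Length as seen by string functions, which stop at the first zero. */
size_t embedded_string_length(void)
{
    return strlen(embedded);
}
```

Here the array size is 8 but the string length is 3; the bytes for "def" are present in memory yet invisible to `strlen`.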
Character encodings
Each string ends at the first occurrence of the zero code unit of the appropriate kind (char or wchar_t). Consequently, a byte string (char*) can contain non-NUL characters in ASCII or any ASCII extension, but not characters in encodings such as UTF-16 (even though a 16-bit code unit might be nonzero, its high or low byte might be zero). The encodings that can be stored in wide strings are defined by the width of wchar_t. In most implementations, wchar_t is at least 16 bits, and so all 16-bit encodings, such as UCS-2, can be stored. If wchar_t is 32 bits, then 32-bit encodings, such as UTF-32, can be stored. (The standard requires a "type that holds any wide character", which on Windows no longer holds true since the UCS-2 to UTF-16 shift. This was recognized as a defect in the standard and fixed in C++.)[4] C++11 and C11 add two types with explicit widths, char16_t and char32_t.[5]
Variable-width encodings can be used in both byte strings and wide strings. String length and offsets are measured in bytes or wchar_t, not in "characters", which can be confusing to beginning programmers. UTF-8 and Shift JIS are often used in C byte strings, while UTF-16 is often used in C wide strings when wchar_t is 16 bits. Truncating strings with variable-width characters using functions like strncpy can produce invalid sequences at the end of the string. This can be unsafe if the truncated parts are interpreted by code that assumes the input is valid.
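The gap between bytes and characters, and the truncation hazard, can be illustrated with a sketch for UTF-8. Both helpers below are hypothetical (not standard library functions) and assume the input is already valid UTF-8; they rely only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Counts UTF-8 code points by counting every byte that is not a
   continuation byte (10xxxxxx). Assumes s is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Copies src into dst (capacity dstsize bytes), truncating on a
   code point boundary so no partial sequence is left at the end.
   Always null-terminates when dstsize > 0. Returns bytes copied. */
size_t utf8_copy_truncate(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);
    if (dstsize == 0)
        return 0;
    if (len >= dstsize) {
        len = dstsize - 1;
        /* If the first dropped byte is a continuation byte, the cut
           falls inside a sequence: back up to its lead byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

For "caf\xC3\xA9" (the UTF-8 bytes of "café"), `strlen` reports 5 bytes while `utf8_code_points` reports 4; copying into a 5-byte buffer keeps only "caf" instead of leaving a dangling lead byte.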
Support for Unicode literals such as char foo[512] = "φωωβαρ"; (UTF-8) or wchar_t foo[512] = L"φωωβαρ"; (UTF-16 or UTF-32, depending on wchar_t) is implementation-defined,[6] and may require that the source code be in the same encoding, especially for char, where compilers might just copy whatever is between the quotes. Some compilers or editors will require entering all non-ASCII characters as \xNN sequences for each byte of UTF-8, and/or \uNNNN for each word of UTF-16. Since C11 (and C++11), a new literal prefix u8 is available that guarantees UTF-8 for a byte string literal, as in char foo[512] = u8"φωωβαρ";.[7] C++20 and C23 added a char8_t type that is meant to store UTF-8 code units, and changed the types of u8-prefixed character and string literals to char8_t and char8_t[] respectively.
Features
Terminology
In historical documentation the term "character" was often used instead of "byte" for C strings, which leads many to believe that these functions somehow do not work for UTF-8. In fact all lengths are defined as being in bytes and this is true in all implementations, and these functions work as well with UTF-8 as with single-byte encodings. The BSD documentation has been fixed to make this clear, but POSIX, Linux, and Windows documentation still uses "character" in many places where "byte" or "wchar_t" is the correct term.
Functions for handling memory buffers can process sequences of bytes that include a null byte as part of the data. Names of these functions typically start with mem, as opposed to the str prefix.
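The difference between the str and mem families shows up as soon as the data contains a null byte. The two wrapper names below are hypothetical; they only wrap the standard `strcmp` and `memcmp`.

```c
#include <stddef.h>
#include <string.h>

/* Compares as strings: stops at the first null byte in either input. */
int equal_as_strings(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Compares as buffers: examines exactly n bytes, nulls included. */
int equal_as_buffers(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}
```

Given the arrays "ab\0cd" and "ab\0xy", the string comparison reports them equal because both are the string "ab", while the buffer comparison over the full array size detects the differing bytes after the null.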
Headers
Most of the functions that operate on C strings are declared in the string.h header (cstring in C++), while functions that operate on C wide strings are declared in the wchar.h header (cwchar in C++). These headers also contain declarations of functions used for handling memory buffers; the name is thus something of a misnomer.
Functions declared in string.h are extremely popular since, as a part of the C standard library, they are guaranteed to work on any platform which supports C. However, some security issues exist with these functions, such as potential buffer overflows when not used carefully and properly, leading programmers to prefer safer and possibly less portable variants, some popular ones of which are listed below. Some of these functions also violate const-correctness by accepting a const string pointer and returning a non-const pointer within the string. To correct this, some have been separated into two overloaded functions in the C++ version of the standard library.
Constants and types
| Name | Notes |
|---|---|
| NULL | Macro expanding to the null pointer constant; that is, a constant representing a pointer value which is guaranteed not to be a valid address of an object in memory. |
| wchar_t | Type used for a code unit in "wide" strings. The C standard only requires that wchar_t be wide enough to hold the widest character set among the supported system locales[8] and be greater or equal in size to char.[9] On Windows, the only platform to use wchar_t extensively, it's defined as 16-bit[10] which was enough to represent any Unicode (UCS-2) character, but is now only enough to represent a UTF-16 code unit, which can be half a code point. On other platforms it is defined as 32-bit and a Unicode code point always fits. This difference makes code using wchar_t non-portable. |
| wint_t | Integer type that can hold any value of a wchar_t as well as the value of the macro WEOF. Usually a 32 bit signed value. |
| char8_t[11] | Part of the C standard since C23, in <uchar.h>, a type that is suitable for storing UTF-8 characters.[12] |
| char16_t[13] | Part of the C standard since C11,[14] in <uchar.h>, a type capable of holding 16 bits even if wchar_t is another size. If the macro __STDC_UTF_16__ is defined as 1, the type is used for UTF-16 on that system. This is always the case in C23.[15] C++ does not define such a macro, but the type is always used for UTF-16 in that language.[16] |
| char32_t[13] | Part of the C standard since C11,[17] in <uchar.h>, a type capable of holding 32 bits even if wchar_t is another size. If the macro __STDC_UTF_32__ is defined as 1, the type is used for UTF-32 on that system. This is always the case in C23.[15] C++ does not define such a macro, but the type is always used for UTF-32 in that language.[16] |
| mbstate_t | Contains all the information about the conversion state required from one call to a function to the other. |
Functions
| | Byte string | Wide string | Description[note 1] |
|---|---|---|---|
| String manipulation | strcpy[18] | wcscpy[19] | Copies one string to another |
| | strncpy[20] | wcsncpy[21] | Writes exactly n bytes, copying from source or adding nulls |
| | strcat[22] | wcscat[23] | Appends one string to another |
| | strncat[24] | wcsncat[25] | Appends no more than n bytes from one string to another |
| | strxfrm[26] | wcsxfrm[27] | Transforms a string according to the current locale |
| String examination | strlen[28] | wcslen[29] | Returns the length of the string |
| | strcmp[30] | wcscmp[31] | Compares two strings (three-way comparison) |
| | strncmp[32] | wcsncmp[33] | Compares a specific number of bytes in two strings |
| | strcoll[34] | wcscoll[35] | Compares two strings according to the current locale |
| | strchr[36] | wcschr[37] | Finds the first occurrence of a byte in a string |
| | strrchr[38] | wcsrchr[39] | Finds the last occurrence of a byte in a string |
| | strspn[40] | wcsspn[41] | Returns the number of initial bytes in a string that are in a second string |
| | strcspn[42] | wcscspn[43] | Returns the number of initial bytes in a string that are not in a second string |
| | strpbrk[44] | wcspbrk[45] | Finds in a string the first occurrence of a byte in a set |
| | strstr[46] | wcsstr[47] | Finds the first occurrence of a substring in a string |
| | strtok[48] | wcstok[49] | Splits a string into tokens |
| Miscellaneous | strerror[50] | — | Returns a string containing a message derived from an error code |
| Memory manipulation | memset[51] | wmemset[52] | Fills a buffer with a repeated byte. Since C23, memset_explicit() was added to erase sensitive data. |
| | memcpy[53] | wmemcpy[54] | Copies one buffer to another. Since C23, memccpy() was added to efficiently concatenate strings. |
| | memmove[55] | wmemmove[56] | Copies one buffer to another, possibly overlapping, buffer |
| | memcmp[57] | wmemcmp[58] | Compares two buffers (three-way comparison) |
| | memchr[59] | wmemchr[60] | Finds the first occurrence of a byte in a buffer |
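The table's description of strncpy ("writes exactly n bytes, copying from source or adding nulls") is easy to misread, so a short sketch may help. The helpers `fill_field` and `copy_string` are hypothetical names; the first uses strncpy for a fixed-size null-padded field, the second adds the explicit terminator needed to obtain a proper string.

```c
#include <stddef.h>
#include <string.h>

/* Fills a fixed-size field of exactly n bytes. Short sources are
   padded with nulls; if strlen(src) >= n, the field is completely
   filled and is NOT null-terminated. */
void fill_field(char *field, size_t n, const char *src)
{
    strncpy(field, src, n);
}

/* Copies src into dst as a real string, truncating if necessary.
   The terminator is written explicitly because strncpy may omit it. */
void copy_string(char *dst, size_t dstsize, const char *src)
{
    if (dstsize == 0)
        return;
    strncpy(dst, src, dstsize - 1);
    dst[dstsize - 1] = '\0';
}
```

With a 4-byte field, the source "abcdef" leaves the bytes 'a','b','c','d' and no terminator, while the source "a" leaves 'a' followed by three nulls. Some compilers warn about the first pattern precisely because the result may not be a string.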
Multibyte functions
| Name | Description |
|---|---|
| mblen[61] | Returns the number of bytes in the next multibyte character |
| mbtowc[62] | Converts the next multibyte character to a wide character |
| wctomb[63] | Converts a wide character to its multibyte representation |
| mbstowcs[64] | Converts a multibyte string to a wide string |
| wcstombs[65] | Converts a wide string to a multibyte string |
| btowc[66] | Converts a single-byte character to wide character, if possible |
| wctob[67] | Converts a wide character to a single-byte character, if possible |
| mbsinit[68] | Checks if a state object represents initial state |
| mbrlen[69] | Returns the number of bytes in the next multibyte character, given state |
| mbrtowc[70] | Converts the next multibyte character to a wide character, given state |
| wcrtomb[71] | Converts a wide character to its multibyte representation, given state |
| mbsrtowcs[72] | Converts a multibyte string to a wide string, given state |
| wcsrtombs[73] | Converts a wide string to a multibyte string, given state |
| mbrtoc8[74] | Converts the next multibyte character to a UTF-8 character, given state |
| c8rtomb[75] | Converts a single code point from UTF-8 to a narrow multibyte character representation, given state |
| mbrtoc16[76] | Converts the next multibyte character to a UTF-16 character, given state |
| c16rtomb[77] | Converts a single code point from UTF-16 to a narrow multibyte character representation, given state |
| mbrtoc32[78] | Converts the next multibyte character to a UTF-32 character, given state |
| c32rtomb[79] | Converts a single code point from UTF-32 to a narrow multibyte character representation, given state |
These functions all need an mbstate_t object. The original functions keep it in static memory (which makes them not thread-safe), while the later additions require the caller to maintain it. This was originally intended to track shift states in the mb encodings, but modern ones such as UTF-8 do not need this. However, these functions were designed on the assumption that the wc encoding is not a variable-width encoding and thus are designed to deal with exactly one wchar_t at a time, passing it by value rather than using a string pointer. As UTF-16 is a variable-width encoding, the mbstate_t has been reused to keep track of surrogate pairs in the wide encoding, though the caller must still detect and call mbtowc twice for a single character.[80][81][82] Later additions to the standard admit that the only conversion programmers are interested in is between UTF-8 and UTF-16 and directly provide this.
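A caller-maintained conversion state can be sketched with mbrtowc, one of the restartable functions listed above. The function name `count_multibyte_chars` is hypothetical, and the result depends on the current LC_CTYPE locale: in the default "C" locale only plain ASCII input is guaranteed to convert, so a program handling UTF-8 would normally call setlocale first.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the multibyte characters in s under the current locale.
   Returns (size_t)-1 if an invalid or incomplete sequence is found. */
size_t count_multibyte_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    /* A zeroed mbstate_t represents the initial conversion state. */
    memset(&state, 0, sizeof state);

    while (remaining > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, s, remaining, &state);

        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;   /* invalid or truncated sequence */
        if (used == 0)
            break;               /* defensive: null character decoded */
        s += used;
        remaining -= used;
        count++;
    }
    return count;
}
```

Because the state object lives on the caller's stack, the function is reentrant, unlike mblen or mbtowc, which fall back on hidden static state.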
Numeric conversions
| Byte string | Wide string | Description[note 1] |
|---|---|---|
| atof[83] | — | converts a string to a floating-point value ('atof' means 'ASCII to float') |
| atoi, atol, atoll[84] | — | converts a string to an integer (C99) ('atoi' means 'ASCII to integer') |
| strtof (C99)[85], strtod[86], strtold (C99)[87] | wcstof (C99)[88], wcstod[89], wcstold (C99)[90] | converts a string to a floating-point value |
| strtol, strtoll[91] | wcstol, wcstoll[92] | converts a string to a signed integer |
| strtoul, strtoull[93] | wcstoul, wcstoull[94] | converts a string to an unsigned integer |
The C standard library contains several functions for numeric conversions. The functions that deal with byte strings are defined in the stdlib.h header (cstdlib header in C++). The functions that deal with wide strings are defined in the wchar.h header (cwchar header in C++).
The functions strchr, bsearch, strpbrk, strrchr, strstr, memchr and their wide counterparts are not const-correct, since they accept a const string pointer and return a non-const pointer within the string. This has been fixed in C23.[95]
Also, since the Normative Amendment 1 (C95), atoxx functions are considered subsumed by strtoxxx functions, for which reason neither C95 nor any later standard provides wide-character versions of these functions. The argument against atoxx is that they do not differentiate between an error and a 0.[96]
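The argument against the atoxx functions can be made concrete: atoi returns 0 both for "0" and for "abc", whereas strtol exposes enough information to tell the cases apart. The wrapper name `parse_long` is hypothetical, and its policy of rejecting trailing characters is one possible choice, not something the standard mandates.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses a base-10 long from s. Returns 1 and stores the value in
   *out on success; returns 0 if no digits were found, if trailing
   characters remain, or if the value is out of range for long. */
int parse_long(const char *s, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(s, &end, 10);

    if (end == s)            /* no conversion performed */
        return 0;
    if (*end != '\0')        /* unparsed trailing characters */
        return 0;
    if (errno == ERANGE)     /* overflow or underflow */
        return 0;

    *out = value;
    return 1;
}
```

With this wrapper, "0" parses successfully to zero while "abc" is reported as an error, a distinction that the return value of atoi alone cannot express.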
Popular extensions
| Name | Source | Description |
|---|---|---|
| bzero[97][98] | BSD | Fills a buffer with zero bytes, deprecated by memset |
| memccpy[99] | SVID | Part of the C standard since C23, copies between two non-overlapping memory areas, stopping when a given byte is found. |
| mempcpy[100] | GNU | a variant of memcpy returning a pointer to the byte following the last written byte |
| strcasecmp[101] | BSD | case-insensitive version of strcmp |
| strcat_s[102] | Windows | a variant of strcat that checks the destination buffer size before concatenation |
| strcpy_s[103] | Windows | a variant of strcpy that checks the destination buffer size before copying |
| strdup & strndup[104] | POSIX | Part of the C standard since C23, allocates and duplicates a string |
| strerror_r[105] | POSIX 1, GNU | a variant of strerror that is thread-safe. The GNU version is incompatible with the POSIX one. |
| stricmp[106] | Windows | case-insensitive version of strcmp |
| strlcpy[107] | BSD | a variant of strcpy that truncates the result to fit in the destination buffer[108] |
| strlcat[107] | BSD | a variant of strcat that truncates the result to fit in the destination buffer[108] |
| strsignal[109] | POSIX:2008 | returns string representation of a signal code. Not thread safe. |
| strtok_r[110] | POSIX | a variant of strtok that is thread-safe |
strlcpy, strlcat
Despite the well-established need to replace strcat[22] and strcpy[18] with functions that do not allow buffer overflows, no accepted standard has arisen. This is partly due to the mistaken belief by many C programmers that strncat and strncpy have the desired behavior; however, neither function was designed for this (they were intended to manipulate null-padded fixed-size string buffers, a data format less commonly used in modern software), and the behavior and arguments are non-intuitive and often written incorrectly even by expert programmers.[108]
The most popular[a] replacements are the strlcat[111] and strlcpy[112] functions, which appeared in OpenBSD 2.4 in December 1998.[108] These functions always null-terminate the destination buffer (provided its size is nonzero), truncating the result if necessary, and return the length of the string they tried to create. Comparing that value with the buffer size allows detection of truncation, and adding one to it gives the size of a new buffer that will not truncate. For a long time they were not included in the GNU C library (used by software on Linux), on the basis of allegedly being inefficient,[113] encouraging the use of C strings (instead of some superior alternative form of string),[114][115] and hiding other potential errors.[116][117] Even while glibc lacked support, strlcat and strlcpy were implemented in a number of other C libraries including ones for OpenBSD, FreeBSD, NetBSD, Solaris, OS X, and QNX, as well as in alternative C libraries for Linux, such as libbsd, introduced in 2008,[118] and musl, introduced in 2011,[119][120] and the source code was added directly to other projects such as SDL, GLib, ffmpeg, rsync, and even internally to the Linux kernel. This changed with glibc 2.38, released in 2023: the glibc FAQ notes that the code has been committed[121] and thereby added.[122] These functions were standardized as part of POSIX.1-2024;[123] the Austin Group Defect Tracker ID 986 tracked some of the discussion about such plans for POSIX.
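The truncation-detecting contract described above can be sketched portably. The function `sketch_strlcpy` below is a hypothetical stand-in written for illustration, not the OpenBSD or glibc source; it assumes the commonly documented semantics that the return value equals the length of the source string.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst (capacity dstsize bytes), truncating if needed
   and null-terminating whenever dstsize is nonzero. Returns the length
   of src, so a result >= dstsize signals truncation. */
size_t sketch_strlcpy(char *dst, const char *src, size_t dstsize)
{
    size_t srclen = strlen(src);

    if (dstsize != 0) {
        size_t n = srclen < dstsize ? srclen : dstsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}
```

A caller would typically write `if (sketch_strlcpy(buf, input, sizeof buf) >= sizeof buf)` to handle truncation, and could allocate the return value plus one bytes to hold the complete string.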
As part of its 2004 Security Development Lifecycle, Microsoft introduced a family of "secure" functions including strcpy_s and strcat_s (along with many others).[124] These functions were standardized with some minor changes as part of the optional C11 (Annex K) proposed by ISO/IEC WDTR 24731.[125] These functions perform various checks including whether the string is too long to fit in the buffer. If the checks fail, a user-specified "runtime-constraint handler" function is called,[126] which usually aborts the program.[127][128] These functions attracted considerable criticism because initially they were implemented only on Windows and at the same time warning messages started to be produced by Microsoft Visual C++ suggesting use of these functions instead of standard ones. This has been speculated by some to be an attempt by Microsoft to lock developers into its platform.[129] Experience with these functions has shown significant problems with their adoption and errors in usage, so the removal of Annex K was proposed for the next revision of the C standard.[130] Usage of memset_s has been suggested as a way to avoid unwanted compiler optimizations.[131][132]
See also
- C syntax § Strings – source code syntax, including backslash escape sequences
- C++ string handling
- String functions
- Perl Compatible Regular Expressions (PCRE)
Notes
- ^ On GitHub, there are 7,813,206 uses of strlcpy, versus 38,644 uses of strcpy_s (and 15,286,150 uses of strcpy).[citation needed]
References
- ^ a b c "The C99 standard draft + TC3" (PDF). §7.1.1p1. Retrieved 7 January 2011.
- ^ "The C99 standard draft + TC3" (PDF). §6.4.5p7. Retrieved 7 January 2011.
- ^ "The C99 standard draft + TC3" (PDF). Section 6.4.5 footnote 66. Retrieved 7 January 2011.
- ^ "Relax requirements on wchar_t to match existing practices" (PDF).
- ^ "Fundamental types". en.cppreference.com.
- ^ "The C99 standard draft + TC3" (PDF). §5.1.1.2 Translation phases, p1. Retrieved 23 December 2011.
- ^ "string literals". en.cppreference.com. Retrieved 23 December 2019.
- ^ "stddef.h - standard type definitions". The Open Group. Retrieved 28 January 2017.
- ^ Gillam, Richard (2003). Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard. Addison-Wesley Professional. p. 714. ISBN 9780201700527.
- ^ "c++ - What is the use of wchar_t in general programming?". Stack Overflow. Retrieved 1 August 2022.
- ^ "char, wchar_t, char8_t, char16_t, char32_t". docs.microsoft.com. Retrieved 1 August 2022.
- ^ "char8_t".
- ^ a b "<cuchar> (uchar.h)".
- ^ "char16_t".
- ^ a b "Replacing text macros".
- ^ a b "Fundamental types".
- ^ "char32_t".
- ^ a b "strcpy - cppreference.com". En.cppreference.com. 2 January 2014. Retrieved 6 March 2014.
- ^ "wcscpy - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strncpy - cppreference.com". En.cppreference.com. 4 October 2013. Retrieved 6 March 2014.
- ^ "wcsncpy - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ a b "strcat - cppreference.com". En.cppreference.com. 8 October 2013. Retrieved 6 March 2014.
- ^ "wcscat - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strncat - cppreference.com". En.cppreference.com. 1 July 2013. Retrieved 6 March 2014.
- ^ "wcsncat - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strxfrm - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcsxfrm - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strlen - cppreference.com". En.cppreference.com. 27 December 2013. Retrieved 6 March 2014.
- ^ "wcslen - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strcmp - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcscmp - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strncmp - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcsncmp - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strcoll - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcscoll - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strchr - cppreference.com". En.cppreference.com. 23 February 2014. Retrieved 6 March 2014.
- ^ "wcschr - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strrchr - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcsrchr - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strspn - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcsspn - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strcspn - cppreference.com". En.cppreference.com. 31 May 2013. Retrieved 6 March 2014.
- ^ "wcscspn - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strpbrk - cppreference.com". En.cppreference.com. 31 May 2013. Retrieved 6 March 2014.
- ^ "wcspbrk - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strstr - cppreference.com". En.cppreference.com. 16 October 2013. Retrieved 6 March 2014.
- ^ "wcsstr - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strtok - cppreference.com". En.cppreference.com. 3 September 2013. Retrieved 6 March 2014.
- ^ "wcstok - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strerror - cppreference.com". En.cppreference.com. 31 May 2013. Retrieved 6 March 2014.
- ^ "memset - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wmemset - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "memcpy - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wmemcpy - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "memmove - cppreference.com". En.cppreference.com. 25 January 2014. Retrieved 6 March 2014.
- ^ "wmemmove - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "memcmp - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wmemcmp - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "memchr - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wmemchr - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mblen - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mbtowc - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wctomb - cppreference.com". En.cppreference.com. 4 February 2014. Retrieved 6 March 2014.
- ^ "mbstowcs - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcstombs - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "btowc - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wctob - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mbsinit - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mbrlen - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mbrtowc - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcrtomb - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mbsrtowcs - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcsrtombs - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "mbrtoc8 - cppreference.com". En.cppreference.com.
- ^ "c8rtomb - cppreference.com". En.cppreference.com.
- ^ "mbrtoc16 - cppreference.com". En.cppreference.com.
- ^ "c16rtomb - cppreference.com". En.cppreference.com.
- ^ "mbrtoc32 - cppreference.com". En.cppreference.com.
- ^ "c32rtomb - cppreference.com". En.cppreference.com.
- ^ "6.3.2 Representing the state of the conversion". The GNU C Library. Retrieved 31 January 2017.
- ^ "root/src/multibyte/c16rtomb.c". Retrieved 31 January 2017.
- ^ "Contents of /stable/11/lib/libc/locale/c16rtomb.c". Retrieved 31 January 2017.
- ^ "atof - cppreference.com". En.cppreference.com. 31 May 2013. Retrieved 6 March 2014.
- ^ "atoi, atol, atoll - cppreference.com". En.cppreference.com. 18 January 2014. Retrieved 6 March 2014.
- ^ "strtof, strtod, strtold - cppreference.com". En.cppreference.com. 4 February 2014. Retrieved 6 March 2014.
- ^ "strtof, strtod, strtold - cppreference.com". En.cppreference.com. 4 February 2014. Retrieved 6 March 2014.
- ^ "strtof, strtod, strtold - cppreference.com". En.cppreference.com. 4 February 2014. Retrieved 6 March 2014.
- ^ "wcstof, wcstod, wcstold - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcstof, wcstod, wcstold - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "wcstof, wcstod, wcstold - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strtol, strtoll - cppreference.com". En.cppreference.com. 4 February 2014. Retrieved 6 March 2014.
- ^ "wcstol, wcstoll - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "strtoul, strtoull - cppreference.com". En.cppreference.com. 4 February 2014. Retrieved 6 March 2014.
- ^ "wcstoul, wcstoull - cppreference.com". En.cppreference.com. Retrieved 6 March 2014.
- ^ "WG14-N3020 : Qualifier-preserving standard library functions, v4" (PDF). open-std.org. 13 June 2022.
- ^ C99 Rationale, 7.20.1.1
- ^ "bzero". The Open Group. Retrieved 27 November 2017.
- ^ "bzero(3)". OpenBSD. Retrieved 27 November 2017.
- ^ "memccpy". Pubs.opengroup.org. Retrieved 6 March 2014.
- ^ "mempcpy(3) - Linux manual page". Kernel.org. Retrieved 6 March 2014.
- ^ "strcasecmp(3) - Linux manual page". Kernel.org. Retrieved 6 March 2014.
- ^ "strcat_s, wcscat_s, _mbscat_s". docs.microsoft.com. Retrieved 22 April 2022.
- ^ "strcpy_s, wcscpy_s, _mbscpy_s, _mbscpy_s_l". docs.microsoft.com. Retrieved 22 April 2022.
- ^ "strdup". Pubs.opengroup.org. Retrieved 6 March 2014.
- ^ "strerror(3) - Linux manual page". man7.org. Retrieved 3 November 2019.
- ^ "String | stricmp()". C Programming Expert.com. Retrieved 6 March 2014.
- ^ a b "strlcpy, strlcat — size-bounded string copying and concatenation". OpenBSD. Retrieved 26 May 2016.
- ^ a b c d Todd C. Miller; Theo de Raadt (1999). "strlcpy and strlcat – consistent, safe, string copy and concatenation". USENIX '99.
- ^ "strsignal". Pubs.opengroup.org. Retrieved 6 March 2014.
- ^ "strtok". Pubs.opengroup.org. Retrieved 6 March 2014.
- ^ Todd C. Miller. "strlcpy.c". BSD Cross Reference.
- ^ Todd C. Miller. "strlcat.c". BSD Cross Reference.
- ^ Miller, Damien (October 2005). "Secure Portability" (PDF). Retrieved 26 June 2016.
This [strlcpy and strlcat] API has been adopted by most modern operating systems and many standalone software packages [...]. The notable exception is the GNU standard C library, glibc, whose maintainer steadfastly refuses to include these improved APIs, labelling them "horribly inefficient BSD crap", despite prior evidence that they are faster is most cases than the APIs they replace.
- ^ libc-alpha mailing list Archived 9 June 2007 at the Wayback Machine, selected messages from 8 August 2000 thread: 53, 60, 61
- ^ The ups and downs of strlcpy(); LWN.net
- ^ "Adding strlcpy() to glibc". lwn.net.
Correct string handling means that you always know how long your strings are and therefore you can you memcpy (instead of strcpy).
- ^ – Linux Library Functions Manual from ManKier.com "However, one may question the validity of such optimizations, as they defeat the whole purpose of strlcpy() and strlcat(). As a matter of fact, the first version of this manual page got it wrong."
- ^ "libbsd". Retrieved 21 November 2022.
- ^ "root/src/string/strlcpy.c". Retrieved 28 January 2017.
- ^ "root/src/string/strlcat.c". Retrieved 28 January 2017.
- ^ strlc{py|at} commit
- ^ Discussion of strlcpy and strlcat in glibc 2.38 on Hacker News
- ^ "strlcat". Pubs.opengroup.org. Retrieved 5 September 2024.
- ^ Lovell, Martyn. "Repel Attacks on Your Code with the Visual Studio 2005 Safe C and C++ Libraries". Retrieved 13 February 2015.
- ^ Safe C Library. "The Safe C Library provides bound checking memory and string functions per ISO/IEC TR24731". Sourceforge. Retrieved 6 March 2013.
- ^ "The C11 standard draft" (PDF). §K.3.1.4p2. Retrieved 13 February 2013.
- ^ "The C11 standard draft" (PDF). §K.3.6.1.1p4. Retrieved 13 February 2013.
- ^ "Parameter Validation". 21 October 2022.
- ^ Danny Kalev. "They're at it again". InformIT. Archived from the original on 15 January 2012. Retrieved 10 November 2011.
- ^ "Field Experience With Annex K — Bounds Checking Interfaces". Retrieved 5 November 2015.
- ^ "MSC06-C. Beware of compiler optimizations". SEI CERT C Coding Standard.
- ^ – FreeBSD Library Functions Manual
External links
- Fast memcpy in C, multiple C coding examples to target different types of CPU instruction architectures
C string handling
C strings are sequences of characters terminated by a null character (\0).[1] These strings are not a distinct data type but are typically represented as arrays of the char type, where the length of a string is the number of characters preceding the null terminator.[1] String literals, such as "hello", are sequences of multibyte characters enclosed in double quotes, automatically appended with a null character during compilation to form an array of static storage duration.[1] Modifying the contents of a string literal results in undefined behavior, emphasizing the importance of using modifiable arrays for dynamic string operations.[1]
The primary mechanism for string handling is provided by the <string.h> header in the C standard library, which declares functions for copying, concatenating, comparing, searching, and other operations on strings and arrays of characters.[1] Key copying functions include strcpy, which copies a source string (including its null terminator) to a destination, and strncpy, which copies up to a specified number of characters and pads with nulls if the source is shorter, but does not null-terminate the destination if the source is as long as or longer than the limit.[1] Concatenation is handled by strcat, which appends a source string to a destination, and strncat, a bounded version that always null-terminates the result.[1] Comparison functions like strcmp perform lexicographical comparisons returning negative, zero, or positive values based on the order of the strings, while strncmp limits the comparison to a given number of characters.[1]
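The bounded concatenation described above can be sketched as follows. The helper name `append_bounded` is hypothetical; it relies on the fact that the third argument of strncat limits the number of characters appended, not the total size of the destination, so the remaining room has to be computed by the caller.

```c
#include <stddef.h>
#include <string.h>

/* Appends src to the string already in dst, whose total capacity is
   dstsize bytes. Truncates if needed; strncat always writes the
   terminator. Returns 1 if all of src fit, 0 if it was truncated.
   Assumes dst holds a null-terminated string shorter than dstsize. */
int append_bounded(char *dst, size_t dstsize, const char *src)
{
    size_t used = strlen(dst);
    size_t room = dstsize - used - 1;   /* free bytes, terminator excluded */

    strncat(dst, src, room);
    return strlen(src) <= room;
}
```

Starting from an 8-byte buffer holding "abc", appending "de" succeeds and yields "abcde"; appending "fghij" afterwards reports truncation and leaves "abcdefg".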
Search operations in <string.h> enable locating characters or substrings, such as strchr for the first occurrence of a character in a string or strstr for the first occurrence of a substring, both returning a pointer to the match or NULL if not found.[1] The strlen function computes the length of a string by counting characters before the null terminator, excluding the terminator itself.[1] Miscellaneous utilities include strerror, which maps an error number to a descriptive string, and memory-oriented functions like memcpy for byte copying (without assuming null termination) and memset for filling memory blocks.[1] Many functions assume valid null-terminated inputs and sufficient destination space; violations, such as overlapping source and destination or buffer overflows, lead to undefined behavior.[1]
For wide-character strings (using wchar_t), equivalent functions are available in <wchar.h>, such as wcscpy and wcschr, supporting multibyte and wide-character encodings.[1] String input and output often intersect with <stdio.h>, where functions like fgets read lines into a character array (adding a null terminator and handling newlines) and fputs writes a string to a stream without its null terminator.[1] These conventions, standardized in ISO/IEC 9899, promote portability across systems while requiring programmers to manage memory bounds explicitly to avoid common pitfalls like buffer overruns.[1]
Fundamentals
Definitions and Representation
In the C programming language, a string is defined as a contiguous sequence of characters terminated by and including the first null character, which has the value 0 (denoted as '\0'). This null terminator serves as a sentinel value indicating the end of the string, distinguishing it from mere arrays of characters. Unlike languages with dedicated string types, C provides no built-in string data type; instead, strings are represented as arrays of the char type (for narrow strings) or wchar_t (for wide strings), where the pointer to the string points to its initial character.[2]
The memory layout of a C string consists of a sequence of bytes in contiguous memory, followed by the null terminator, which is not included in the string's length. For instance, the string "hello" occupies six bytes: five for the characters 'h', 'e', 'l', 'l', 'o' and one for '\0'. String literals, written in double quotes, are arrays with static storage duration that implementations commonly place in read-only memory; modifying one is undefined behavior. In contrast, modifiable strings can be declared as arrays, like char str[6];, allowing runtime assignment while reserving space for the terminator. Pointers to strings, such as char *str = "world";, reference the literal without copying it, reflecting C's pointer-based approach to string handling.
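The difference between storage size and length, and between a literal and a modifiable array, can be illustrated with a small sketch. The function names here are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Bytes occupied by the literal "hello": five characters plus '\0'. */
size_t hello_storage(void) {
    return sizeof "hello";      /* the literal has array type char[6] */
}

/* Length of the same text: the terminator is not counted. */
size_t hello_length(void) {
    return strlen("hello");
}

/* Copies the literal into a modifiable array and changes one character.
   Writing through a pointer to the literal itself would be undefined. */
char first_after_edit(void) {
    char buf[6] = "hello";      /* array initialized from the literal */
    buf[0] = 'j';               /* allowed: buf is an ordinary array */
    return buf[0];
}
```

Calling hello_storage() and hello_length() yields 6 and 5 respectively, matching the six-byte layout described above.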
The length of a C string lacks an inherent field or metadata; it must be determined manually by traversing the array until the null terminator is encountered, often via a loop or the strlen function from the standard library. This design relies on the execution character set to interpret byte values as characters, though the structural representation remains independent of specific encodings.
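A minimal sketch of the traversal, assuming a hypothetical helper named manual_length that mirrors what strlen computes:

```c
#include <stddef.h>

/* Walks the array until the null terminator and returns the number
   of characters seen before it. The argument must be null-terminated. */
size_t manual_length(const char *s) {
    const char *p = s;
    while (*p != '\0') {
        p++;
    }
    return (size_t)(p - s);
}
```

Because the scan stops at the first zero byte, an embedded '\0' cuts the reported length short: for "ab\0cd" the result is 2. The cost is linear in the string length on every call.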
Character Encodings
In the C programming language, strings are fundamentally sequences of bytes represented by the char type, with ASCII serving as the foundational single-byte encoding for the basic 7-bit character set comprising 128 characters, including control codes and printable English letters. This encoding, standardized as American Standard Code for Information Interchange, assigns unique 7-bit values (0-127) to these characters, allowing them to fit within an 8-bit byte while leaving the eighth bit initially unused or available for extensions.[3][4]
As computing needs expanded beyond English-centric text, 8-bit extensions to ASCII emerged, such as the ISO-8859 family of standards, which define 256-character sets by utilizing the full byte range to include accented Latin characters, symbols, and region-specific glyphs while preserving the first 128 ASCII codes for compatibility. For broader international support, particularly in East Asian languages requiring thousands of characters, multibyte encodings like EUC (Extended UNIX Code) and UTF-8 were adopted; EUC employs fixed or variable byte sequences for CJK (Chinese, Japanese, Korean) ideographs, while UTF-8 provides a variable-width scheme (1-4 bytes per character in practice) that backward-compatibly encodes ASCII in its first 128 code points and extends to the full Unicode repertoire. These shifts addressed limitations in single-byte systems but introduced complexities in C's byte-oriented model.[4]
The char type in C is inherently byte-oriented, with its signedness implementation-defined: it may behave like signed char (typically range -128 to 127) or unsigned char (typically 0 to 255). Where char is signed, bytes with values 128-255 are seen as negative, which can affect arithmetic operations and comparisons involving non-ASCII characters. Historically, C originated in UNIX environments during the 1970s, assuming ASCII as the sole encoding, as documented in the original K&R specification; subsequent ISO C standards evolved this foundation, with C90 defining the wchar_t type and the 1995 amendment adding the <wchar.h> functions for wide-character text, reflecting growing demands for internationalization.[4][1][3]
In variable-width encodings like UTF-8 and EUC, a key implication for C strings is the divergence between byte length (measured by functions like strlen) and the visual or semantic character count, as multi-byte characters inflate storage without a proportional increase in perceived length; for instance, a single Unicode ideograph might span three bytes, leading to potential mismatches in indexing or rendering if not accounted for. The null terminator, always the byte value 0 (ASCII NUL), remains invariant across encodings, serving as a reliable endpoint regardless of character width.[3]
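The gap between byte length and character count can be sketched for UTF-8 as follows. The helper utf8_code_points is hypothetical, and it assumes its input is already valid UTF-8; it performs no validation.

```c
#include <stddef.h>
#include <string.h>

/* Counts code points in a UTF-8 string by counting every byte that is
   not a continuation byte (continuation bytes have the form 10xxxxxx).
   Assumes the input is valid UTF-8; malformed input is not detected. */
size_t utf8_code_points(const char *s) {
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the two-ideograph string "\xE6\x97\xA5\xE6\x9C\xAC", strlen reports 6 bytes while this function reports 2 code points. Code points are still not the same as user-perceived characters, which may combine several code points.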
Standard Library Overview
Headers and Declarations
The primary header for C string handling is <string.h>, which declares the majority of functions for manipulating null-terminated byte strings, along with the NULL macro and the size_t type.[5] This header forms the core of the ISO C standard library's string facilities, providing prototypes for functions that perform operations like copying, concatenation, and searching on character arrays. It ensures portability across compliant implementations by standardizing the interface for these operations.
A secondary header, <strings.h>, declares legacy BSD functions such as bcopy and bzero, which operate on memory blocks; it is not part of the ISO C standard but is specified by POSIX.[6] These functions overlap in purpose with memmove and memset from <string.h> and are kept mainly for legacy compatibility in Unix-like environments; POSIX.1-2008 removed bcopy and bzero from the standard, although many systems still provide them.
For wide-character strings, the <wchar.h> header provides declarations for functions like wcslen, enabling handling of multibyte or wide-oriented strings in a locale-aware manner. This header extends the byte-string model to support international character sets, defining types such as wchar_t and wint_t essential for wide string operations.
Multibyte string conversions and locale-dependent behaviors rely on headers like <stdlib.h>, which declares functions such as mbstowcs for multibyte-to-wide conversions, and <locale.h>, which provides setup functions like setlocale to configure locale categories affecting string processing. These headers integrate with <string.h> and <wchar.h> to support non-ASCII character handling in internationalized applications.
These headers are included with C preprocessor directives, typically #include <header.h> statements at the top of source files. Standard headers can be included more than once and in any order; user-written headers conventionally use include guards such as #ifndef HEADER_H and #define HEADER_H to prevent redundant inclusion across multiple files.[7] To access POSIX-specific features without conflicts, feature test macros such as _POSIX_C_SOURCE (e.g., defined to 200809L for POSIX.1-2008) are set before including any header, controlling the visibility of extensions like those in <strings.h>.[8] This practice enables conditional compilation based on the target system's conformance level.[9]
The evolution of these headers reflects updates in the ISO C standards: the core declarations in <string.h> were established in C89 (ISO/IEC 9899:1990), and C99 (ISO/IEC 9899:1999) added the restrict qualifier to function prototypes, which documents that source and destination must not overlap and lets compilers optimize on that assumption.[10] The <strings.h> functions originated in BSD Unix implementations and were later standardized by POSIX.[6] Later revisions, such as C11 (ISO/IEC 9899:2011), refined multibyte support in <stdlib.h> and <locale.h> for better thread safety and internationalization.
Constants and Data Types
In C string handling, several predefined constants and data types are essential for representing sizes, states, and null pointers, ensuring portability and type safety across implementations. These are defined in the C standard library headers such as <stddef.h>, <stdlib.h>, and <wchar.h>, providing foundational elements for operations on null-terminated strings and multibyte/wide character sequences.[11]
The NULL macro represents an implementation-defined null pointer constant, typically defined as the integer constant expression 0 or as (void *)0. It denotes a pointer that refers to no object and is used, for example, as the "not found" or failure result of pointer-returning functions such as strchr; it is distinct from the null character '\0' that terminates a string.[11] It is declared in multiple headers including <stddef.h>, <stdio.h>, <stdlib.h>, <string.h>, <time.h>, <wchar.h>, and <locale.h>, ensuring consistent usage for pointer comparisons and initializations in string contexts.[11]
The size_t type is an unsigned integer type capable of representing the size of any object in bytes, as returned by the sizeof operator, and is the standard type for specifying lengths and counts in string functions, such as the return value of strlen.[11] It is defined in <stddef.h> and has a range sufficient to hold the maximum addressable object size on the implementation.[11] Introduced in earlier standards and retained in C11, size_t promotes portability by abstracting platform-specific size representations.[4]
C11 introduces the rsize_t type in its optional Annex K as a restricted counterpart of size_t, also an unsigned integer type from <stddef.h>, intended to hold values in the range [0, RSIZE_MAX]. RSIZE_MAX is at most SIZE_MAX, and the standard recommends a smaller value (for example SIZE_MAX >> 1 on systems with large address spaces) to enable runtime bounds checking in the secure library functions.[11] This type supports the Annex K bounds-checking interfaces by making it possible to detect invalid sizes, such as those exceeding available memory or produced by converting a negative signed value to an unsigned type.[11]
For multibyte character handling, the mbstate_t type is an opaque object type, other than an array, used to maintain the conversion state during conversions between multibyte and wide character sequences, declared in <wchar.h>.[11] It tracks partial conversion states across function calls, ensuring correct parsing of locale-dependent multibyte encodings like UTF-8 or Shift-JIS.[11] Complementing this, the MB_CUR_MAX macro expands to a positive size_t expression giving the maximum number of bytes required for any multibyte character in the current locale; it is defined in <stdlib.h>, and its value never exceeds the constant MB_LEN_MAX from <limits.h>, whose value varies by implementation.[11]
Wide character support relies on the wchar_t type, an implementation-defined integer type from <stddef.h> and <wchar.h> whose range encompasses all distinct codes in the largest extended character set among supported locales, often 32 bits to accommodate Unicode.[11] The wint_t type, also from <wchar.h>, is an integer type capable of storing any valid wchar_t value plus the special WEOF endpoint, with a minimum range of -32767 to 32767 if signed or 0 to 65535 if unsigned, facilitating input/output operations on wide streams.[11]
Core String Functions
Manipulation and Copying
C string handling provides several functions in the <string.h> header for copying and modifying strings, which are essential for tasks like duplicating data or building composite strings. These functions operate on null-terminated character arrays and vary in their bounds checking and behavior. The primary copying functions are strcpy and strncpy, which handle string-level operations including null terminators, while memcpy and memmove perform byte-level copies suitable for strings but without automatic null handling.[12]
The strcpy function copies the entire source string, including its null terminator, into the destination buffer, overwriting any existing content in the destination.
char *strcpy(char *restrict dest, const char *restrict src);
strncpy copies at most n bytes from the source to the destination, stopping early if the source ends before n characters.
char *strncpy(char *restrict dest, const char *restrict src, size_t n);
If the source is shorter than n, strncpy pads the destination with null bytes up to n characters; however, it does not null-terminate the result if the source is n bytes or longer, leaving the destination without a terminator.[14] This padding behavior originated from the need to handle fixed-length fields, such as 14-character filenames in early UNIX directory entries, where full padding kept the fields at a consistent size and left no stale bytes after a short name.[15] The function was introduced alongside strcpy in the Seventh Edition of UNIX in 1979.[15]
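Because strncpy may leave the destination unterminated, a common pattern is to copy one byte less than the buffer size and write the terminator explicitly. The following is a minimal sketch; copy_truncated is a hypothetical helper, not a standard function.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dest (capacity dest_size bytes), truncating if needed
   and always null-terminating. Returns 1 if the whole source fit and
   0 if it was truncated or dest_size is 0. */
int copy_truncated(char *dest, size_t dest_size, const char *src) {
    if (dest_size == 0) {
        return 0;
    }
    strncpy(dest, src, dest_size - 1);  /* may leave dest unterminated */
    dest[dest_size - 1] = '\0';         /* so terminate explicitly */
    return strlen(src) < dest_size;
}
```

With a four-byte buffer, copying "abcdef" leaves "abc" in the buffer and reports truncation. Note that the null padding means strncpy always writes dest_size - 1 bytes, which can be wasteful for large buffers.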
For appending, strcat concatenates the source string to the end of the destination by overwriting the destination's null terminator and adding a new one.
char *strcat(char *restrict dest, const char *restrict src);
Like strcpy, it returns the destination pointer and performs no bounds checking, so the destination must have enough space for both its original content and the source. The strncat function limits the append to at most n characters from the source (excluding the null terminator), always ensuring the result is null-terminated, even if fewer than n characters are appended.
char *strncat(char *restrict dest, const char *restrict src, size_t n);
It appends the source string or its first n characters, whichever is shorter, and then writes a terminator, so the destination needs room for up to n + 1 additional bytes; it returns the destination pointer.
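Since the n argument of strncat limits the characters taken from the source rather than the size of the destination, the caller has to compute the remaining room. A minimal sketch, assuming a hypothetical helper append_bounded and a destination that already holds a terminated string shorter than its capacity:

```c
#include <stddef.h>
#include <string.h>

/* Appends as much of src as fits into dest, whose total capacity is
   dest_size bytes. Returns the number of characters appended. */
size_t append_bounded(char *dest, size_t dest_size, const char *src) {
    size_t used = strlen(dest);
    if (used + 1 >= dest_size) {
        return 0;                        /* no room beyond the terminator */
    }
    size_t room = dest_size - used - 1;  /* reserve one byte for '\0' */
    strncat(dest, src, room);            /* writes at most room + 1 bytes */
    return strlen(dest) - used;
}
```

Appending "barbaz" to an eight-byte buffer holding "foo" produces "foobarb" and returns 4.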
Byte-level functions like memcpy and memmove can also manipulate strings by copying raw memory blocks, useful when null terminators are managed separately or for non-overlapping transfers.
void *memcpy(void *restrict dest, const void *restrict src, size_t n);
void *memmove(void *dest, const void *src, size_t n);
memcpy copies exactly n bytes and has undefined behavior if the source and destination regions overlap, while memmove handles overlapping regions correctly, behaving as if the bytes were first copied to a temporary buffer.[16] Both return the destination pointer. Neither function appends or verifies null terminators, so they require explicit handling for string safety.
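An in-place edit is a typical case where the regions overlap and memmove is required. The sketch below uses a hypothetical helper, drop_prefix, to remove leading characters from a string.

```c
#include <stddef.h>
#include <string.h>

/* Removes the first `count` characters of s in place. Source and
   destination overlap, so memmove is used rather than memcpy. */
void drop_prefix(char *s, size_t count) {
    size_t len = strlen(s);
    if (count > len) {
        count = len;
    }
    /* len - count characters remain; +1 also moves the terminator. */
    memmove(s, s + count, len - count + 1);
}
```

Applied to a buffer containing "hello world" with a count of 6, the buffer afterwards holds "world".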
In C23, allocation-based duplication functions strdup and strndup were standardized, providing dynamic memory allocation for string copies. The strdup function allocates sufficient memory and copies the entire source string, including the null terminator, returning a pointer to the new string or NULL on failure (POSIX additionally specifies that errno is set to ENOMEM).
char *strdup(const char *src);
The strndup function copies at most n characters from the source, always null-terminating the result, and allocates exactly the required space plus the terminator.
char *strndup(const char *src, size_t n);
The caller must release the returned memory with free to avoid memory leaks; these functions offer a safer alternative for duplicating strings without pre-allocated buffers.[17]
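On toolchains that predate C23 and do not expose the POSIX declaration, the same effect can be obtained with malloc and memcpy. The following sketch uses a hypothetical stand-in named duplicate_string; it is not the library's implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Allocates a copy of src, including its terminator. Returns NULL if
   allocation fails. The caller owns the result and must free it. */
char *duplicate_string(const char *src) {
    size_t size = strlen(src) + 1;   /* +1 for the null terminator */
    char *copy = malloc(size);
    if (copy != NULL) {
        memcpy(copy, src, size);
    }
    return copy;
}
```

Unlike a pointer to a string literal, the returned buffer is modifiable, and it remains valid until passed to free.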
Unbounded functions like strcpy and strcat pose significant buffer overflow risks if the destination buffer lacks sufficient space, allowing attackers to overwrite adjacent memory and potentially execute arbitrary code. For instance, in historical exploits such as variants of the Code Red worm, unchecked copies via similar unbounded string operations enabled remote code execution by overflowing stack buffers.[18] Even bounded functions like strncpy and strncat can contribute to overflows if n exceeds available space or if non-termination leads to subsequent mishandling. Modern alternatives, such as BSD's strlcpy, address these by enforcing bounds and guaranteeing termination, though they are not part of the ISO C standard.[18]
Searching and Substring Operations
C string handling provides several functions in the <string.h> header for locating specific characters or substrings within null-terminated byte strings, enabling pattern matching without modifying the original data. Except for strtok, which writes to its input, these functions return pointers to the found positions or NULL if no match exists, facilitating subsequent operations like extraction or analysis. They have been defined since the C89 standard and remain part of subsequent revisions, including C99, C11, C17, and C23.
The strchr function searches a null-terminated byte string for the first occurrence of a specified character, which is passed as an int and converted to char. It scans from the beginning of the string pointed to by str until it finds the character or reaches the null terminator, which is also considered part of the searchable content. If found, it returns a pointer to that character within the original string; otherwise, it returns NULL. For example, strchr("hello", 'l') returns a pointer to the first 'l'. Because the terminator is searchable, searching for '\0' returns a pointer to the end of the string.[19][20]
Complementing strchr, the strrchr function finds the last occurrence of the character in the string and returns a pointer to it, or NULL if none is found. This is useful for tasks like extracting the final component of a path, as in strrchr("/path/to/file.txt", '/') returning a pointer to the last '/'. Like strchr, it considers the null terminator part of the string, so searching for '\0' yields a pointer to the string's end. Both functions exhibit undefined behavior if the input string pointer is NULL or not properly null-terminated.[21][20]
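The path example can be extended into a short sketch. Both helpers below are hypothetical and assume '/' as the only path separator.

```c
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the final component of path (the part after
   the last '/'), or path itself if it contains no '/'. */
const char *file_name(const char *path) {
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}

/* Returns a pointer to the extension of the final component, including
   the dot, or NULL if there is none. A leading dot, as in ".bashrc",
   is not treated as an extension. */
const char *file_extension(const char *path) {
    const char *name = file_name(path);
    const char *dot = strrchr(name, '.');
    return (dot != NULL && dot != name) ? dot : NULL;
}
```

For "/path/to/file.txt" these return pointers to "file.txt" and ".txt" inside the original string; no copy is made.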
For substring searches, strstr locates the first occurrence of a null-terminated substring needle within another null-terminated byte string haystack, without comparing the terminating null characters. It returns a pointer to the start of the matching substring in haystack or NULL if no match is found. If needle is an empty string (i.e., just a null terminator), strstr returns haystack itself. For instance, strstr("one two three", "two") points to the 't' in "two". The function does not support overlapping matches explicitly; it finds the leftmost occurrence. Undefined behavior occurs if either pointer is NULL or the strings are not null-terminated. Since C23, a type-generic variant adjusts the return type based on input constness.[22][23]
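Because strstr reports only the leftmost match, finding further matches means calling it again from a later position. A minimal sketch that counts non-overlapping occurrences, using a hypothetical helper count_occurrences:

```c
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of needle in haystack. An empty
   needle is treated as having no occurrences to avoid an endless loop. */
size_t count_occurrences(const char *haystack, const char *needle) {
    size_t needle_len = strlen(needle);
    if (needle_len == 0) {
        return 0;
    }
    size_t count = 0;
    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + needle_len, needle)) {
        count++;
    }
    return count;
}
```

Resuming at p + needle_len skips past each match; resuming at p + 1 instead would count overlapping matches as well.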
The strpbrk function scans a null-terminated byte string for the first occurrence of any character from a specified set of bytes in another null-terminated string breakset. It returns a pointer to that character in the original string or NULL if no match exists. This is efficient for delimiter detection, such as strpbrk("hello world", " \t") returning a pointer to the space. The search treats breakset as a set, ignoring duplicates and order. Like other functions, it invokes undefined behavior for NULL pointers or non-null-terminated inputs. It stops at the first match, without considering overlaps.[24][25]
Tokenization is handled by strtok, which breaks a string into a sequence of tokens separated by delimiters from a null-terminated set. The first call provides the string pointer and delimiters; subsequent calls pass NULL for the string to continue from the previous position, using an internal static pointer for state. It modifies the original string by replacing delimiters with null bytes and returns a pointer to each token or NULL when no more tokens exist. Consecutive delimiters are treated as one, and empty tokens are skipped. For example, tokenizing "A,B,,D" with "," as delimiter yields "A", "B", and "D". This non-reentrant design, relying on static storage, makes strtok unsuitable for multithreaded use or recursive calls. Undefined behavior results from NULL inputs or non-null-terminated strings; an empty string or all-delimiters case returns NULL immediately. A bounds-checked, reentrant variant strtok_s was introduced in C11 for safer usage.[26][27]
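The calling pattern can be sketched as follows, using a hypothetical helper split_commas. The input must be a modifiable array, since strtok overwrites delimiters with null bytes.

```c
#include <stddef.h>
#include <string.h>

/* Splits text in place at commas and stores pointers to up to
   max_tokens tokens. Returns the number of tokens stored. */
size_t split_commas(char *text, char *tokens[], size_t max_tokens) {
    size_t count = 0;
    for (char *tok = strtok(text, ",");
         tok != NULL && count < max_tokens;
         tok = strtok(NULL, ",")) {      /* NULL continues the same scan */
        tokens[count++] = tok;
    }
    return count;
}
```

For the array "A,B,,D" this stores three tokens, "A", "B", and "D"; the empty field between the consecutive commas is skipped, so strtok is unsuitable where empty fields are significant.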
For byte-level searches in arbitrary memory blocks, memchr examines up to count bytes starting from ptr, seeking the first occurrence of a byte value (converted from int to unsigned char). It returns a void pointer to the matching byte or NULL if not found within the range. Unlike string functions, it does not require null termination and operates on raw memory, making it suitable for binary data. For example, memchr("hello", 'l', 3) finds the first 'l' within the first three bytes. If count is zero, it returns NULL without accessing memory. NULL ptr or exceeding buffer bounds leads to undefined behavior. Since C11, it is well-defined if a match is within a smaller accessible array. A type-generic version exists in C23.[28][29]
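A brief sketch of memchr on binary data that contains a zero byte; byte_index is a hypothetical helper that converts the returned pointer into an offset.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index of the first byte equal to value within the first
   n bytes of buf, or -1 if it does not occur there. */
long byte_index(const void *buf, size_t n, unsigned char value) {
    const unsigned char *hit = memchr(buf, value, n);
    return hit != NULL ? (long)(hit - (const unsigned char *)buf) : -1;
}
```

The search continues past zero bytes, so in the buffer {0x10, 0x00, 0x20, 0x30} the value 0x20 is found at index 2, whereas a string function would have stopped at index 1.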
These functions handle edge cases consistently but require careful invocation to avoid undefined behavior. Passing NULL pointers or non-null-terminated strings results in undefined behavior across all, potentially causing crashes or incorrect results. When the string being searched is empty, strchr and strrchr return NULL unless searching for '\0', in which case they point to the terminator; strstr returns NULL unless the needle is also empty, and for an empty needle it always returns its first argument; strpbrk returns NULL if either string is empty; strtok returns NULL immediately; and memchr with zero count returns NULL. Each call reports a single position, so locating further matches, including overlapping ones, requires calling the function again from a later position. Length awareness, often via strlen, aids in bounding searches to prevent overruns.[19][22][26]
Comparison and Ordering
In C string handling, comparison functions enable lexicographical ordering of strings based on their character representations, facilitating tasks such as sorting arrays of strings or validating equality between text data. These operations typically interpret characters as unsigned bytes for byte-wise comparison, stopping at the null terminator for null-terminated strings or at a specified length limit. The results indicate relative order: a negative value if the first string precedes the second, zero if they are equal, and positive if the first follows the second.[30][31]
The strcmp function performs a case-sensitive, byte-wise comparison of two null-terminated strings, s1 and s2, by examining characters from the beginning until a difference is found or both reach their null terminators. It returns a negative, zero, or positive integer; the sign is that of the difference between the first differing bytes interpreted as unsigned char, and the standard does not specify the magnitude. For instance, if s1 is "apple" and s2 is "banana", strcmp returns a negative value since 'a' (ASCII 97) is less than 'b' (ASCII 98). This function is defined in the ISO C standard and is commonly used for simple equality checks or as a comparator in sorting algorithms like quicksort on string arrays.[30][31]
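Using strcmp as a sort comparator requires a small adapter, because qsort passes pointers to the array elements rather than the elements themselves. A minimal sketch with hypothetical names:

```c
#include <stdlib.h>
#include <string.h>

/* qsort hands the comparator pointers to elements; each element is
   itself a pointer to a string, hence the extra level of indirection. */
static int compare_strings(const void *a, const void *b) {
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sorts an array of string pointers into byte-wise ascending order. */
void sort_strings(const char *items[], size_t count) {
    qsort(items, count, sizeof items[0], compare_strings);
}
```

Sorting {"pear", "apple", "banana"} this way yields "apple", "banana", "pear". Only the pointers are rearranged; the strings themselves are not moved.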
To limit comparisons to a specific number of bytes and avoid risks from unterminated or overly long strings, the strncmp function compares at most n characters of two possibly null-terminated strings, treating a null character as less than any other. It returns zero if the first n bytes match (or if n is zero), and otherwise a negative or positive value whose sign reflects the first mismatched bytes, which suits scenarios like comparing fixed-length fields in protocols. For example, strncmp("hello", "help", 3) returns zero because the first three bytes match, despite the full strings differing. This variant is also part of the ISO C standard and is recommended for bounded comparisons to prevent buffer overruns.[32][31]
For binary-safe comparisons beyond null-terminated strings, the memcmp function compares the first n bytes of two memory blocks pointed to by ptr1 and ptr2, interpreting them as unsigned bytes without regard for null terminators. It returns zero if all n bytes match, and otherwise a negative or positive value whose sign reflects the first differing bytes, making it suitable for verifying equality of binary data structures or string buffers that may contain embedded nulls. Unlike string-specific functions, memcmp does not stop early at nulls, so it requires an exact length to avoid undefined behavior from overreading. This function is integral to the ISO C standard and is often employed in low-level data validation or hashing contexts.[33][31]
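The effect of embedded nulls can be shown with a short sketch; buffers_equal is a hypothetical wrapper around memcmp.

```c
#include <stddef.h>
#include <string.h>

/* Returns nonzero if the first n bytes of a and b are identical.
   Both buffers must be at least n bytes long. */
int buffers_equal(const void *a, const void *b, size_t n) {
    return memcmp(a, b, n) == 0;
}
```

Given the arrays "ab\0cd" and "ab\0ce", strcmp reports equality because it stops at the embedded null, while buffers_equal over all six bytes reports a difference.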
Locale-aware comparisons are provided by the strcoll function, which orders two null-terminated strings according to the collation rules defined in the current locale's LC_COLLATE category, rather than raw byte values. This accounts for cultural sorting conventions, such as treating accented characters appropriately in non-English locales, and returns a negative, zero, or positive value based on the locale-specific order. For example, in a French locale, "é" might collate after "e" but before "f", differing from byte-order comparisons. Defined in the ISO C standard, strcoll is essential for internationalized applications like database indexing or user interface sorting where locale impacts perceived order.[34][31]
POSIX systems extend these with case-insensitive variants: strcasecmp compares two null-terminated strings ignoring case differences, behaving as if both were converted to lowercase in the POSIX locale, while strncasecmp limits this to n bytes. Both return values analogous to strcmp and strncmp, supporting use cases like user input matching or file name sorting where case variations should not affect order. These functions, declared in <strings.h>, originated in BSD and are standardized in POSIX.1-2001, but are not part of ISO C.[35]
Overall, these functions underpin string ordering in C programs, with byte-wise methods suiting performance-critical or encoding-agnostic needs, while locale support enhances portability across languages; encoding choices can influence order in non-ASCII contexts, though byte comparisons remain consistent within a given encoding.[30][31]
Numeric Conversions
The numeric conversion functions in the C standard library enable the parsing of integer and floating-point values from null-terminated byte strings, facilitating the transformation of textual representations into machine-readable numeric types. These functions are declared in <stdlib.h> and are essential for processing user input, configuration files, or data streams containing embedded numbers. They typically skip leading whitespace, interpret optional signs, and stop at the first invalid character, providing mechanisms to detect parsing errors and overflows.[36]
The simplest integer conversion functions are atoi, atol, and atoll, which interpret a string as a base-10 integer and return values of type int, long, and long long, respectively. For example, atoi("123") returns 123, while atoi("-456") returns -456; these functions discard leading whitespace and halt at the first non-digit after the optional sign. However, they offer no explicit error reporting: if no valid conversion occurs, they return 0, and if the value exceeds the return type's range, the behavior is undefined. atoll was introduced in C99 to support 64-bit integers.[36]
For more robust integer parsing, strtol and strtoul convert strings to signed and unsigned long integers, respectively, supporting bases from 2 to 36 or auto-detection (base 0). The syntax is long strtol(const char *str, char **endptr, int base);, where endptr (if non-null) receives a pointer to the first unconverted character, allowing detection of invalid input. For instance, strtol("10FG", &endptr, 16) converts the prefix "10F" to 271 (0x10F) and sets endptr to point at the 'G', the first character that is not a valid base-16 digit. Letters A-Z or a-z represent values 10-35 in higher bases. These functions return 0 if no conversion is possible and clamp to LONG_MIN/LONG_MAX or ULONG_MAX on overflow, setting errno to ERANGE. strtoll and strtoull extend this to long long since C99.[37]
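The checks described above can be combined into one routine. The following sketch uses a hypothetical helper, parse_long, that accepts a string only if all of it is a base-10 number within range.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses the whole of text as a base-10 long. Returns 1 and stores the
   value in *out on success; returns 0 if there are no digits, trailing
   characters remain, or the value is out of range. */
int parse_long(const char *text, long *out) {
    char *end;
    errno = 0;                       /* strtol only ever sets errno */
    long value = strtol(text, &end, 10);
    if (end == text) {
        return 0;                    /* nothing was converted */
    }
    if (*end != '\0') {
        return 0;                    /* trailing unconverted input */
    }
    if (errno == ERANGE) {
        return 0;                    /* clamped to LONG_MIN or LONG_MAX */
    }
    *out = value;
    return 1;
}
```

This accepts "123" and "  -456" (leading whitespace is skipped by strtol) and rejects "12abc", the empty string, and values too large for long. Trailing whitespace is rejected by this particular sketch; whether that is desirable depends on the application.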
Floating-point conversion is handled by strtod, which parses a string into a double value, supporting both decimal and scientific notation as well as hexadecimal floating-point formats. The function signature is double strtod(const char *str, char **endptr);, mirroring strtol in its use of endptr for partial parsing detection. It accepts formats like "3.14", "-2.5e+3", or "0x1.8p3" (hexadecimal with binary exponent), skipping leading whitespace and an optional sign. On success, it returns the converted value; if no conversion is possible it returns 0. On overflow it returns plus or minus HUGE_VAL and sets errno to ERANGE; on underflow it returns a value of very small magnitude (possibly 0), and whether errno is set to ERANGE in that case is implementation-defined. Variants strtof and strtold target float and long double.[38]
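A sketch of the same validation pattern for strtod follows. The helper parse_double is hypothetical, and the example assumes the default "C" locale, in which the decimal point is a period.

```c
#include <errno.h>
#include <math.h>
#include <stdlib.h>

/* Parses the whole of text as a double. Returns 1 and stores the value
   in *out on success; returns 0 if nothing was converted, trailing
   characters remain, or the value overflowed. Underflow is accepted. */
int parse_double(const char *text, double *out) {
    char *end;
    errno = 0;
    double value = strtod(text, &end);
    if (end == text || *end != '\0') {
        return 0;
    }
    if (errno == ERANGE && (value == HUGE_VAL || value == -HUGE_VAL)) {
        return 0;                    /* overflow */
    }
    *out = value;
    return 1;
}
```

Under these assumptions "-2.5e+3" parses to -2500.0 and "0x1.8p3" to 12.0, while "1e999" is rejected as an overflow on implementations where double is the usual 64-bit IEEE format.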
The scanf family, including sscanf for string-based input, provides formatted numeric parsing via specifiers like %d for integers and %f for floats. For example, sscanf(buf, "%d %f", &i, &f) assigns a decimal integer to i and a floating-point value to f from the string buf, consuming leading whitespace and respecting field widths (e.g., %5d limits to five characters). %d behaves like strtol with base 10, while %f matches strtod's formats, including scientific notation. To prevent buffer overflows in %s (string input), specify width like %10s. The functions return the number of successful assignments; a mismatch or EOF yields a lower count or EOF, enabling error detection without endptr. Secure variants like sscanf_s (C11) add runtime checks for invalid pointers and overflows.[39]
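Checking the return value of sscanf against the number of expected assignments can be sketched as follows. The record layout (a name, a count, and a price) and the helper parse_record are hypothetical.

```c
#include <stdio.h>

/* Parses a line of the form "<name> <count> <price>". The name buffer
   must hold at least 16 bytes; the %15s width leaves room for '\0'.
   Returns 1 only if all three fields were assigned. */
int parse_record(const char *line, char name[16], int *count, float *price) {
    return sscanf(line, "%15s %d %f", name, count, price) == 3;
}
```

For "widget 3 2.5" all three fields are assigned. For "widget x 2.5" the call stops at the mismatched 'x' after one assignment, so the helper reports failure; the count and price are left unmodified in that case.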
Error handling in these conversions emphasizes checking for overflows via ERANGE in errno (which must be zeroed beforehand) and invalid inputs through endptr or scanf's return value. For strtol and strtoul, overflow clamps the result to the type's limits and sets ERANGE; similarly, strtod signals range errors with HUGE_VAL and ERANGE. atoi and family lack such diagnostics, making them unsuitable for production code where robustness is needed. scanf detects mismatches by returning fewer assignments than expected, but it leaves invalid input in the stream for further processing.[40]
Locale settings influence numeric conversions through the LC_NUMERIC category, set via setlocale(LC_NUMERIC, "locale_name"), which defines the decimal point character (e.g., "." in "C" locale or "," in many European locales). This affects strtod and %f in scanf, where the locale's radix character separates integer and fractional parts; for example, in a French locale, strtod("3,14", NULL) returns 3.14. Integer functions like strtol remain unaffected, as they do not parse decimals. The "C" or "POSIX" locale ensures portable behavior with a period as the decimal point.[41]
Multibyte and Locale Support
Multibyte Conversion Functions
In the C standard library, multibyte conversion functions enable the handling of international character encodings by converting between sequences of bytes representing multibyte characters and wide characters of type wchar_t, which provide a fixed-size representation for characters beyond the basic execution character set. These functions are essential for portable internationalization, supporting encodings where characters may span multiple bytes, such as UTF-8 or EUC.[42]
The function mblen determines the number of bytes comprising the next multibyte character starting at the pointer s, examining up to n bytes without performing the conversion; if s is a null pointer, it returns a nonzero value if the multibyte encoding is state-dependent or zero otherwise. Similarly, mbtowc converts the multibyte character at s (up to n bytes) to a corresponding wide character stored in *pwc if pwc is not null, returning the number of bytes processed for a valid conversion, zero if the multibyte sequence represents the null wide character, or -1 if invalid (setting errno to EILSEQ). The inverse operation, wctomb, converts a wide character wc to its multibyte representation starting at s (with buffer size up to MB_CUR_MAX bytes), returning the number of bytes written or -1 for an invalid wide character; a call with s as null resets the shift state and tests for state-dependency. For example, in a UTF-8 locale, mbtowc might process two bytes for 'é' (0xC3 0xA9) to yield the wide character value U+00E9.
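As a sketch of how mblen can be used to walk a string one character at a time, the hypothetical helper below counts multibyte characters under whatever LC_CTYPE locale is active.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the multibyte characters in s under the
   current LC_CTYPE locale. Returns (size_t)-1 on an invalid sequence. */
size_t count_mb_chars(const char *s)
{
    size_t count = 0;
    size_t remaining = strlen(s);

    (void)mblen(NULL, 0);              /* reset the internal shift state */

    while (remaining > 0) {
        int len = mblen(s, remaining);
        if (len < 0) return (size_t)-1;   /* invalid or incomplete sequence */
        if (len == 0) break;              /* reached a null character */
        s += len;
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

In the "C" locale every byte of an ASCII string counts as one character. After a successful setlocale(LC_CTYPE, ...) call selecting a UTF-8 locale (the available locale names vary by system), a two-byte sequence such as 0xC3 0xA9 would be counted once.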
Bulk conversions are handled by mbstowcs, which translates a null-terminated multibyte string at s into a wide character array at pwcs, storing at most n wide characters and stopping at the terminating null or on an error; it returns the number of wide characters converted, not counting the terminator, or (size_t)-1 on failure. Conversely, wcstombs performs the reverse, converting a null-terminated wide character string at pwcs to multibyte bytes at s (at most n bytes), returning the bytes written, not counting the terminator, or (size_t)-1 if a wide character cannot be represented. In both cases the terminator is stored only if it fits within the limit, so a return value equal to n means the output is not null-terminated. These functions process entire strings efficiently but rely on the same underlying conversion logic as their single-character counterparts.[1]
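Because mbstowcs stores no terminator when the output limit is reached, callers usually terminate explicitly. A minimal sketch follows; the wrapper to_wide is hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: converts s into dst, which has room for cap wide
   characters (cap must be at least 1), and always terminates the result.
   Returns the number of wide characters stored, excluding the terminator,
   or (size_t)-1 if s contains an invalid multibyte sequence. */
size_t to_wide(wchar_t *dst, size_t cap, const char *s)
{
    /* Reserve one slot so there is always room for the terminator. */
    size_t n = mbstowcs(dst, s, cap - 1);
    if (n == (size_t)-1) return n;

    dst[n] = L'\0';    /* needed when the conversion filled all cap - 1 slots */
    return n;
}
```

With a four-element buffer, converting "hello" in the "C" locale stores three wide characters followed by the terminator, silently truncating the rest; a caller that must detect truncation can compare the result against the source length.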
State management in multibyte conversions is critical for encodings with state-dependent representations, where the interpretation of bytes depends on prior shift sequences (as in ISO-2022 variants), and for any multibyte encoding, such as Shift JIS or UTF-8, when a character may be split across separate calls; the basic functions maintain an opaque internal shift state, reset by null pointer arguments or null characters.[42] Since C95, the type mbstate_t (an implementation-defined opaque object whose zero-initialized value represents the initial conversion state) enables restartable conversions in extended functions such as mbrtowc and wcrtomb in <wchar.h>; C11 adds mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb in <uchar.h>. Passing the state explicitly avoids indeterminate behavior when processing streams incrementally or after interruptions. This prevents issues in stateful encodings by preserving the conversion context between calls, ensuring correct handling of partial characters.[1]
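The following sketch shows one way to carry an mbstate_t across calls so that input can be decoded in arbitrary chunks. The decoder type and its functions are hypothetical names introduced for this example.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical incremental decoder: the caller owns the conversion state,
   so a character split across two chunks is still decoded correctly. */
typedef struct {
    mbstate_t state;
    size_t chars;      /* complete characters decoded so far */
} decoder;

void decoder_init(decoder *d)
{
    memset(d, 0, sizeof *d);   /* zeroed mbstate_t is the initial state */
}

/* Feeds len bytes. Returns 0 on success, -1 on an invalid sequence. */
int decoder_feed(decoder *d, const char *buf, size_t len)
{
    while (len > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, buf, len, &d->state);

        if (r == (size_t)-1) return -1;   /* invalid sequence */
        if (r == (size_t)-2) return 0;    /* incomplete: bytes kept in state */
        if (r == 0) r = 1;                /* null character; assumed to occupy
                                             one byte, as in stateless encodings */
        buf += r;
        len -= r;
        d->chars++;
    }
    return 0;
}
```

When mbrtowc returns (size_t)-2, all supplied bytes have been absorbed into the state, and the next call to decoder_feed continues the same character.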
In C23, additional functions provide specific support for UTF-8 encoding using the char8_t type, defined in <uchar.h>. The mbrtoc8 function converts a multibyte character from the current locale's encoding to UTF-8, inspecting up to n bytes and producing one char8_t code unit per call; it returns the number of bytes processed, or (size_t)-1 on an encoding error. Conversely, c8rtomb accepts UTF-8 code units one at a time and, once a complete sequence has been supplied, writes the corresponding multibyte character in the current locale, returning the bytes written. These functions give programs a standard UTF-8 interface regardless of the locale's own multibyte encoding.[43]
All these functions return -1 to indicate encoding errors (with errno set to EILSEQ) and zero when encountering the null wide character, facilitating error detection and null-termination handling. For compatibility, in the default "C" locale, the functions fall back to single-byte behavior, treating each byte as a distinct character matching the execution character set, with no multi-byte sequences recognized.
Locale-Dependent Behavior
In C string handling, locale-dependent behavior arises primarily through the configuration of locale categories that influence character classification, collation, and related operations. The setlocale function, declared in <locale.h>, allows programs to set or query the current locale for specific categories or the entire environment.[44] When invoked with the LC_CTYPE category, setlocale configures character classification and multibyte character handling, affecting functions that determine properties like alphabetic or digit status based on the active locale's encoding and rules.[44] Similarly, the LC_COLLATE category governs string collation sequences, impacting comparison and sorting operations by defining the order of characters beyond simple byte values.[44]
Character classification macros and functions, such as isalpha, isdigit, isalnum, isupper, and islower from <ctype.h>, test whether a character belongs to specific classes and are directly influenced by the LC_CTYPE category of the current locale.[45] In a given locale, these functions consult predefined tables to classify characters; for example, isalpha(c) returns a non-zero value if c represents an alphabetic character according to the locale's definition, which may include accented letters in single-byte French or German locales but excludes them in the "C" locale. The _l variants, such as isalpha_l, are POSIX extensions that take an explicit locale object for more controlled testing.[45] Multibyte string functions, like those in <wchar.h>, also rely on LC_CTYPE for interpreting shift states and character encodings.[46]
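A short sketch of locale-aware classification over a byte string follows; count_alpha is a hypothetical helper. The cast to unsigned char is needed because the <ctype.h> functions accept only values representable as unsigned char (or EOF), and plain char may be signed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes of s that are alphabetic
   according to the current LC_CTYPE locale. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Passing a negative char value directly would be undefined
           behavior, so convert to unsigned char first. */
        if (isalpha((unsigned char)*s)) {
            n++;
        }
    }
    return n;
}
```

In the "C" locale only A to Z and a to z are counted; under a single-byte locale with accented letters the same call may return a larger value for the same input.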
For wide-character support, the <wctype.h> header provides extensible classification functions, including iswalpha, iswdigit, and iswalnum, which operate on wint_t values and similarly depend on the locale's LC_CTYPE settings.[47] The iswalpha(wc) function returns non-zero if the wide character wc is alphabetic in the current locale, accommodating Unicode ranges in wide-character locales while adhering to the same category rules as narrow-character counterparts.[47] The POSIX _l variants, like iswalpha_l, take an explicit locale object, enhancing flexibility in multithreaded or varied-encoding environments.[47]
The default "C" locale is in effect at program startup and remains so until the program calls setlocale; environment variables such as LC_ALL are consulted only when the program requests the environment's locale with setlocale(LC_ALL, ""). It provides a portable baseline in which alphabetic characters are strictly A–Z and a–z, digits are 0–9, and collation follows the numerical order of the character codes (ASCII order on ASCII-based systems).[46] This locale ensures consistent behavior across systems but limits support for international characters, making it suitable for ASCII-only applications while potentially requiring switches to richer locales for global text handling.[46]
Because setlocale modifies the locale of the entire process, it poses challenges in multithreaded programs: a change made in one thread affects all others, and the call itself is not guaranteed to be thread-safe.[44] ISO C, including C11, does not provide per-thread locales; POSIX.1-2008 introduces them via the uselocale function in <locale.h>, which sets a thread-specific locale object (obtained from newlocale or duplocale) without altering the global state, thereby enabling safe, independent locale configurations across threads.[48] Invoking uselocale((locale_t)0) queries the current thread's locale, and passing LC_GLOBAL_LOCALE reverts the thread to the process-wide setting, supporting concurrent string operations with diverse cultural conventions.[48]
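A minimal sketch of a thread-local locale switch is shown below. It assumes a POSIX.1-2008 system where newlocale, uselocale, and freelocale are declared in <locale.h> (some platforms use a different header); the helper strtod_c_locale is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parses s with strtod under the "C" locale for the
   calling thread only, leaving the process-wide locale untouched.
   Returns 0 on success, -1 if the locale object could not be created. */
int strtod_c_locale(const char *s, double *out)
{
    locale_t c = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c == (locale_t)0) return -1;

    locale_t prev = uselocale(c);   /* switch this thread only */
    *out = strtod(s, NULL);
    uselocale(prev);                /* restore the previous thread locale */

    freelocale(c);
    return 0;
}
```

This pattern is useful when a program runs under a user-selected locale but must parse machine-readable data that always uses a period as the decimal point.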
Implementations of locales, such as in the GNU C Library (glibc), load LC_CTYPE data from compiled locale archives or files, which are built from locale definition sources (e.g., those under /usr/share/i18n/) and contain tables mapping character codes to properties like case conversion and classification bits.[49] These tables are typically binary structures optimized for quick lookup, with LC_COLLATE supplying the collation weights used by strcoll and strxfrm (strcmp always compares raw byte values); upon setlocale calls, the runtime loads and caches the data for the specified category, ensuring efficient access during string operations.[50]
Extensions and Modern Practices
BSD and Secure Extensions
The strlcpy and strlcat functions provide bounded string copying and concatenation operations designed to mitigate buffer overflow vulnerabilities inherent in unbounded string handling.[51] These functions limit the number of bytes written to the destination buffer to a specified size, ensuring the buffer remains null-terminated regardless of whether truncation occurs (provided the size is nonzero).[52] Unlike strncpy, which leaves the destination unterminated if the source does not fit and pads shorter sources with null bytes, and strncat, whose count limits the bytes appended rather than the total buffer size, strlcpy and strlcat take the full size of the destination buffer, always null-terminate within it, and return the total length of the string they tried to create (excluding the null terminator), allowing callers to detect and handle truncation explicitly.[51] For example, strlcpy(dest, src, size) copies at most size - 1 bytes from src to dest and null-terminates it, returning the length of src; a return value greater than or equal to size indicates that the copy was truncated.
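Since strlcpy is not available on every platform, the sketch below illustrates its documented semantics with a hypothetical stand-in named sketch_strlcpy. It is not the OpenBSD source, only an approximation written for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in with strlcpy-like semantics: copies at most
   size - 1 bytes, always terminates when size > 0, and returns the
   length of src so that the caller can detect truncation. */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);

    if (size > 0) {
        size_t n = (srclen < size - 1) ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;     /* a result >= size means the copy was truncated */
}
```

A typical caller compares the result against the buffer size, for example treating sketch_strlcpy(buf, input, sizeof buf) >= sizeof buf as an error rather than silently using the shortened string.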
These functions were developed by Todd C. Miller and Theo de Raadt in 1998 as part of efforts to enhance security in the OpenBSD operating system, first appearing in OpenBSD 2.4.[53] They address portability and consistency issues in string operations across systems, promoting safer alternatives to traditional C library functions.[51]
Although not part of the ISO C standard, strlcpy and strlcat have been adopted in various Unix-like systems, including all major BSD variants (OpenBSD, FreeBSD, NetBSD), Solaris, and macOS.[54] On Linux, they are available through the libbsd compatibility library, which declares them in <bsd/string.h>, or, more recently, natively in glibc 2.38 and later.
Other widely available extensions further improve security in string handling. The strndup function serves as a bounded variant of strdup, allocating memory for and copying at most a specified number of bytes from the source string and always null-terminating the result, which prevents over-reads of unterminated sources in dynamic allocations. Long provided as a library extension and specified by POSIX.1-2008, it was standardized in C23 (ISO/IEC 9899:2024).[55][56] Similarly, explicit_bzero offers a secure memory zeroing operation equivalent to bzero or memset with zero, but it resists compiler optimizations that might eliminate the store as dead code, making it suitable for clearing sensitive data like cryptographic keys.[55] This function originated in OpenBSD 5.5 and has been integrated into FreeBSD, NetBSD, and glibc.[53]
Common Pitfalls and Best Practices
One of the most prevalent vulnerabilities in C string handling is the buffer overflow, where functions like strcpy copy data without checking destination buffer bounds, potentially overwriting adjacent memory and enabling code execution or crashes.[57] A classic example is the use of strcpy, which assumes unlimited destination space, leading to overflows if the source string exceeds the allocated buffer.[58] The 2014 Heartbleed vulnerability in OpenSSL exemplified the closely related buffer over-read: a missing bounds check in the TLS heartbeat extension allowed attackers to disclose up to 64 kilobytes of sensitive memory per request.[59]
Null pointer dereferences represent another critical pitfall, as functions such as strlen invoke undefined behavior when passed a null pointer, potentially causing segmentation faults or erratic program termination without prior validation.[60]
Truncation problems arise with functions like strncpy, which does not append a null terminator when the source length is greater than or equal to the copy count, resulting in unterminated strings that can trigger subsequent overflows or misreads.[61]
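One common way to guard against the missing terminator is to copy one byte less than the buffer size and terminate explicitly. The wrapper copy_terminated below is a hypothetical sketch of that idiom.

```c
#include <string.h>

/* Hypothetical wrapper: copies src into dst (capacity cap, which must be
   at least 1) and guarantees termination. Returns 1 if src fit entirely,
   0 if it was truncated. */
int copy_terminated(char *dst, size_t cap, const char *src)
{
    /* strncpy writes no terminator when strlen(src) >= cap - 1 ... */
    strncpy(dst, src, cap - 1);
    /* ... so the final byte is always set by hand. */
    dst[cap - 1] = '\0';

    return strlen(src) < cap;
}
```

Note that strncpy also pads the remainder of the destination with null bytes when the source is shorter, which can be wasteful for large buffers.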
To mitigate these risks, developers should prefer functions that take the destination size, such as snprintf for formatting and copying, or strncpy followed by explicit null termination, so that writes cannot exceed the buffer. Return values should be checked to detect truncation rather than silently accepting shortened strings. Input validation is essential, including checks for null pointers and length limits before processing strings in complex subsystems. Static analysis tools, like those enforcing CERT C rules (e.g., Rosecheckers), help detect such issues during development by scanning for unbounded operations and unvalidated inputs.
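As an illustration of bounded formatting with truncation detection, the sketch below wraps snprintf in a hypothetical helper named format_pair; the "name=value" layout is an assumption made for the example.

```c
#include <stdio.h>

/* Hypothetical helper: formats "name=value" into dst (capacity cap).
   Returns 0 on success, -1 if the output was truncated or snprintf
   reported an error. */
int format_pair(char *dst, size_t cap, const char *name, int value)
{
    /* snprintf returns the length the full output would have had,
       not counting the terminator. */
    int needed = snprintf(dst, cap, "%s=%d", name, value);

    if (needed < 0) return -1;               /* encoding or output error */
    if ((size_t)needed >= cap) return -1;    /* truncated, but terminated */
    return 0;
}
```

Even in the truncated case the destination holds a valid null-terminated string, so the failure mode is lost data rather than memory corruption.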
Modern guidance includes adopting C11 Annex K's bounds-checked functions, such as strcpy_s, which require explicit buffer size arguments and return error codes on violations, though their optional status and implementation challenges have sparked controversy, with limited adoption and calls for deprecation.[62]
For safety in multithreaded environments, avoid strtok due to its non-thread-safe use of static state, which can corrupt results across concurrent or nested calls; instead, use reentrant alternatives like the POSIX function strtok_r, which keeps its position in a caller-supplied pointer.[63] Similarly, opt for strnlen to safely compute string lengths by capping the search at a specified maximum, avoiding overruns on unterminated buffers.[64] Secure BSD extensions like strlcpy offer consistent null termination and truncation detection as alternatives to traditional functions.[52]
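A minimal sketch of reentrant tokenization with strtok_r follows, assuming a POSIX system. The helper split_csv and the comma delimiter are hypothetical choices for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Hypothetical helper: splits line in place on commas and stores up to
   max token pointers in out. Returns the number of tokens stored.
   The tokenizer position lives in the local variable save, so separate
   threads (or nested loops) do not interfere with each other. */
size_t split_csv(char *line, char *out[], size_t max)
{
    size_t n = 0;
    char *save = NULL;

    for (char *tok = strtok_r(line, ",", &save);
         tok != NULL && n < max;
         tok = strtok_r(NULL, ",", &save)) {
        out[n++] = tok;
    }
    return n;
}
```

Like strtok, strtok_r modifies the input buffer and treats consecutive delimiters as one, so empty fields are skipped; input that must preserve empty fields needs a different approach.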