Hubbry Logo
search
logo

C string handling

logo
Community Hub0 Subscribers
Write something...
Be the first to start a discussion here.
Be the first to start a discussion here.
See all
C string handling

The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. For character strings, the standard library uses the convention that strings are null-terminated: a string of n characters is represented as an array of n + 1 elements, the last of which is a "NUL character" with numeric value 0.

The only support for strings in the programming language proper is that the compiler translates quoted string constants into null-terminated strings.

A string is defined as a contiguous sequence of code units terminated by the first zero code unit (often called the NUL code unit). This means a string cannot contain the zero code unit, as the first one seen marks the end of the string. The length of a string is the number of code units before the zero code unit. The memory occupied by a string is always one more code unit than the length, as space is needed to store the zero terminator.

Generally, the term string means a string where the code unit is of type char, which is exactly 8 bits on all modern machines. C90 defines wide strings which use a code unit of type wchar_t, which is 16 or 32 bits on modern machines. This was intended for Unicode but it is increasingly common to use UTF-8 in normal strings for Unicode instead.

Strings are passed to functions by passing a pointer to the first code unit. Since char* and wchar_t* are different types, the functions that process wide strings are different than the ones processing normal strings and have different names.

String literals ("text" in the C source code) are converted to arrays during compilation. The result is an array of code units containing all the characters plus a trailing zero code unit. In C90 L"text" produces a wide string. A string literal can contain the zero code unit (one way is to put \0 into the source), but this will cause the string to end at that point. The rest of the literal will be placed in memory (with another zero code unit added to the end) but it is impossible to know those code units were translated from the string literal, therefore such source code is not a string literal.

Each string ends at the first occurrence of the zero code unit of the appropriate kind (char or wchar_t). Consequently, a byte string (char*) can contain non-NUL characters in ASCII or any ASCII extension, but not characters in encodings such as UTF-16 (even though a 16-bit code unit might be nonzero, its high or low byte might be zero). The encodings that can be stored in wide strings are defined by the width of wchar_t. In most implementations, wchar_t is at least 16 bits, and so all 16-bit encodings, such as UCS-2, can be stored. If wchar_t is 32-bits, then 32-bit encodings, such as UTF-32, can be stored. (The standard requires a "type that holds any wide character", which on Windows no longer holds true since the UCS-2 to UTF-16 shift. This was recognized as a defect in the standard and fixed in C++.) C++11 and C11 add two types with explicit widths char16_t and char32_t.

Variable-width encodings can be used in both byte strings and wide strings. String length and offsets are measured in bytes or wchar_t, not in "characters", which can be confusing to beginning programmers. UTF-8 and Shift JIS are often used in C byte strings, while UTF-16 is often used in C wide strings when wchar_t is 16 bits. Truncating strings with variable-width characters using functions like strncpy can produce invalid sequences at the end of the string. This can be unsafe if the truncated parts are interpreted by code that assumes the input is valid.

See all
User Avatar
No comments yet.