User-defined function
A user-defined function (UDF) is a function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment. UDFs are usually written to meet the requirements of their creator.
BASIC language
In some old implementations of the BASIC programming language, user-defined functions are defined using the "DEF FN" syntax. More modern dialects of BASIC are influenced by the structured programming paradigm, where most or all of the code is written as user-defined functions or procedures, and the concept becomes practically redundant.
COBOL language
In the COBOL programming language, a user-defined function is an entity that is defined by the user by specifying a FUNCTION-ID paragraph. A user-defined function must return a value by specifying the RETURNING phrase of the procedure division header, and it is invoked using the function-identifier syntax. See the ISO/IEC 1989:2014 Programming Language COBOL standard for details.
As of May 2022, the IBM Enterprise COBOL for z/OS 6.4 (IBM COBOL) compiler contains support for user-defined functions.
Databases
In relational database management systems, a user-defined function provides a mechanism for extending the functionality of the database server by adding a function that can be evaluated in statements of the database's query language (usually SQL). The SQL standard distinguishes between scalar and table functions. A scalar function returns only a single value (or NULL), whereas a table function returns a (relational) table comprising zero or more rows, each row with one or more columns.
User-defined functions in SQL are declared using the CREATE FUNCTION statement. For example, a user-defined function that converts Celsius to Fahrenheit (a temperature scale used in the United States) might be declared like this:
CREATE FUNCTION dbo.CtoF(Celsius FLOAT)
RETURNS FLOAT
RETURN (Celsius * 1.8) + 32
Once created, a user-defined function may be used in expressions in SQL statements. For example, it can be invoked where most other intrinsic functions are allowed. This also includes SELECT statements, where the function can be used against data stored in tables in the database. Conceptually, the function is evaluated once per row in such usage. For example, assume a table named Elements, with a row for each known chemical element. The table has a column named BoilingPoint for the boiling point of that element, in Celsius. The query
SELECT Name, CtoF(BoilingPoint)
FROM Elements
would retrieve the name and the boiling point from each row. It invokes the CtoF user-defined function as declared above in order to convert the value in the column to a value in Fahrenheit.
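The same query pattern can be tried without a database server. The following is a minimal sketch, assuming Python's standard-library sqlite3 module and an in-memory database. SQLite registers UDFs through the host language rather than through CREATE FUNCTION, and the table rows here are illustrative placeholders, not real chemical data.

```python
import sqlite3

def c_to_f(celsius):
    """Convert Celsius to Fahrenheit; NULL (None) passes through unchanged."""
    if celsius is None:
        return None
    return celsius * 1.8 + 32

conn = sqlite3.connect(":memory:")
# Register the Python callable under the SQL name CtoF, taking one argument.
conn.create_function("CtoF", 1, c_to_f)

conn.execute("CREATE TABLE Elements (Name TEXT, BoilingPoint REAL)")
conn.executemany(
    "INSERT INTO Elements VALUES (?, ?)",
    # Placeholder rows for illustration only.
    [("Sample A", 100.0), ("Sample B", -40.0), ("Sample C", None)],
)

# The function is evaluated once per row, as in the query above.
rows = conn.execute(
    "SELECT Name, CtoF(BoilingPoint) FROM Elements ORDER BY rowid"
).fetchall()
for name, fahrenheit in rows:
    print(name, fahrenheit)
```

The NULL check mirrors the statement that a scalar function returns a single value or NULL.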
Each user-defined function carries certain properties or characteristics. The SQL standard defines the following properties:
- Language - defines the programming language in which the user-defined function is implemented; examples include SQL, C, C# and Java.
- Parameter style - defines the conventions that are used to pass the function parameters and results between the implementation of the function and the database system (only applicable if language is not SQL).
- Specific name - a name for the function that is unique within the database. Note that the function name does not have to be unique, considering overloaded functions. Some SQL implementations require that function names are unique within a database, and overloaded functions are not allowed.
- Determinism - specifies whether the function is deterministic or not. The determinism characteristic has an influence on the query optimizer when compiling a SQL statement.
- SQL-data access - tells the database management system whether the function contains no SQL statements (NO SQL), contains SQL statements but does not access any tables or views (CONTAINS SQL), reads data from tables or views (READS SQL DATA), or actually modifies data in the database (MODIFIES SQL DATA).
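The determinism property has a visible effect in some engines. As a hedged sketch (again using Python's sqlite3 module as a stand-in, and assuming Python 3.8 or later with a SQLite build that uses its default settings), marking a function as deterministic lets SQLite accept it in an index expression, which it refuses for non-deterministic functions. The table and index names are hypothetical.

```python
import sqlite3

def c_to_f(celsius):
    # Pure function: the same input always gives the same output.
    return None if celsius is None else celsius * 1.8 + 32

conn = sqlite3.connect(":memory:")
# deterministic=True declares that the result depends only on the arguments.
conn.create_function("CtoF", 1, c_to_f, deterministic=True)

conn.execute("CREATE TABLE Readings (Celsius REAL)")
# SQLite only allows deterministic functions in index expressions,
# so this statement relies on the flag set above.
conn.execute("CREATE INDEX idx_fahrenheit ON Readings (CtoF(Celsius))")
conn.executemany("INSERT INTO Readings VALUES (?)", [(0.0,), (100.0,)])

hot = conn.execute(
    "SELECT Celsius FROM Readings WHERE CtoF(Celsius) > 200"
).fetchall()
print(hot)
```

Other systems expose the same idea through keywords in the function declaration instead of a registration flag.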
User-defined functions should not be confused with stored procedures. Stored procedures allow the user to group a set of SQL commands. A procedure can accept parameters and execute its SQL statements depending on those parameters. A procedure is not an expression and, thus, cannot be used like user-defined functions.
Some database management systems allow the creation of user-defined functions in languages other than SQL. Microsoft SQL Server, for example, allows the user to use .NET languages including C# for this purpose. DB2 and Oracle support user-defined functions written in the C or Java programming languages.
SQL Server 2000
There are three types of UDF in Microsoft SQL Server 2000: scalar functions, inline table-valued functions, and multistatement table-valued functions.
Scalar functions return a single data value (not a table) of the type given in the RETURNS clause. Scalar functions can use all scalar data types, with the exception of timestamp and user-defined data types. Inline table-valued functions return the result set of a single SELECT statement. Multistatement table-valued functions return a table built by multiple Transact-SQL statements.
User-defined functions can be invoked from a query like built‑in functions such as OBJECT_ID, LEN, DATEDIFF, or can be executed through an EXECUTE statement like stored procedures.
Performance Notes:
- On Microsoft SQL Server 2000 a table-valued function which "wraps" a View may be much faster than the View itself. The following MyFunction is an example of a "function-wrapper" which runs faster than the underlying view MyView:
CREATE FUNCTION MyFunction()
RETURNS @Tbl TABLE (
    StudentID VARCHAR(255),
    SAS_StudentInstancesID INT,
    Label VARCHAR(255),
    Value MONEY,
    CMN_PersonsID INT
)
AS
BEGIN
    INSERT @Tbl (StudentID, SAS_StudentInstancesID, Label, Value, CMN_PersonsID)
    SELECT StudentID, SAS_StudentInstancesID, Label, Value, CMN_PersonsID
    FROM MyView -- where MyView selects (with joins) the same columns from large table(s)
    RETURN
END
- On Microsoft SQL Server 2005 the result of running the same code is the opposite: the view executes faster than the "function-wrapper".
User-defined functions are subroutines made of one or more Transact-SQL statements that can be used to encapsulate code for reuse. A function takes zero or more arguments and evaluates a return value. Like a stored procedure, its body can contain both control-flow and DML statements. It cannot change global session state, such as by modifying the database or an external resource like a file or the network, and it does not support output parameters. The DEFAULT keyword must be specified to pass the default value of a parameter. An error in a UDF causes the UDF to abort, which in turn aborts the statement that invoked it.
CREATE FUNCTION CubicVolume
-- Input dimensions in centimeters
(
@CubeLength decimal(4,1),
@CubeWidth decimal(4,1),
@CubeHeight decimal(4,1)
)
RETURNS decimal(12,3)
AS
BEGIN
RETURN(@CubeLength * @CubeWidth * @CubeHeight)
END
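The rule that an error inside a UDF aborts the invoking statement can be observed in other engines too. The sketch below is an illustration only: it uses Python's sqlite3 module rather than SQL Server, runs in autocommit mode, and the function and table names (StrictSqrt, Inputs, Roots) are hypothetical.

```python
import sqlite3

def strict_sqrt(x):
    # Raising inside the UDF signals an error to the database engine.
    if x < 0:
        raise ValueError("negative input")
    return x ** 0.5

# isolation_level=None selects autocommit mode, so each statement
# runs in its own implicit transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.create_function("StrictSqrt", 1, strict_sqrt)
conn.execute("CREATE TABLE Inputs (x REAL)")
conn.execute("CREATE TABLE Roots (r REAL)")
conn.executemany("INSERT INTO Inputs VALUES (?)", [(4.0,), (9.0,), (-1.0,)])

failed = False
try:
    # The negative row makes the UDF fail, which aborts the whole INSERT.
    conn.execute("INSERT INTO Roots SELECT StrictSqrt(x) FROM Inputs")
except sqlite3.OperationalError as exc:
    failed = True
    print("statement aborted:", exc)

# No rows from the failed statement remain in the target table.
row_count = conn.execute("SELECT COUNT(*) FROM Roots").fetchone()[0]
print(row_count)
```

The two valid inputs are not kept either, because the statement fails as a unit.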
Apache Hive
In addition to regular user-defined functions (UDFs), Apache Hive defines user-defined aggregate functions (UDAFs) and table-generating functions (UDTFs).[1] Hive enables developers to create their own custom functions in Java.[2]
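The three kinds differ in how many rows go in and come out. The following sketch shows that shape only; it is not Hive's Java API. It assumes Python's sqlite3 module for the scalar and aggregate cases, and falls back to a plain Python generator for the table-generating case because this sketch does not rely on any table-function hook in that module. All names are hypothetical.

```python
import sqlite3

def shout(text):
    # Scalar function: one input row -> one output value.
    return text.upper()

class Longest:
    # Aggregate function: many input rows -> one output value.
    def __init__(self):
        self.best = ""

    def step(self, text):
        if len(text) > len(self.best):
            self.best = text

    def finalize(self):
        return self.best

def explode_words(sentence):
    # Table-generating function: one input row -> zero or more output rows.
    for word in sentence.split():
        yield (word,)

conn = sqlite3.connect(":memory:")
conn.create_function("shout", 1, shout)
conn.create_aggregate("longest", 1, Longest)
conn.execute("CREATE TABLE Lines (body TEXT)")
conn.executemany(
    "INSERT INTO Lines VALUES (?)", [("big data",), ("hive query language",)]
)

scalar_rows = conn.execute(
    "SELECT shout(body) FROM Lines ORDER BY rowid"
).fetchall()
aggregate_row = conn.execute("SELECT longest(body) FROM Lines").fetchone()
generated_rows = [
    row
    for (body,) in conn.execute("SELECT body FROM Lines ORDER BY rowid")
    for row in explode_words(body)
]
print(scalar_rows, aggregate_row, generated_rows)
```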
Apache Doris
Apache Doris, an open-source real-time analytical database, allows users to contribute their own UDFs written in C++.[3]
References
- "LanguageManual UDF - Apache Hive - Apache Software Foundation". 26 June 2015.
- "HivePlugins - Apache Hive - Apache Software Foundation". 26 June 2015.
- "Apache Doris UDF". Archived from the original on 10 April 2023. Retrieved 8 April 2023.
main function.[4] Similarly, Python defines UDFs using the def keyword, where the interpreter treats them as callable objects in the symbol table, supporting features like default arguments and docstrings for documentation.[1] This approach contrasts with built-in functions (e.g., print in Python or printf in C), as UDFs are entirely user-crafted to address domain-specific needs.[1][4]
In database contexts, such as SQL Server or BigQuery, UDFs integrate custom logic directly into queries, facilitating operations like data transformation or aggregation that exceed standard SQL expressions.[2][6] Common types include scalar UDFs, which return a single value (e.g., a computed metric), and table-valued UDFs, which produce a result set akin to a virtual table; these can be implemented in Transact-SQL, JavaScript, or other languages depending on the system.[2][7] The primary advantages of UDFs across contexts include promoting modular programming by allowing functions to be created once, stored, and invoked repeatedly; improving maintainability through isolated modifications; and optimizing performance, such as by caching execution plans or minimizing data transfer in queries.[2][8]
General Concepts
Definition
A user-defined function (UDF), also known as a custom function, is a reusable block of code written by a programmer or user to execute a specific task, thereby extending the functionality of a programming language or system beyond its built-in functions.[9][4] UDFs promote modularity and code reusability by encapsulating logic that can be invoked multiple times with varying inputs, reducing redundancy in larger programs.[10]
Key attributes of UDFs include support for custom parameters (inputs passed to the function), return values (outputs produced by the function), scope (determining visibility and accessibility, such as local scope within a specific block or global scope across the program), and invocation syntax (the mechanism to call the function from other parts of the code).[11][12] For example, in pseudocode, a basic UDF declaration might appear as follows, illustrating parameter input and a return value:
Function add_numbers
Pass In: integer first_number, integer second_number
Set sum to first_number plus second_number
Pass Out: sum
Endfunction
This function can then be invoked elsewhere in the program with Call: add_numbers(5, 3), yielding a return value of 8.[12]
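For comparison, the pseudocode maps directly onto a concrete language. The following is one possible rendering in Python; the name total replaces sum only to avoid shadowing a Python built-in.

```python
def add_numbers(first_number: int, second_number: int) -> int:
    """Return the sum of two integers, mirroring the pseudocode."""
    total = first_number + second_number
    return total

# Invocation corresponding to "Call: add_numbers(5, 3)".
result = add_numbers(5, 3)
print(result)  # 8
```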
UDFs are typically distinguished from procedures or subroutines in that they return a value to the calling code, whereas procedures primarily perform actions without producing an output value.[13][14] This return mechanism enables UDFs to compute and pass results back for further use, aligning with their role in functional programming paradigms across various computing contexts.
Types and Characteristics
Classifications of user-defined functions (UDFs) vary depending on the context, such as general programming languages versus database systems. In programming languages like C, UDFs are commonly categorized based on the number of arguments and whether they return a value. These include: functions with no arguments and no return value (e.g., a simple print function); functions with arguments but no return value; functions with no arguments but a return value; and functions with both arguments and a return value.[15] In database systems, additional specialized types exist, such as scalar functions (returning a single value) and table-valued functions (returning a dataset), with details covered in later sections.[2]
Beyond classification, UDFs exhibit several core characteristics that enhance their utility in software systems. Modularity is a primary attribute, as UDFs allow developers to encapsulate reusable blocks of code, reducing redundancy and improving maintainability across applications.[16] Encapsulation further strengthens this by abstracting intricate internal logic behind a clean, parameterized interface, promoting separation of concerns.[16] Many UDF implementations support recursion, where the function invokes itself to solve a problem by reducing it to smaller subproblems, though recursion depth is often capped to avoid excessive resource consumption.[17] Additionally, UDFs integrate seamlessly with control structures, including conditional statements and iterative loops, enabling the embedding of algorithmic flows directly within the function.[18]
In practice, UDFs serve critical use cases centered on data processing and extension of system capabilities. They are frequently applied in data transformation tasks, such as normalizing or formatting inputs for analysis, and in performing specialized calculations that tailor computations to domain-specific needs.[16] Custom algorithms, including those for business rule enforcement or predictive modeling steps, also leverage UDFs to extend native functionality without altering core system code.[2]
History
Origins in Programming Languages
The concept of user-defined functions traces its roots to subroutines in low-level programming during the 1940s and 1950s, where they emerged as a means to promote code reuse and modularity in assembly languages. Early subroutines were often implemented as sequences of instructions linked via jump vectors or transfer mechanisms, allowing programmers to invoke reusable code blocks without duplicating instructions. A pivotal advancement occurred in 1947 when David J. Wheeler at the University of Cambridge devised the closed subroutine technique, which used a linkage mechanism to store return addresses and parameters, facilitating more efficient control flow; this innovation underpinned the subroutine support in the EDSAC computer, operational by 1949.[19] These assembly-level subroutines laid the groundwork for higher-level abstractions by addressing the challenges of repetitive coding in machine-oriented environments.
The transition to high-level languages brought user-defined subprograms into mainstream use, with Fortran II in 1958 marking a significant milestone by introducing the SUBROUTINE and FUNCTION statements as precursors to modern user-defined functions. Developed under John Backus at IBM, these features allowed programmers to define modular procedures that could be compiled independently, return values, and handle parameters, thereby enabling procedural programming and reducing the complexity of large-scale scientific computations on machines like the IBM 704.[20][21] Similarly, ALGOL 58, formalized in 1958 by an international committee, incorporated procedures as a core construct, supporting block structures, parameter passing by value or name, and recursion, which facilitated top-down design and influenced the principles of structured programming by emphasizing hierarchical decomposition over unstructured jumps.[22]
Further formalization occurred with PL/I in 1964, where IBM's design integrated robust function definitions drawing from Fortran's numerical focus, ALGOL's structural elegance, and COBOL's data handling, allowing user-defined procedures with advanced scoping, recursion, and multitasking support to unify scientific and business programming paradigms.[23] Concurrently, theoretical influences from Alonzo Church's lambda calculus of the 1930s found practical expression in the 1960s through John McCarthy's Lisp, which adopted lambda abstractions to define anonymous functions and enable higher-order operations on symbolic expressions, bridging mathematical foundations with computable function definitions.[24] These developments collectively established user-defined functions as essential for modular, maintainable code, paving the way for structured programming methodologies that prioritized clarity and verifiability.[25]
Evolution in Database Systems
The evolution of user-defined functions (UDFs) in database systems began in the late 1980s and early 1990s as relational databases sought to extend SQL's declarative nature with procedural capabilities for greater extensibility and custom logic integration. For example, IBM DB2 introduced support for user-defined functions in Version 2, released in 1993, allowing custom routines in C or other languages to extend SQL capabilities. Oracle introduced PL/SQL, its procedural extension to SQL, in 1992 with Oracle Database 7, allowing developers to create UDFs that could be invoked within SQL statements to perform complex computations and enhance query functionality.[26] Similarly, in the open-source domain, PostgreSQL added support for procedural languages, including PL/pgSQL, starting with version 6.4 in 1998, which enabled the definition of UDFs to encapsulate reusable code for data manipulation and business rules directly in the database.
The SQL standard played a pivotal role in formalizing UDFs across implementations. The ISO/IEC 9075-4:1999 standard, known as SQL/PSM (Persistent Stored Modules), introduced a procedural language for defining UDFs, stored procedures, and triggers, promoting portability and standardization for database extensibility.[27] Microsoft SQL Server advanced this trend with SQL Server 2000, which introduced scalar and table-valued UDFs in Transact-SQL, including multi-statement table functions for more sophisticated data returns, and improved their integration with queries.[28] Further innovation came in 2005 with SQL Server's integration of the Common Language Runtime (CLR), enabling UDFs written in .NET languages like C# for leveraging external libraries and handling tasks beyond native T-SQL capabilities, such as advanced string processing or mathematical operations.[29]
As data volumes exploded in the 2000s, UDFs evolved to support distributed and big data environments, particularly with the rise of Apache Hadoop in 2006, which emphasized scalable processing over traditional relational models. Apache Hive, a data warehousing layer atop Hadoop initially developed by Facebook starting in 2007 and accepted into the Apache Software Foundation in 2008 with its initial release (0.1.0) in October 2008, incorporated UDFs as a core feature to extend HiveQL, allowing custom Java-based functions to process massive datasets via MapReduce jobs without rewriting core engine logic. Its first major stable release (version 1.0) occurred in 2015.[30] This shift marked UDFs' adaptation to NoSQL and distributed systems, where they facilitated extensibility in handling unstructured data and parallel computations, building on earlier programming language concepts but tailored for fault-tolerant, cluster-based architectures.
In Programming Languages
BASIC
In the BASIC programming language, user-defined functions were introduced in the original Dartmouth BASIC implementation of 1964 to facilitate teaching programming to non-experts by allowing simple, reusable computations within educational programs.[31] This feature enabled students to define custom functions alongside the language's built-in ones, promoting modular code in a time-sharing environment on systems like the GE-225.[31]
The syntax for defining such functions in early dialects, such as Dartmouth BASIC and later GW-BASIC from the 1980s, uses the DEF FN statement followed by a function name (e.g., FNA to FNZ), an optional single argument in parentheses, and an equals sign leading to an expression.[31][32] For instance, in GW-BASIC, the form is DEF FNname[(argument)] = expression, where the function name must begin with "FN" and adhere to variable naming rules, and the expression computes the return value.[32] These functions are invoked by referencing the FN name with arguments, such as FNSQR(16), and must be defined before use in the program to avoid errors like "Undefined user function."[32]
User-defined functions in these BASIC variants are limited to returning scalar values—either a single numeric or string result—without support for complex data types like arrays or objects, reflecting the language's focus on simplicity for beginners.[32] The argument, if present, is a simple variable replaced upon invocation, and the expression can reference program variables but must incorporate the argument unless the function is constant.[31] Up to 26 such functions could be defined, named FNA through FNZ, emphasizing concise, single-line definitions suitable for interpretive execution.[31]
This mechanism evolved significantly in Microsoft Visual Basic, released in 1991, which introduced structured Function and Sub procedures to replace the single-line DEF FN approach, enabling multi-line code blocks with explicit return statements for greater flexibility in event-driven applications.[33]
An example of a user-defined function in GW-BASIC to calculate the area of a circle uses the following code:
10 DEF FNAREA(R) = 3.14159 * R * R
20 PRINT FNAREA(5)
This defines FNAREA to return the area based on radius R and outputs approximately 78.54 when executed.[32]
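For readers more familiar with modern languages, the closest analogue is a function whose body is a single expression that may also read program-level variables. The sketch below is a rough Python counterpart, not a translation of BASIC semantics; the names PI_APPROX and fn_area are chosen for illustration.

```python
PI_APPROX = 3.14159  # program-level variable, visible inside the function

def fn_area(r):
    # Single expression, like the right-hand side of DEF FNAREA(R) = ...
    return PI_APPROX * r * r

area = fn_area(5)
print(round(area, 2))  # 78.54
```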
COBOL
In COBOL, modularity for reusable code blocks in business-oriented tasks such as data processing and calculations has traditionally been achieved through subprograms invoked via the CALL statement and modular procedures managed with the PERFORM statement.[34][35] These constructs promote maintainability in large-scale enterprise applications, where COBOL's verbose, English-like syntax facilitates handling financial and administrative data in legacy systems.[35]
COBOL, developed in 1959 under the guidance of the Conference on Data Systems Languages (CODASYL), initially supported subprograms for modularity through the CALL statement in its 1960 specifications, allowing separate compilation units to handle specific logic while sharing data via parameters.[36] This evolved with the ANSI COBOL-85 standard and IBM's VS COBOL II release in 1985, which introduced nested subprograms and inline PERFORM statements to integrate procedure-like routines more seamlessly within the main program flow.[37]
The syntax for invoking these elements relies on paragraphs or sections within the PROCEDURE DIVISION, executed via PERFORM statements for intra-program modularity, or CALL statements for external subprograms. For example, a PERFORM might reference a named paragraph like PERFORM PAYROLL-CALCULATION. to execute a block of statements, while later COBOL-85 dialects support inline PERFORM for direct embedding, such as PERFORM VARYING I FROM 1 BY 1 UNTIL I > 10 ADD 1 TO COUNTER END-PERFORM.[38] Subprograms, in contrast, use CALL 'SUBPROG-NAME' USING ARG1 ARG2. to transfer control, with arguments passed positionally.[34]
A key feature is the emphasis on the DATA DIVISION's LINKAGE SECTION to define parameters for subprograms, ensuring type-safe data exchange between calling and called units without global variables. This supports business logic for arithmetic operations (e.g., additions, multiplications for totals) and string manipulations (e.g., formatting reports or validating identifiers), aligning with COBOL's record-oriented design for handling structured data like employee records.[39] Parameters are typically passed by reference (default), allowing modifications to reflect back to the caller, which is ideal for updating fields in shared records during computations.[40]
True user-defined functions (UDFs), which can be invoked in expressions and return a single value directly, were introduced in the ISO/IEC 1989:2002 standard to extend COBOL's capabilities for modern applications.[41] These are defined using a FUNCTION-ID paragraph, either in a separate compilation unit or nested, and are invoked within expressions using function-identifier syntax, such as COMPUTE result = MY-FUNCTION(arg1). UDFs support return types like numeric or alphanumeric and enhance interoperability, though adoption remains limited in legacy environments as of 2025.
The following example illustrates a traditional UDF-like subprogram for payroll calculation, where a main program calls a separate routine to compute net pay from gross earnings and deductions. The subprogram uses the LINKAGE SECTION for input/output parameters and returns control via EXIT PROGRAM.
Main Program (PAYROLL-DRIVER.cbl):
IDENTIFICATION DIVISION.
PROGRAM-ID. PAYROLL-DRIVER.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 GROSS-PAY PIC 9(5)V99 VALUE 5000.00.
01 DEDUCTIONS PIC 9(5)V99 VALUE 1000.00.
01 NET-PAY PIC 9(5)V99.
PROCEDURE DIVISION.
MAIN-PARA.
CALL 'PAYROLL-CALC' USING GROSS-PAY, DEDUCTIONS, NET-PAY.
DISPLAY 'Net Pay: ' NET-PAY.
STOP RUN.
Subprogram (PAYROLL-CALC.cbl):
IDENTIFICATION DIVISION.
PROGRAM-ID. PAYROLL-CALC.
DATA DIVISION.
LINKAGE SECTION.
01 LINK-GROSS PIC 9(5)V99.
01 LINK-DEDUCT PIC 9(5)V99.
01 LINK-NET PIC 9(5)V99.
PROCEDURE DIVISION USING LINK-GROSS, LINK-DEDUCT, LINK-NET.
CALC-PARA.
SUBTRACT LINK-DEDUCT FROM LINK-GROSS GIVING LINK-NET.
EXIT PROGRAM.
This routine demonstrates parameter passing for a simple subtraction-based net pay calculation, extensible for more complex business rules like tax computations.[42][39]
Fortran and Procedural Languages
In Fortran, user-defined functions are defined using the FUNCTION statement to specify the function name, formal argument list, and return type, with the function returning a value through assignment to its own name before a RETURN or END statement. Subroutines, which perform actions without returning a single value, are defined using the SUBROUTINE statement and invoked via a CALL statement. For example, a user-defined function to compute the average of three real numbers can be written as follows in Fortran 77 syntax:
REAL FUNCTION AVERAGE(X, Y, Z)
REAL X, Y, Z, SUM
SUM = X + Y + Z
AVERAGE = SUM / 3.0
RETURN
END
This function is called in an expression, such as result = AVERAGE(a, b, c), where a, b, and c are actual arguments passed by position. Parameters in Fortran functions and subroutines are passed by reference by default, meaning the memory address of the argument is provided, allowing modifications to affect the caller's variables.[43][44]
The Fortran 77 standard, approved by ANSI in 1978, standardized the core syntax for user-defined functions and subroutines, including block-structured control flow enhancements like the IF-THEN-ELSE construct, which improved modularity for numerical algorithms in scientific computing. This standard emphasized Fortran's role in procedural programming for high-performance computations, such as simulations in physics and engineering. The subsequent Fortran 90 standard, published by ISO in 1991, extended these capabilities by introducing the RECURSIVE keyword for functions and subroutines that can call themselves, requiring a RESULT variable to distinguish the return value from recursive invocations, as in:
RECURSIVE FUNCTION FACTORIAL(N) RESULT(RES)
INTEGER, INTENT(IN) :: N
INTEGER :: RES
IF (N <= 1) THEN
RES = 1
ELSE
RES = N * FACTORIAL(N - 1)
END IF
END FUNCTION FACTORIAL
Fortran 90 also added modules to encapsulate functions, subroutines, and data, along with generic interfaces for procedure overloading based on argument types, facilitating reusable code in large-scale numerical applications.[45][46][47]
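Both ideas, recursion and type-based generic interfaces, have loose counterparts in other languages. The sketch below is an analogy in Python and not a description of how Fortran resolves generics: it assumes functools.singledispatch as a stand-in for a generic interface, and the function describe is hypothetical.

```python
from functools import singledispatch

def factorial(n: int) -> int:
    """Recursive factorial; n <= 1 is the base case, as in the Fortran version."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

@singledispatch
def describe(value):
    # Fallback used when no type-specific implementation is registered.
    return "unsupported"

@describe.register
def _(value: int):
    return f"integer {value}"

@describe.register
def _(value: float):
    return f"real {value:.1f}"

print(factorial(5))   # 120
print(describe(3))    # integer 3
print(describe(2.5))  # real 2.5
```

Python selects the implementation at run time from the first argument's type, whereas a Fortran generic interface is resolved at compile time.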
In procedural contexts, Fortran's user-defined functions prioritize numerical computations, enabling efficient handling of arrays and mathematical operations central to scientific workflows, such as solving differential equations or matrix manipulations. This focus on modularity and performance influenced procedural paradigms in other languages; for instance, in C, user-defined functions require prototypes—declarations specifying the return type, function name, and parameter types—to enable type checking before definition, with arguments passed by value unless pointers simulate reference passing. Fortran's approach also impacted Pascal, released in 1970, which adopted similar procedural elements, including the convention of using the function name as a local variable for the return value within the function body.[48][49]
In Database Systems
SQL Standard
The SQL:1999 standard, formally known as ISO/IEC 9075:1999, introduced user-defined functions (UDFs) as part of its enhancements to the SQL language, enabling users to define custom functions that extend the core query capabilities of relational databases.[50] The CREATE FUNCTION statement serves as the primary mechanism for defining UDFs, supporting scalar functions that return a single value, table-valued functions that return a result set resembling a table, and aggregate functions that perform computations over groups of rows.[51] The basic syntax includes the function name, a parameter list enclosed in parentheses (where parameters can be specified as IN, OUT, or INOUT with their data types), and a RETURNS clause specifying the return type, such as a scalar data type for scalar UDFs, TABLE for table-valued UDFs, or an aggregate return specification for aggregate UDFs.[52] The function body, enclosed in BEGIN...END, contains SQL statements or procedural logic to implement the desired behavior.[53]
These UDFs integrate seamlessly into SQL queries, allowing invocation within SELECT statements, WHERE clauses, or other expressions as if they were built-in functions; for instance, a scalar UDF can be called in a projection list like SELECT my_udf(column1) FROM table1, while a table-valued UDF can appear in the FROM clause as a derived table.[51] Later revisions, such as SQL:2003, refined these features for better portability and performance in distributed environments.[52]
Key concepts in the standard include function overloading, where multiple UDFs with the same name but differing parameter types or counts can coexist within the same schema, resolved at invocation based on argument signatures.[53] Schema binding ensures that UDFs are tied to their defining schema, preventing accidental modifications to underlying objects and promoting referential integrity across database sessions.[52] Temporary UDFs, declared within a session or module, provide scoped functionality that exists only for the duration of the connection, declared using LOCAL TEMPORARY modifiers to avoid namespace pollution.[53]
The SQL/PSM (Persistent Stored Modules) extension, standardized in ISO/IEC 9075-4 as part of SQL:1999, specifically supports procedural UDFs by providing control structures like loops, conditionals, and exception handling within function bodies, allowing complex logic beyond simple SQL expressions.[53] This enables the creation of robust, reusable procedural routines that can modify data (with appropriate MODIFIES SQL DATA clauses) or perform read-only operations, integrated directly into the SQL environment.[51]
Microsoft SQL Server
In Microsoft SQL Server, user-defined functions (UDFs) are implemented primarily using Transact-SQL (T-SQL) to create modular routines that accept parameters, perform computations, and return either a single scalar value or a table of results. These functions enhance query reusability and encapsulation, aligning with the SQL standard while extending capabilities through vendor-specific features. Scalar UDFs return a single value, such as a calculated decimal, and can be either inline (single SELECT statement) or multi-statement, whereas table-valued UDFs return a result set, further divided into inline (optimized for single-query performance) and multi-statement types. Introduced in SQL Server 2000, these T-SQL UDFs support schema binding for deterministic functions to enable indexing on computed columns.[2][54] The basic syntax for creating a scalar UDF in T-SQL follows this structure:
CREATE FUNCTION [schema_name.]function_name
(@parameter_name parameter_data_type)
RETURNS return_data_type
[WITH <function_option> [ ,...n ] ]
AS
BEGIN
-- Function body with logic
RETURN expression
END
For instance, a simple tax calculation function might be defined as CREATE FUNCTION dbo.CalcTax(@amount DECIMAL(10,2)) RETURNS DECIMAL(10,2) AS BEGIN RETURN @amount * 0.08; END, demonstrating parameter input and scalar output. Table-valued functions use RETURNS TABLE with an inline SELECT or a multi-statement block to populate a table variable. These T-SQL implementations benefit from query plan caching to reduce compilation overhead but historically limited parallelism in scalar UDFs due to serial execution.[54][17]
Since SQL Server 2005, Common Language Runtime (CLR) integration has allowed UDFs to be written in .NET languages like C# for complex computations, string manipulation, or aggregation tasks where T-SQL performance is suboptimal. CLR scalar and table-valued functions are created using CREATE FUNCTION with an EXTERNAL NAME clause referencing a compiled assembly method, enabling access to richer libraries while maintaining database-level execution. This feature addresses T-SQL limitations in computational intensity, though it requires enabling CLR hosting via sp_configure and assembly deployment.[55][56]
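As a rough sketch of the registration steps just described, the statements below enable CLR integration, load an assembly, and bind a T-SQL function name to one of its methods. The assembly name, file path, namespace, class, and method are all placeholders assumed for illustration; the .NET method itself must already be compiled into the DLL.

```sql
-- Enable CLR integration for the instance (off by default).
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Load the compiled .NET assembly. Name and path are hypothetical;
-- newer SQL Server versions may additionally require the assembly
-- to be signed or explicitly trusted.
CREATE ASSEMBLY StringUtilities
FROM 'C:\path\to\StringUtilities.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Bind a scalar function to a static method in the assembly.
-- Format: assembly_name.[namespace.class_name].method_name
CREATE FUNCTION dbo.RegexIsMatch
(
    @input   NVARCHAR(4000),
    @pattern NVARCHAR(4000)
)
RETURNS BIT
AS EXTERNAL NAME StringUtilities.[Example.StringFunctions].RegexIsMatch;
GO
```

Once created, the function is called like any T-SQL scalar UDF, for example SELECT dbo.RegexIsMatch(Email, N'@example\.com$') against a hypothetical Email column.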
Subsequent updates have expanded UDF versatility. Starting with SQL Server 2016, built-in JSON features such as the JSON_VALUE function and the FOR JSON clause can be used in UDF bodies to parse, query, or generate JSON data stored as NVARCHAR, facilitating hybrid relational-NoSQL workloads without external processing. In SQL Server 2019 and later, scalar UDF inlining was introduced as part of Intelligent Query Processing to mitigate longstanding performance bottlenecks. It automatically transforms eligible T-SQL scalar UDFs into equivalent relational expressions during query optimization, which allows parallel plans and removes per-row invocation overhead, potentially cutting execution times dramatically on large datasets. However, not every UDF is eligible: those with side effects or non-deterministic elements are typically excluded, and if inlining causes problems it can be disabled explicitly through the function definition, database options, or query hints.[57][58]
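The following sketch combines both features: a scalar UDF that reads a JSON property, and the switches that control inlining in SQL Server 2019 and later. The function name, the JSON path, and the dbo.Customers table with its Profile column are assumptions made for the example.

```sql
-- Scalar UDF that extracts one property from a JSON document.
-- INLINE = OFF opts this one function out of scalar UDF inlining.
CREATE FUNCTION dbo.GetCustomerCity (@profile NVARCHAR(MAX))
RETURNS NVARCHAR(100)
WITH INLINE = OFF
AS
BEGIN
    RETURN JSON_VALUE(@profile, '$.address.city');
END;
GO

-- Check whether the optimizer considers a function inlineable.
SELECT OBJECT_NAME(object_id) AS function_name, is_inlineable
FROM sys.sql_modules
WHERE object_id = OBJECT_ID('dbo.GetCustomerCity');

-- Disable inlining for the whole database ...
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = OFF;

-- ... or for a single query (dbo.Customers is hypothetical).
SELECT CustomerID, dbo.GetCustomerCity(Profile) AS City
FROM dbo.Customers
OPTION (USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'));
```

Scoping the opt-out to one function or one query keeps inlining available for the rest of the workload, which is usually preferable to the database-wide setting.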
Apache Hive
User-defined functions (UDFs) in Apache Hive enable users to extend the SQL-like query language (HiveQL) with custom logic for processing large-scale data on Hadoop clusters, addressing limitations in built-in functions for complex analytics. Introduced in Hive version 0.4 in 2010, UDFs allow developers to implement bespoke operations directly within queries, facilitating schema-on-read processing of unstructured or semi-structured data.[59][60]
Implementation of UDFs in Hive is primarily Java-based: developers extend the org.apache.hadoop.hive.ql.exec.UDF class for basic scalar functions, or the org.apache.hadoop.hive.ql.udf.generic.GenericUDF class, introduced in Hive 0.10, for handling complex types and more flexible argument processing. The GenericUDF approach supports type-safe evaluation and is recommended for modern implementations because it handles Hive's complex and nested types more robustly. To register a UDF, the syntax is CREATE [TEMPORARY] FUNCTION function_name AS 'fully_qualified_class_name' [USING JAR 'jar_path'];, followed by invocation in a SELECT statement like SELECT function_name(column) FROM table;. This mechanism integrates the compiled JAR into the Hive execution environment, enabling distributed computation across the cluster.[59][61][62]
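A concrete registration might look like the sketch below. The class name com.example.hive.ToUpper, the JAR paths, the analytics database, and the employees table are all hypothetical; the class is assumed to be already compiled and packaged.

```sql
-- Session-scoped registration: add the JAR to the session,
-- then create a temporary function bound to the UDF class.
ADD JAR /path/to/example-udfs.jar;
CREATE TEMPORARY FUNCTION to_upper_tmp AS 'com.example.hive.ToUpper';

-- Permanent registration: the function is stored in the metastore
-- and the JAR is loaded from the given location when first used.
CREATE FUNCTION analytics.to_upper AS 'com.example.hive.ToUpper'
USING JAR 'hdfs:///path/to/example-udfs.jar';

-- Invocation is the same as for a built-in function.
SELECT to_upper_tmp(name) FROM employees;

-- Remove the session-scoped function when it is no longer needed.
DROP TEMPORARY FUNCTION IF EXISTS to_upper_tmp;
```

Temporary functions disappear when the session ends, which suits experimentation; permanent functions are the better fit when several users or scheduled jobs share the same logic.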
Hive also supports User-Defined Table-Generating Functions (UDTFs), which extend row generation capabilities by producing multiple output rows from a single input row, useful for operations like array explosions or custom partitioning in big data workflows; these are implemented by extending org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and registered similarly to UDFs. For non-Java languages, Hive provides limited Python support through the TRANSFORM clause for streaming scripts or via integration with Hive on Spark, where PySpark UDFs can operate on Hive tables for enhanced scripting flexibility.[63][64][65]
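Both mechanisms can be illustrated with a short sketch. The first query uses the built-in explode UDTF, since a custom GenericUDTF is invoked the same way; the second streams rows through a script. The tables orders and comments, their columns, and the script normalize.py are assumptions for the example.

```sql
-- UDTF usage: each element of the array column `items`
-- becomes its own output row alongside the order id.
SELECT o.order_id, item
FROM orders o
LATERAL VIEW explode(o.items) exploded AS item;

-- TRANSFORM usage: ship a script to the cluster, then pipe the
-- selected columns through it as tab-separated lines on stdin.
-- The script writes tab-separated output lines to stdout.
ADD FILE /path/to/normalize.py;

SELECT TRANSFORM (user_id, raw_text)
       USING 'python normalize.py'
       AS (user_id STRING, clean_text STRING)
FROM comments;
```

Because TRANSFORM serializes every row to text and back, it is generally slower than a compiled Java UDF and is best kept for logic that is awkward to express otherwise.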
Apache Doris
Apache Doris, originally developed as Palo in 2011 by Baidu to support high-concurrency OLAP workloads, evolved into an open-source MPP analytical database focused on real-time data processing and low-latency queries.[66] User-defined functions (UDFs) in Apache Doris enable developers to extend its SQL capabilities with custom logic, primarily through Java implementations that integrate seamlessly with its frontend (FE) and backend (BE) architecture. The FE handles function registration and metadata management, while the BE nodes execute the UDFs during query processing, leveraging Doris's distributed execution model for scalability.[67][66]
Implementation of UDFs in Doris centers on Java, supporting scalar UDFs for row-level transformations and aggregate UDFs (UDAFs) for group-wise computations, with table-valued UDFs (UDTFs) added later for generating multiple output rows per input. Developers write UDFs by extending base classes like ScalarUDF or implementing UDAF interfaces with methods such as update and getResult, then package them into JAR files. Registration occurs via the CREATE FUNCTION statement, specifying input/output types and the JAR location, for example:
CREATE FUNCTION add_one(INT) RETURNS INT
PROPERTIES (
"file" = "file:///path/to/addone.jar",
"symbol" = "com.example.AddOne",
"type" = "JAVA_UDF"
);
This syntax allows immediate use in SQL queries, such as SELECT add_one(5);, with type mappings between Doris SQL types (e.g., INT to Java Integer) ensuring compatibility.[67][68]
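An aggregate UDF is registered in much the same way. The sketch below assumes the CREATE AGGREGATE FUNCTION form with the same PROPERTIES keys as the scalar case; the function name, JAR path, class name, and the sales table are hypothetical, and the exact property set may differ between Doris versions.

```sql
-- Register a Java aggregate UDF (UDAF). The class is assumed to
-- implement the aggregate methods expected by Doris.
CREATE AGGREGATE FUNCTION simple_sum(INT) RETURNS INT
PROPERTIES (
    "file" = "file:///path/to/simplesum.jar",
    "symbol" = "com.example.SimpleSum",
    "type" = "JAVA_UDF"
);

-- Use it like a built-in aggregate.
SELECT region, simple_sum(quantity) AS total_quantity
FROM sales
GROUP BY region;

-- Functions are identified by name and argument types when dropped.
DROP FUNCTION simple_sum(INT);
```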
A key feature is Doris's support for vectorized execution of UDFs, introduced progressively starting with version 0.15 in October 2021, which processes data in batches using SIMD instructions to boost performance in analytical workloads by reducing function call overhead. This integration aligns UDFs with Doris's columnar storage and query engine, enabling efficient handling of large-scale data without custom optimizations. Further enhancements in version 1.0, released in April 2022, improved cloud-native deployment and stability for UDFs, including better resource isolation on BE nodes via configurable JVM options like heap size.[69][70][71]
UDFs in Doris are particularly valuable for custom aggregations in OLAP scenarios, where built-in functions may fall short for domain-specific needs, such as advanced geospatial processing. For instance, a geospatial UDF could compute custom distance metrics between points stored in Doris's GEO type, enabling queries that aggregate distances per region in real-time reports: SELECT region, AVG(geo_udf_distance(geo_point1, geo_point2)) FROM locations GROUP BY region;. This extends Doris's native geospatial support, like ST_Distance, for tailored analytics in applications such as logistics or location-based services, while maintaining sub-second query latencies.[67][72]