Multi-task learning
Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.[1][2][3] Inherently, multi-task learning is a multi-objective optimization problem with trade-offs between different tasks.[4] Early versions of MTL were called "hints".[5][6]
In a widely cited 1997 paper, Rich Caruana gave the following characterization:
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.[3]
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam-filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones, for example an English speaker may find that all emails in Russian are spam, not so for Russian speakers. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.[citation needed] Further examples of settings for MTL include multiclass classification and multi-label classification.[7]
Multi-task learning works because regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is if the tasks share significant commonalities and are generally slightly under sampled.[8] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.[8][9]
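This regularization effect can be illustrated with a toy sketch of mean-regularized multi-task linear regression (in the spirit of the regularized MTL of Evgeniou and Pontil cited in the references), in which each task's weight vector is penalized for straying from the across-task mean. All data and parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_task, T = 5, 10, 2

# Two related tasks: a shared weight vector plus small task-specific deviations.
w_true = rng.normal(size=d)
tasks = []
for _ in range(T):
    X = rng.normal(size=(n_per_task, d))
    y = X @ (w_true + 0.1 * rng.normal(size=d)) + 0.1 * rng.normal(size=n_per_task)
    tasks.append((X, y))

def fit_mtl(tasks, lam, steps=5000, lr=0.01):
    """Gradient descent on sum_t ||X_t w_t - y_t||^2 + lam * sum_t ||w_t - w_bar||^2."""
    W = np.zeros((len(tasks), d))
    for _ in range(steps):
        w_bar = W.mean(axis=0)
        G = np.stack([2 * X.T @ (X @ w - y) + 2 * lam * (w - w_bar)
                      for (X, y), w in zip(tasks, W)])
        W -= lr * G / n_per_task
    return W

W_joint = fit_mtl(tasks, lam=10.0)  # tasks regularized toward their mean
W_indep = fit_mtl(tasks, lam=0.0)   # ordinary independent least squares
```

With the coupling penalty active, the two task solutions end up much closer together than the independently fitted ones, which is the shared inductive bias at work.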
Methods
The key challenge in multi-task learning is how to combine learning signals from multiple tasks into a single model. This depends strongly on how well the different tasks agree with, or contradict, each other. There are several ways to address this challenge:
Task grouping and overlap
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.[10] Task relatedness can be imposed a priori or learned from the data.[7][11] Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.[8][12] For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.[8]
Exploiting unrelated tasks
One can attempt to learn a group of principal tasks using a group of auxiliary tasks unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods which build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.[9]
Transfer of knowledge
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet,[13] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.[14]
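The two reuse patterns can be sketched schematically. In the toy sketch below, a fixed random linear map stands in for a pretrained trunk (it is, of course, not a real pretrained network), and only a task-specific head is fit on top of the frozen features; fine-tuning would instead use the same weights as an initialization and update them too:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained representation: a fixed (frozen) feature map.
# In practice this would be, e.g., the trunk of a pretrained network;
# here it is just a random linear map for illustration.
W_pretrained = rng.normal(size=(16, 8))

def features(X):
    """Frozen feature extractor: pre-processing for the downstream learner."""
    return np.tanh(X @ W_pretrained)

# New task: only a task-specific head is trained on top of the features.
X_new = rng.normal(size=(50, 16))
y_new = rng.normal(size=50)
Phi = features(X_new)
head, *_ = np.linalg.lstsq(Phi, y_new, rcond=None)

# Alternative pattern (not shown): treat W_pretrained as the initialization
# of a trainable layer and fine-tune it jointly with the head.
pred = features(X_new) @ head
```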
Multiple non-stationary tasks
Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed Group online adaptive learning (GOAL).[15] Sharing information can be particularly useful if learners operate in continuously changing environments, because a learner can benefit from the previous experience of another learner to adapt quickly to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time series, through content recommendation systems, to visual understanding for adaptive autonomous agents.
Multi-task optimization
Multi-task optimization focuses on solving multiple optimization problems simultaneously.[16][17] The paradigm has been inspired by the well-established concepts of transfer learning[18] and multi-task learning in predictive analytics.[19]
The key motivation behind multi-task optimization is that if optimization tasks are related to each other in terms of their optimal solutions or the general characteristics of their function landscapes,[20] the search progress on one task can be transferred to substantially accelerate the search on the others.
The success of the paradigm is not necessarily limited to one-way knowledge transfers from simpler to more complex tasks. In practice, one may intentionally attempt to solve a more difficult task, and in doing so unintentionally solve several smaller problems.[21]
There is a direct relationship between multitask optimization and multi-objective optimization.[22]
In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models.[23] Commonly, MTL models employ task-specific modules on top of a joint feature representation obtained using a shared module. Since this joint representation must capture useful features across all tasks, MTL may hinder individual task performance if the different tasks seek conflicting representations, i.e., the gradients of different tasks point in opposing directions or differ significantly in magnitude. This phenomenon is commonly referred to as negative transfer. To mitigate this issue, various MTL optimization methods have been proposed. Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics.
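One such heuristic is the gradient-surgery projection proposed in "Gradient Surgery for Multi-Task Learning" (cited in the references): when two task gradients conflict (negative inner product), each is projected onto the normal plane of the other before they are summed. A minimal two-task sketch:

```python
import numpy as np

def pcgrad_combine(g1, g2):
    """Two-task case of the gradient-surgery heuristic: remove the
    conflicting component of each gradient, then sum."""
    g1p, g2p = g1.astype(float).copy(), g2.astype(float).copy()
    if g1 @ g2 < 0:  # gradients conflict: they point in opposing directions
        g1p = g1 - (g1 @ g2) / (g2 @ g2) * g2  # drop g1's component along g2
        g2p = g2 - (g2 @ g1) / (g1 @ g1) * g1  # drop g2's component along g1
    return g1p + g2p

# Two conflicting task gradients.
g1 = np.array([1.0, 0.5])
g2 = np.array([-1.0, 0.4])
g = pcgrad_combine(g1, g2)
```

By construction, the combined update `g` has a non-negative inner product with each original task gradient, so the joint step no longer directly opposes either task.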
There are several common approaches for multi-task optimization: Bayesian optimization, evolutionary computation, and approaches based on game theory.[16]
Multi-task Bayesian optimization
Multi-task Bayesian optimization is a modern model-based approach that leverages the concept of knowledge transfer to speed up the automatic hyperparameter optimization process of machine learning algorithms.[24] The method builds a multi-task Gaussian process model on the data originating from different searches progressing in tandem.[25] The captured inter-task dependencies are thereafter utilized to better inform the subsequent sampling of candidate solutions in respective search spaces.
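The structure of such a multi-task Gaussian process can be sketched with a toy example: the joint covariance over all (task, input) pairs factors as the Kronecker product of an inter-task covariance matrix and an input kernel, so the model couples the parallel searches through the off-diagonal inter-task entries. All values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 6, 2  # observations per search, number of searches in tandem

X = rng.uniform(-1, 1, size=(n, 1))

# Input kernel (squared exponential) over candidate hyperparameter settings.
def k_input(a, b, ell=0.5):
    return np.exp(-((a - b.T) ** 2) / (2 * ell**2))

Kx = k_input(X, X)

# Inter-task covariance: off-diagonal entries encode how strongly the
# progressing searches inform one another (values here are illustrative).
Kt = np.array([[1.0, 0.8],
               [0.8, 1.0]])

# Joint covariance over all (task, input) pairs via the Kronecker product,
# with a small jitter for numerical stability.
K = np.kron(Kt, Kx) + 1e-6 * np.eye(T * n)

# K is a valid (positive definite) covariance, so we can, e.g., draw
# correlated observations for both tasks from the joint prior.
L = np.linalg.cholesky(K)
sample = L @ rng.normal(size=T * n)
```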
Evolutionary multi-tasking
Evolutionary multi-tasking has been explored as a means of exploiting the implicit parallelism of population-based search algorithms to simultaneously progress multiple distinct optimization tasks. By mapping all tasks to a unified search space, the evolving population of candidate solutions can harness the hidden relationships between them through continuous genetic transfer. This is induced when solutions associated with different tasks cross over.[17][26] Recently, modes of knowledge transfer that are different from direct solution crossover have been explored.[27][28]
Game-theoretic optimization
Game-theoretic approaches to multi-task optimization propose to view the optimization problem as a game, where each task is a player. All players compete through the reward matrix of the game, and try to reach a solution that satisfies all players (all tasks). This view provides insight into how to build efficient algorithms based on gradient descent optimization (GD), which is particularly important for training deep neural networks.[29] In GD for MTL, the problem is that each task provides its own loss, and it is not clear how to combine all losses and create a single unified gradient, leading to several different aggregation strategies.[30][31][32] This aggregation problem can be solved by defining a game matrix where the reward of each player is the agreement of its own gradient with the common gradient, and then setting the common gradient to be the Nash cooperative bargaining solution[33] of that system.
Applications
Algorithms for multi-task optimization span a wide array of real-world applications. Recent studies highlight the potential for speed-ups in the optimization of engineering design parameters by conducting related designs jointly in a multi-task manner.[26] In machine learning, the transfer of optimized features across related data sets can enhance the efficiency of the training process as well as improve the generalization capability of learned models.[34][35] In addition, the concept of multi-tasking has led to advances in automatic hyperparameter optimization of machine learning models and ensemble learning.[36][37]
Applications have also been reported in cloud computing,[38] with future developments geared towards cloud-based on-demand optimization services that can cater to multiple customers simultaneously.[17][39] Recent work has additionally shown applications in chemistry.[40] In addition, some recent works have applied multi-task optimization algorithms in industrial manufacturing.[41][42]
Mathematics
Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.[7]
RKHSvv concepts
Suppose the training data set is $\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, with $x_i^t \in \mathcal{X}$, $y_i^t \in \mathcal{Y}$, where $t \in \{1, \dots, T\}$ indexes the task. Let $n = \sum_{t=1}^T n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ for each task. This results in the regularized machine learning problem:

$$\min_{f \in \mathcal{H}} \sum_{t=1}^T \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\big(y_i^t, f_t(x_i^t)\big) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (1)$$

where $\mathcal{H}$ is a vector-valued reproducing kernel Hilbert space with functions $f : \mathcal{X} \to \mathcal{Y}^T$ having components $f_t : \mathcal{X} \to \mathcal{Y}$.

The reproducing kernel for the space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}^T$ is a symmetric matrix-valued function $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$, such that $\Gamma(\cdot, x)c \in \mathcal{H}$ and the following reproducing property holds:

$$\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c \rangle_{\mathcal{H}} \qquad (2)$$

The reproducing kernel gives rise to a representer theorem showing that any solution to equation (1) has the form:

$$f(x) = \sum_{t=1}^T \sum_{i=1}^{n_t} \Gamma(x, x_i^t)\, c_i^t \qquad (3)$$
Separable kernels
The form of the kernel Γ both induces the representation of the feature space and structures the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space $\mathcal{X}$ and on the tasks $\{1, \dots, T\}$. In this case the kernel relating scalar components $f_t$ and $f_s$ is given by $\gamma\big((x_i, t), (x_j, s)\big) = k(x_i, x_j)\, k_T(s, t) = k(x_i, x_j)\, A_{s,t}$. For vector-valued functions we can write $\Gamma(x_i, x_j) = k(x_i, x_j) A$, where k is a scalar reproducing kernel, and A is a symmetric positive semi-definite $T \times T$ matrix. Henceforth denote $S_+^T = \{\text{PSD matrices}\} \subset \mathbb{R}^{T \times T}$.

This factorization property, separability, implies that the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by A. Methods for non-separable kernels Γ are a current field of research.

For the separable case, the representer theorem reduces to $f(x) = \sum_{i=1}^n k(x, x_i)\, A c_i$. The model output on the training data is then KCA, where K is the $n \times n$ empirical kernel matrix with entries $K_{i,j} = k(x_i, x_j)$, and C is the $n \times T$ matrix whose rows are the vectors $c_i$.
With the separable kernel, equation (1) can be rewritten as

$$\min_{C \in \mathbb{R}^{n \times T}} V(Y, KCA) + \lambda \, \mathrm{tr}(KCAC^\top) \qquad (P)$$

where V is a (weighted) average of L applied entry-wise to Y and KCA. (The weight is zero if $y_i^t$ is a missing observation.)

Note the second term in (P) can be derived as follows:

$$\begin{aligned} \|f\|_{\mathcal{H}}^2 &= \Big\langle \sum_{i=1}^n k(\cdot, x_i) A c_i, \; \sum_{j=1}^n k(\cdot, x_j) A c_j \Big\rangle_{\mathcal{H}} \\ &= \sum_{i,j=1}^n \langle k(\cdot, x_i) A c_i, \; k(\cdot, x_j) A c_j \rangle_{\mathcal{H}} && \text{(bilinearity)} \\ &= \sum_{i,j=1}^n \langle k(x_i, x_j) A c_i, \; c_j \rangle_{\mathbb{R}^T} && \text{(reproducing property)} \\ &= \sum_{i,j=1}^n k(x_i, x_j)\, c_i^\top A c_j = \mathrm{tr}(KCAC^\top) \end{aligned}$$
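For the squared loss with fully observed outputs (taking $V(Y, KCA) = \tfrac{1}{n}\|Y - KCA\|_F^2$ and A invertible), setting the gradient of problem P to zero yields the linear condition $KCA + n\lambda C = Y$, which can be solved coefficient-wise in the eigenbases of K and A. A minimal numpy sketch with illustrative toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, lam = 8, 3, 0.1

X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, T))

# Scalar input kernel and task matrix of the separable kernel k(x, x') A.
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # Gaussian kernel
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])  # illustrative positive definite inter-task matrix

# Solve K C A + n*lam*C = Y coefficient-wise in the eigenbases of K and A.
sk, U = np.linalg.eigh(K)
sa, V = np.linalg.eigh(A)
Y_hat = U.T @ Y @ V
C_hat = Y_hat / (np.outer(sk, sa) + n * lam)
C = U @ C_hat @ V.T

# Model outputs on the training inputs are the rows of K C A.
F = K @ C @ A
```

Rotating into the eigenbases turns the coupled matrix equation into independent scalar divisions, one per (input-eigenvector, task-eigenvector) pair.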
Known task structure
Task structure representations
There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, or through an output mapping.
Regularizer—With the separable kernel, it can be shown (below) that $\|f\|_{\mathcal{H}}^2 = \sum_{s,t=1}^T A_{t,s}^\dagger \langle f_s, f_t \rangle_{\mathcal{H}_k}$, where $A_{t,s}^\dagger$ is the $(t,s)$ element of the pseudoinverse of $A$, $\mathcal{H}_k$ is the RKHS based on the scalar kernel $k$, and $f_t(x) = \sum_{i=1}^n k(x, x_i) (A c_i)_t$. This formulation shows that $A_{t,s}^\dagger$ controls the weight of the penalty associated with $\langle f_s, f_t \rangle_{\mathcal{H}_k}$. (Note that the diagonal terms $\langle f_t, f_t \rangle_{\mathcal{H}_k} = \|f_t\|_{\mathcal{H}_k}^2$ penalize individual task norms.)

Output metric—an alternative output metric on $\mathcal{Y}^T$ can be induced by the inner product $\langle y_1, y_2 \rangle_\Theta = \langle y_1, \Theta y_2 \rangle_{\mathbb{R}^T}$. With the squared loss there is an equivalence between the separable kernels $k(\cdot, \cdot) I_T$ under the alternative metric, and $k(\cdot, \cdot) \Theta^\dagger$ under the canonical metric.

Output mapping—Outputs can be mapped as $L : \mathcal{Y}^T \to \tilde{\mathcal{Y}}$ to a higher-dimensional space to encode complex structures such as trees, graphs and strings. For linear maps L, with appropriate choice of separable kernel, it can be shown that $A = L^\top L$.
Task structure examples
Via the regularizer formulation, one can represent a variety of task structures easily.
- Letting $A^\dagger = I_T - \frac{1}{T}\mathbf{1}\mathbf{1}^\top$ (where $I_T$ is the $T \times T$ identity matrix, and $\mathbf{1}\mathbf{1}^\top$ is the $T \times T$ matrix of ones) penalizes the variance $\sum_t \|f_t - \bar{f}\|_{\mathcal{H}_k}^2$ of the tasks from their mean $\bar{f} = \frac{1}{T}\sum_t f_t$. For example, blood levels of some biomarker may be taken on T patients at $n_t$ time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
- Letting $A^\dagger = I_T - M$, where $M_{t,s} = \frac{1}{|G_r|}\,\mathbb{I}(t, s \in G_r)$, penalizes the variance measured with respect to a group mean: $\sum_r \sum_{t \in G_r} \big\|f_t - \frac{1}{|G_r|}\sum_{s \in G_r} f_s\big\|_{\mathcal{H}_k}^2$. (Here $|G_r|$ is the cardinality of group r, and $\mathbb{I}$ is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
- Letting $A^\dagger = L = D - M$, where L is the Laplacian for the graph with adjacency matrix M giving pairwise similarities of tasks and $D_{t,t} = \sum_s M_{t,s}$ is the degree matrix. This is equivalent to giving a larger penalty to the distance separating tasks t and s when they are more similar (according to the weight $M_{t,s}$), i.e. it regularizes $\frac{1}{2}\sum_{t,s} M_{t,s}\|f_t - f_s\|_{\mathcal{H}_k}^2$.
- Adding a ridge term $\lambda I_T$ to any of the above choices of $A^\dagger$ induces the additional regularization term $\lambda \sum_t \|f_t\|_{\mathcal{H}_k}^2$, which penalizes complexity in f more broadly.
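These regularizer identities can be checked numerically: representing each task function by a coefficient vector and forming the Gram matrix $G_{s,t} = \langle f_s, f_t \rangle$, the penalty $\sum_{s,t} A^\dagger_{t,s} \langle f_s, f_t \rangle$ equals $\mathrm{tr}(A^\dagger G)$. A small sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d = 4, 6
F = rng.normal(size=(T, d))  # row t represents task function f_t
G = F @ F.T                  # Gram matrix of pairwise inner products

# Mean-variance penalty: A^+ = I - (1/T) 1 1^T equals sum_t ||f_t - f_bar||^2.
A_var = np.eye(T) - np.ones((T, T)) / T
assert np.isclose(np.trace(A_var @ G),
                  np.sum((F - F.mean(axis=0)) ** 2))

# Graph-Laplacian penalty: A^+ = D - M equals (1/2) sum_{t,s} M_ts ||f_t - f_s||^2.
M = rng.uniform(size=(T, T)); M = (M + M.T) / 2; np.fill_diagonal(M, 0)
Lap = np.diag(M.sum(axis=1)) - M
assert np.isclose(np.trace(Lap @ G),
                  0.5 * sum(M[t, s] * np.sum((F[t] - F[s]) ** 2)
                            for t in range(T) for s in range(T)))
```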
Learning tasks together with their structure
Learning problem (P) can be generalized to admit learning the task matrix A as follows:

$$\min_{C \in \mathbb{R}^{n \times T},\ A \in S_+^T} V(Y, KCA) + \lambda \, \mathrm{tr}(KCAC^\top) + F(A) \qquad (Q)$$

The penalty F must be designed to learn matrices A of a given type. See "Special cases" below.

Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that although (Q) is not jointly convex in C and A, a related problem is jointly convex.

Specifically, on the convex set $\mathcal{C} = \{(C, A) \in \mathbb{R}^{n \times T} \times S_+^T : \mathrm{Range}(C^\top K C) \subseteq \mathrm{Range}(A)\}$, the equivalent problem

$$\min_{(C, A) \in \mathcal{C}} V(Y, KC) + \lambda \, \mathrm{tr}(A^\dagger C^\top K C) + F(A) \qquad (R)$$

is convex with the same minimum value. And if $(C_R, A_R)$ is a minimizer for (R), then $(C_R A_R^\dagger, A_R)$ is a minimizer for (Q).

(R) may be solved by a barrier method on a closed set by introducing the following perturbation:

$$\min_{C \in \mathbb{R}^{n \times T},\ A \in S_+^T} V(Y, KC) + \lambda \, \mathrm{tr}\big(A^\dagger (C^\top K C + \delta^2 I_T)\big) + F(A) \qquad (S)$$

The perturbation via the barrier $\delta^2 \mathrm{tr}(A^\dagger)$ forces the objective function to be equal to $+\infty$ on the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

(S) can be solved with a block coordinate descent method, alternating in C and A. This results in a sequence of minimizers of (S) that converges to the solution of (R) as $\delta_m \to 0$, and hence gives the solution to (Q).
Special cases
Spectral penalties - Dinuzzo et al.[43] suggested setting F as the Frobenius norm $\sqrt{\mathrm{tr}(A^\top A)}$. They optimized (Q) directly using block coordinate descent, not accounting for difficulties at the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

Clustered tasks learning - Jacob et al.[44] suggested learning A in the setting where the T tasks are organized in R disjoint clusters. In this case let $E \in \{0, 1\}^{T \times R}$ be the matrix with $E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)$. Setting $M = E E^\dagger$ and $U = \frac{1}{T}\mathbf{1}\mathbf{1}^\top$, the task matrix $A^\dagger$ can be parameterized as a function of M: $A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon (I - M)$, with terms that penalize the average, the between-clusters variance and the within-clusters variance of the task predictions, respectively. This parameterization in M is not convex, but there is a convex relaxation $\mathcal{S}_c = \{M \in S_+^T : I - M \in S_+^T \text{ and } \mathrm{tr}(M) = R\}$. In this formulation, F(A) is the indicator function of the set of matrices A(M) with $M \in \mathcal{S}_c$.
Generalizations
Non-convex penalties - Penalties can be constructed such that A is constrained to be a graph Laplacian, or that A has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.

Non-separable kernels - Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.
Software package
A Matlab package called Multi-Task Learning via StructurAl Regularization (MALSAR)[45] implements the following multi-task learning algorithms: Mean-Regularized Multi-Task Learning,[46][47] Multi-Task Learning with Joint Feature Selection,[48] Robust Multi-Task Feature Learning,[49] Trace-Norm Regularized Multi-Task Learning,[50] Alternating Structural Optimization,[51][52] Incoherent Low-Rank and Sparse Learning,[53] Robust Low-Rank Multi-Task Learning, Clustered Multi-Task Learning,[54][55] Multi-Task Learning with Graph Structures.
Literature
- Waegeman, Willem; Dembczynski, Krzysztof; Huellermeier, Eyke. Multi-Target Prediction: A Unifying View on Problems and Methods. https://arxiv.org/abs/1809.02352v1
See also
- Artificial intelligence
- Artificial neural network
- Automated machine learning (AutoML)
- Evolutionary computation
- Foundation model
- General game playing
- Human-based genetic algorithm
- Kernel methods for vector output
- Multiple-criteria decision analysis
- Multi-objective optimization
- Multicriteria classification
- Robot learning
- Transfer learning
- James–Stein estimator
References
- ^ Baxter, J. (2000). "A model of inductive bias learning". Journal of Artificial Intelligence Research. 12: 149–198.
- ^ Thrun, S. (1996). Is learning the n-th thing any easier than learning the first?. In Advances in Neural Information Processing Systems 8, pp. 640--646. MIT Press. Paper at Citeseer
- ^ a b Caruana, R. (1997). "Multi-task learning" (PDF). Machine Learning. 28: 41–75. doi:10.1023/A:1007379606734.
- ^ Multi-Task Learning as Multi-Objective Optimization Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018), https://proceedings.neurips.cc/paper/2018/hash/432aca3a1e345e339f35a30c8f65edce-Abstract.html
- ^ Suddarth, S., Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. EURASIP Workshop. Neural Networks pp. 120-129. Lecture Notes in Computer Science. Springer.
- ^ Abu-Mostafa, Y. S. (1990). "Learning from hints in neural networks". Journal of Complexity. 6 (2): 192–198. doi:10.1016/0885-064x(90)90006-y.
- ^ a b c Ciliberto, C. (2015). "Convex Learning of Multiple Tasks and their Structure". arXiv:1504.03101 [cs.LG].
- ^ a b c d Hajiramezanali, E. & Dadaneh, S. Z. & Karbalayghareh, A. & Zhou, Z. & Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. arXiv:1810.09433
- ^ a b Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M., (2012) Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf
- ^ Kumar, A., & Daume III, H., (2012) Learning Task Grouping and Overlap in Multi-Task Learning. http://icml.cc/2012/papers/690.pdf
- ^ Jawanpuria, P., & Saketha Nath, J., (2012) A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf
- ^ Zweig, A. & Weinshall, D. Hierarchical Regularization Cascade for Joint Learning. Proceedings of the 30th International Conference on Machine Learning, Atlanta GA, June 2013. http://www.cs.huji.ac.il/~daphna/papers/Zweig_ICML2013.pdf
- ^ Szegedy, Christian; Wei Liu, Youssef; Yangqing Jia, Tomaso; Sermanet, Pierre; Reed, Scott; Anguelov, Dragomir; Erhan, Dumitru; Vanhoucke, Vincent; Rabinovich, Andrew (2015). "Going deeper with convolutions". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–9. arXiv:1409.4842. doi:10.1109/CVPR.2015.7298594. ISBN 978-1-4673-6964-0. S2CID 206592484.
- ^ Roig, Gemma. "Deep Learning Overview" (PDF). Archived from the original (PDF) on 2016-03-06. Retrieved 2019-08-26.
- ^ Zweig, A. & Chechik, G. Group online adaptive learning. Machine Learning, doi:10.1007/s10994-017-5661-5, August 2017. http://rdcu.be/uFSv
- ^ a b Gupta, Abhishek; Ong, Yew-Soon; Feng, Liang (2018). "Insights on Transfer Optimization: Because Experience is the Best Teacher". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (1): 51–64. Bibcode:2018ITECI...2...51G. doi:10.1109/TETCI.2017.2769104. hdl:10356/147980. S2CID 11510470.
- ^ a b c Gupta, Abhishek; Ong, Yew-Soon; Feng, Liang (2016). "Multifactorial Evolution: Toward Evolutionary Multitasking". IEEE Transactions on Evolutionary Computation. 20 (3): 343–357. Bibcode:2016ITEC...20..343G. doi:10.1109/TEVC.2015.2458037. hdl:10356/148174. S2CID 13767012.
- ^ Pan, Sinno Jialin; Yang, Qiang (2010). "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering. 22 (10): 1345–1359. Bibcode:2010ITKDE..22.1345P. doi:10.1109/TKDE.2009.191. S2CID 740063.
- ^ Caruana, R., "Multitask Learning", pp. 95-134 in Sebastian Thrun, Lorien Pratt (eds.) Learning to Learn, (1998) Springer ISBN 9780792380474
- ^ Cheng, Mei-Ying; Gupta, Abhishek; Ong, Yew-Soon; Ni, Zhi-Wei (2017). "Coevolutionary multitasking for concurrent global optimization: With case studies in complex engineering design". Engineering Applications of Artificial Intelligence. 64: 13–24. doi:10.1016/j.engappai.2017.05.008. S2CID 13767210.
- ^ Cabi, Serkan; Sergio Gómez Colmenarejo; Hoffman, Matthew W.; Denil, Misha; Wang, Ziyu; Nando de Freitas (2017). "The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously". arXiv:1707.03300 [cs.AI].
- ^ J. -Y. Li, Z. -H. Zhan, Y. Li and J. Zhang, "Multiple Tasks for Multiple Objectives: A New Multiobjective Optimization Method via Multitask Optimization," in IEEE Transactions on Evolutionary Computation, doi:10.1109/TEVC.2023.3294307
- ^ Standley, Trevor; Zamir, Amir R.; Chen, Dawn; Guibas, Leonidas; Malik, Jitendra; Savarese, Silvio (2020-07-13). "Which Tasks Should Be Learned Together in Multi-Task Learning?". International Conference on Machine Learning: 9120–9132. arXiv:1905.07553.
- ^ Swersky, K., Snoek, J., & Adams, R. P. (2013). Multi-task bayesian optimization. Advances in neural information processing systems (pp. 2004-2012).
- ^ Bonilla, E. V., Chai, K. M., & Williams, C. (2008). Multi-task Gaussian process prediction. Advances in neural information processing systems (pp. 153-160).
- ^ a b Ong, Y. S., & Gupta, A. (2016). Evolutionary multitasking: a computer science view of cognitive multitasking. Cognitive Computation, 8(2), 125-142.
- ^ Feng, Liang; Zhou, Lei; Zhong, Jinghui; Gupta, Abhishek; Ong, Yew-Soon; Tan, Kay-Chen; Qin, A. K. (2019). "Evolutionary Multitasking via Explicit Autoencoding". IEEE Transactions on Cybernetics. 49 (9): 3457–3470. Bibcode:2019ITCyb..49.3457F. doi:10.1109/TCYB.2018.2845361. PMID 29994415. S2CID 51613697.
- ^ Jiang, Yi; Zhan, Zhi-Hui; Tan, Kay Chen; Zhang, Jun (January 2024). "Block-Level Knowledge Transfer for Evolutionary Multitask Optimization". IEEE Transactions on Cybernetics. 54 (1): 558–571. Bibcode:2024ITCyb..54..558J. doi:10.1109/TCYB.2023.3273625. ISSN 2168-2267. PMID 37216256.
- ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. MIT Press. ISBN 978-0-262-03561-3.
- ^ Liu, L.; Li, Y.; Kuang, Z.; Xue, J.; Chen, Y.; Yang, W.; Liao, Q.; Zhang, W. (2021-05-04). "Towards Impartial Multi-task Learning". Proceedings of the International Conference on Learning Representations (ICLR 2021). Retrieved 2022-11-20.
- ^ Tianhe, Yu; Saurabh, Kumar; Abhishek, Gupta; Sergey, Levine; Karol, Hausman; Chelsea, Finn (2020). "Gradient Surgery for Multi-Task Learning". Advances in Neural Information Processing Systems. 33. arXiv:2001.06782.
- ^ Liu, Bo; Liu, Xingchao; Jin, Xiaojie; Stone, Peter; Liu, Qiang (2021-10-26). "Conflict-Averse Gradient Descent for Multi-task Learning". arXiv:2110.14048 [cs.LG].
- ^ Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, Ethan Fetaya, (2022). Multi-Task Learning as a Bargaining Game. International conference on machine learning.
- ^ Chandra, R., Gupta, A., Ong, Y. S., & Goh, C. K. (2016, October). Evolutionary multi-task learning for modular training of feedforward neural networks. In International Conference on Neural Information Processing (pp. 37-46). Springer, Cham.
- ^ Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320-3328).
- ^ Wen, Yu-Wei; Ting, Chuan-Kang (2016). "Learning ensemble of decision trees through multifactorial genetic programming". 2016 IEEE Congress on Evolutionary Computation (CEC). pp. 5293–5300. doi:10.1109/CEC.2016.7748363. ISBN 978-1-5090-0623-6. S2CID 2617811.
- ^ Zhang, Boyu; Qin, A. K.; Sellis, Timos (2018). "Evolutionary feature subspaces generation for ensemble classification". Proceedings of the Genetic and Evolutionary Computation Conference. pp. 577–584. doi:10.1145/3205455.3205638. ISBN 978-1-4503-5618-3. S2CID 49564862.
- ^ Bao, Liang; Qi, Yutao; Shen, Mengqing; Bu, Xiaoxuan; Yu, Jusheng; Li, Qian; Chen, Ping (2018). "An Evolutionary Multitasking Algorithm for Cloud Computing Service Composition". Services – SERVICES 2018. Lecture Notes in Computer Science. Vol. 10975. pp. 130–144. doi:10.1007/978-3-319-94472-2_10. ISBN 978-3-319-94471-5.
- ^ Tang, J., Chen, Y., Deng, Z., Xiang, Y., & Joy, C. P. (2018). A Group-based Approach to Improve Multifactorial Evolutionary Algorithm. In IJCAI (pp. 3870-3876).
- ^ Felton, Kobi; Wigh, Daniel; Lapkin, Alexei (2021). "Multi-task Bayesian Optimization of Chemical Reactions". chemRxiv. doi:10.26434/chemrxiv.13250216.v2.
- ^ Jiang, Yi; Zhan, Zhi-Hui; Tan, Kay Chen; Zhang, Jun (October 2023). "A Bi-Objective Knowledge Transfer Framework for Evolutionary Many-Task Optimization". IEEE Transactions on Evolutionary Computation. 27 (5): 1514–1528. Bibcode:2023ITEC...27.1514J. doi:10.1109/TEVC.2022.3210783. ISSN 1089-778X.
- ^ Jiang, Yi; Zhan, Zhi-Hui; Tan, Kay Chen; Kwong, Sam; Zhang, Jun (2024). "Knowledge Structure Preserving-Based Evolutionary Many-Task Optimization". IEEE Transactions on Evolutionary Computation. 29 (2): 287–301. doi:10.1109/TEVC.2024.3355781. ISSN 1089-778X.
- ^ Dinuzzo, Francesco (2011). "Learning output kernels with block coordinate descent" (PDF). Proceedings of the 28th International Conference on Machine Learning (ICML-11). Archived from the original (PDF) on 2017-08-08.
- ^ Jacob, Laurent (2009). "Clustered multi-task learning: A convex formulation". Advances in Neural Information Processing Systems. arXiv:0809.2085. Bibcode:2008arXiv0809.2085J.
- ^ Zhou, J., Chen, J. and Ye, J. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2012. http://www.public.asu.edu/~jye02/Software/MALSAR. On-line manual
- ^ Evgeniou, T., & Pontil, M. (2004). Regularized multi–task learning. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117).
- ^ Evgeniou, T.; Micchelli, C.; Pontil, M. (2005). "Learning multiple tasks with kernel methods" (PDF). Journal of Machine Learning Research. 6: 615.
- ^ Argyriou, A.; Evgeniou, T.; Pontil, M. (2008a). "Convex multi-task feature learning". Machine Learning. 73 (3): 243–272. doi:10.1007/s10994-007-5040-8.
- ^ Chen, J., Zhou, J., & Ye, J. (2011). Integrating low-rank and group-sparse structures for robust multi-task learning[dead link]. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.
- ^ Ji, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 457–464).
- ^ Ando, R.; Zhang, T. (2005). "A framework for learning predictive structures from multiple tasks and unlabeled data" (PDF). The Journal of Machine Learning Research. 6: 1817–1853.
- ^ Chen, J., Tang, L., Liu, J., & Ye, J. (2009). A convex formulation for learning shared structures from multiple tasks. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 137–144).
- ^ Chen, J., Liu, J., & Ye, J. (2010). Learning incoherent sparse and low-rank patterns from multiple tasks. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1179–1188).
- ^ Jacob, L., Bach, F., & Vert, J. (2008). Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems, 2008
- ^ Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems.
External links
- The Biosignals Intelligence Group at UIUC
- Washington University in St. Louis Department of Computer Science
Software
- The Multi-Task Learning via Structural Regularization Package
- Online Multi-Task Learning Toolkit (OMT) A general-purpose online multi-task learning toolkit based on conditional random field models and stochastic gradient descent training (C#, .NET)
Multi-task learning
View on GrokipediaOverview
Definition and Core Concepts
Multi-task learning (MTL) is a subfield of machine learning in which a model is trained to simultaneously solve multiple related tasks, leveraging shared representations or parameters to exploit interdependencies among the tasks for improved performance and generalization.[4][5] In this paradigm, the model learns a unified representation from data across all tasks, allowing knowledge transfer that enhances the learning of each individual task compared to training them in isolation.[5] This approach contrasts with single-task learning, where separate models are developed independently for each task, potentially leading to redundant computations and missed opportunities for cross-task synergies. Core concepts in MTL revolve around shared representations, such as common feature extractors that capture underlying patterns beneficial to multiple tasks; auxiliary tasks, which serve as supportive problems to regularize the model and provide additional supervisory signals; and inductive bias derived from task relatedness, which guides the learner toward hypotheses that generalize better across the domain.[5] For instance, in a shared representation setup, an initial layer might extract general features like edges in images, which are then branched into task-specific heads for classification or regression, as opposed to fully independent models that duplicate such foundational learning.[4] This structure effectively acts as implicit data augmentation by amplifying the training signals through related tasks, increasing the effective sample size and reducing overfitting without requiring additional labeled data for the primary task.[6] MTL differs from related paradigms like transfer learning, where knowledge is sequentially transferred from a source task to a target task after pre-training, whereas MTL emphasizes joint training of all tasks from the outset to enable bidirectional knowledge sharing.[5] The foundational formalization of MTL traces back to Rich Caruana's 1997 
work, which introduced the idea of using related tasks to impose a beneficial inductive bias, thereby improving generalization through the implicit augmentation of training data via cross-task signals.[4][6]

Historical Development and Motivation
Multi-task learning (MTL) emerged in the late 1990s as a technique to enhance generalization in machine learning by jointly training models on multiple related tasks, drawing inspiration from how humans learn interconnected skills. The foundational work by Rich Caruana in 1997 formalized MTL, demonstrating its potential through shared representations in neural networks to leverage domain-specific information from auxiliary tasks, particularly in data-scarce environments.[7] Early efforts focused on shallow models and supervised learning paradigms, with key advancements including regularized formulations for feature sharing in the mid-2000s, such as those by Evgeniou and Pontil (2004), which used kernel methods to capture task correlations. The field experienced a resurgence in the 2010s, driven by the deep learning revolution following the success of convolutional neural networks around 2012. Researchers began integrating MTL into deep architectures, emphasizing shared encoders to exploit hierarchical representations across tasks; for instance, Misra et al. (2016) introduced cross-stitch units for adaptive parameter sharing in vision tasks, while Luong et al. (2016) applied shared encoders in sequence-to-sequence models for natural language processing.[8] This period also saw MTL's integration with transfer learning, exemplified by Taskonomy (Zamir et al., 2018), which pretrained models on diverse visual tasks to enable efficient downstream adaptation between 2015 and 2020. In the 2020s, MTL has evolved alongside foundation models, incorporating multimodal pretraining for vision-language tasks; notable examples include the 12-in-1 model by Lu et al. 
(2020), which unified multiple vision-and-language objectives, and extensions of CLIP-like architectures such as M2-CLIP (2024), which use adapters for multi-task video understanding.[9][10] Recent advances (2023–2025) emphasize scalable multimodal MTL in pretrained models like variants of Gemini and mPLUG-2, enabling joint learning across text, image, and video modalities for large-scale AI systems.[11] The primary motivations for MTL include improved generalization via shared inductive biases, reduced overfitting through auxiliary tasks, enhanced efficiency in low-data regimes, and scalability for complex systems; empirical studies from early benchmarks, such as those on sentiment classification and robot control, report relative error reductions of 10–20% compared to single-task learning. This evolution was propelled by the shift from shallow to deep neural networks post-2012, deeper integration with transfer learning, and the rise of foundation models handling multimodality. Early work was largely confined to supervised tasks, but by the 2020s, expansions to semi-supervised and reinforcement learning paradigms addressed these gaps, broadening MTL's applicability.[2]

Methods
Task Relationship Modeling
Task relationship modeling in multi-task learning involves identifying and quantifying dependencies among tasks to guide the design of shared representations and avoid negative transfer. This foundational step enables the selective sharing of knowledge between similar tasks while isolating dissimilar ones, improving overall generalization. Approaches typically begin by analyzing task similarities through data-driven metrics, followed by clustering or subspace modeling to exploit overlaps. Seminal work in this area, such as the use of Dirichlet process priors for inferring task clusters, demonstrated that grouping related tasks can enhance predictive performance by capturing latent structures in task relatedness.[12] Task grouping methods cluster tasks based on similarity measures derived from task embeddings or correlation matrices, allowing joint training within clusters to leverage shared patterns. For instance, hierarchical clustering algorithms applied to multi-task settings, introduced around 2007, use nonparametric Bayesian models to automatically determine the number of clusters and assign tasks accordingly, as in the infinite relational model which treats tasks as nodes in a graph and infers cluster assignments via posterior sampling. These methods compute similarities from task outputs or gradients, grouping tasks with high correlation to form sub-networks for training. In practice, such clustering has been shown to reduce overfitting in datasets with heterogeneous tasks by limiting interference from unrelated groups.[12] Overlap exploitation techniques model shared subspaces between tasks using low-rank approximations of task covariances, assuming that related tasks lie in a low-dimensional manifold. A key approach regularizes the joint parameter matrix across tasks to enforce low-rank structure, capturing correlations via nuclear norm penalties on the covariance matrix of task predictors. 
This allows decomposition of task-specific parameters into shared low-rank components plus sparse individual deviations, effectively modeling subspace overlaps. For example, in computer vision, tasks like semantic segmentation and object detection exhibit overlap in feature representations for edge and region detection, where low-rank modeling groups them to share convolutional filters, leading to improved accuracy on benchmarks like PASCAL VOC over single-task baselines.[5] Strategies for handling unrelated or negatively correlated tasks treat them as regularizers to enhance robustness, preventing interference in joint optimization. In 2010s studies, including negatively correlated tasks in multi-task frameworks was found to act as implicit noise injection, improving generalization on held-out data in scenarios with task conflicts, as evidenced in regularization-based relation learning that assigns negative weights to dissimilar pairs. This approach uses adaptive penalties to downweight negative influences during training, ensuring that unrelated tasks contribute to variance reduction without dominating shared parameters. Evidence from synthetic and real-world datasets, such as gene expression prediction, shows that such inclusion mitigates overfitting in high-dimensional settings.[13][14] Metrics for task relationships include Gram matrix distances and mutual information, which quantify similarity without assuming specific model architectures. Gram matrix distances, derived from kernel methods, measure divergence between task covariance kernels as the Frobenius norm of their difference, providing a kernel-based similarity score. Mutual information, estimated via kernel density approximations, captures nonlinear dependencies between task outputs. The following Python snippet illustrates computing a correlation-based task similarity matrix, a precursor to these metrics:

import numpy as np

def compute_task_similarity(task_outputs):
    # task_outputs: list of arrays, each shape (n_samples, n_features) for a task
    # (assumes all tasks flatten to vectors of equal length for np.corrcoef)
    n_tasks = len(task_outputs)
    similarity_matrix = np.zeros((n_tasks, n_tasks))
    for i in range(n_tasks):
        for j in range(i + 1, n_tasks):
            corr = np.corrcoef(task_outputs[i].flatten(), task_outputs[j].flatten())[0, 1]
            similarity_matrix[i, j] = similarity_matrix[j, i] = abs(corr)  # use absolute correlation for grouping
    return similarity_matrix
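A similarity matrix of this kind can then drive task grouping. The following sketch uses synthetic data and a simple greedy threshold rule as a stand-in for the hierarchical and Bayesian clustering methods described above; the 0.5 cutoff and the four-task setup are illustrative assumptions, not part of any published method:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic outputs for four tasks: (0, 1) and (2, 3) are correlated pairs
base_a = rng.normal(size=100)
base_b = rng.normal(size=100)
task_outputs = [base_a,
                base_a + 0.1 * rng.normal(size=100),
                base_b,
                base_b + 0.1 * rng.normal(size=100)]

n_tasks = len(task_outputs)
similarity = np.eye(n_tasks)
for i in range(n_tasks):
    for j in range(i + 1, n_tasks):
        corr = abs(np.corrcoef(task_outputs[i], task_outputs[j])[0, 1])
        similarity[i, j] = similarity[j, i] = corr

# Greedy grouping: join a task to the first group whose members it resembles
groups = []
for t in range(n_tasks):
    for g in groups:
        if all(similarity[t, m] > 0.5 for m in g):
            g.append(t)
            break
    else:
        groups.append([t])
# groups recovers the two correlated pairs: [[0, 1], [2, 3]]
```

Tasks in the same group would then share a sub-network during joint training, while separate groups remain isolated to limit interference.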
Knowledge Transfer Techniques
In multi-task learning, knowledge transfer techniques facilitate the sharing of learned representations and parameters across related tasks to improve generalization and efficiency. These methods, evolving from early regularization approaches in the 2010s to deep neural network adaptations, emphasize architectural designs that balance task independence and interdependence without requiring explicit task groupings. Parameter sharing is a foundational technique for knowledge transfer, where components of the model are jointly optimized to capture commonalities. Hard parameter sharing employs a shared "trunk" of layers, typically convolutional or feedforward, followed by task-specific heads, as introduced in early deep multi-task architectures such as multi-head networks around 2016. This approach reduces overfitting compared to task-independent models, particularly when tasks share low-level features like in vision tasks. Soft parameter sharing, in contrast, assigns separate parameter sets to each task but induces transfer via regularized constraints on parameter differences, allowing flexibility for loosely related tasks while promoting alignment. Architectures like cross-stitch networks exemplify this by learning task-specific combinations of shared activations, enhancing transfer without full parameter fusion. Regularization-based transfer enforces low-rank structures or predictive consistency across tasks to prevent negative interference. Trace norm regularization, a seminal method from the early 2010s, promotes low-rank weight matrices across tasks by penalizing the nuclear norm of task parameter concatenations, enabling sparse data regimes to leverage task correlations effectively. In deep variants, adaptations like gradient sign dropout extend traditional dropout by selectively masking gradients based on task relations, mitigating overfitting in multi-task settings during the 2010s shift to neural networks. 
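The cross-stitch idea mentioned above can be made concrete: each task's next activation is a learned linear mix of both tasks' current activations. The snippet below is a minimal NumPy illustration with a hand-fixed 2×2 mixing matrix; in an actual cross-stitch network, alpha is a trainable parameter learned per layer alongside the network weights:

```python
import numpy as np

def cross_stitch(x_a, x_b, alpha):
    """Mix the activations of two tasks.
    alpha[i] holds the weights producing task i's new activation
    from the pair (x_a, x_b)."""
    stacked = np.stack([x_a, x_b])   # shape (2, d)
    mixed = alpha @ stacked          # one linear combination per task
    return mixed[0], mixed[1]

# Near-identity mixing: mostly task-specific, with slight sharing
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
x_a, x_b = np.ones(4), np.zeros(4)
new_a, new_b = cross_stitch(x_a, x_b, alpha)  # new_a -> 0.9s, new_b -> 0.1s
```

Setting alpha to the identity recovers fully independent (soft-shared) networks, while uniform entries approach hard sharing, so the learned matrix interpolates between the two regimes.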
Cross-task distillation further supports transfer by using predictions from one task as soft labels to guide another, as demonstrated in multi-task recommendation systems where auxiliary task outputs distill knowledge to primary tasks, improving convergence without additional data. Auxiliary task design involves introducing synthetic or proxy tasks to enrich representations for primary objectives, a strategy dating to the early 2000s but refined in deep learning. For instance, reconstruction tasks as auxiliaries for classification compel models to learn robust features by predicting input reconstructions alongside labels, boosting primary performance in speech recognition on benchmarks. Recent multimodal setups, such as hierarchical frameworks combining imaging and clinical data, use auxiliary cognitive prediction tasks to enhance main diagnostic goals in healthcare applications like disease prognosis. For non-stationary tasks where distributions evolve over time, continual multi-task learning employs replay buffers to mitigate catastrophic forgetting post-2018. These buffers store exemplars from prior tasks, replaying them during training on new tasks to preserve knowledge; methods like CLEAR use experience replay, significantly reducing forgetting in sequential benchmarks such as Atari games. Curiosity-driven variants further prioritize diverse buffer samples, supporting efficient adaptation in dynamic environments without full retraining. Recent advancements as of 2025 include integration with large foundation models for scalable continual learning in transformer-based architectures.[16]

Optimization and Learning Paradigms
In multi-task learning, optimization typically involves minimizing a joint objective that combines losses from multiple tasks, often through a weighted sum to balance their contributions during training. The basic formulation employs static weights, but dynamic weighting schemes adapt these based on task-specific characteristics to prevent dominant tasks from overshadowing others. A prominent approach introduces task uncertainty as a learnable parameter, where the weight for each task's loss is inversely proportional to its homoscedastic uncertainty $\sigma_i$, contributing a term $L_i/(2\sigma_i^2) + \log \sigma_i$ for task $i$, with $\sigma_i$ optimized alongside model parameters via maximum likelihood estimation. This method, applied to scene geometry and semantics tasks, improves performance by automatically scaling losses according to their noise levels, achieving relative error reductions of up to 25% on depth estimation benchmarks compared to equal weighting.[16] The following pseudocode illustrates the forward pass and loss computation for uncertainty-weighted multi-task optimization in a neural network setting:

for each batch in training data:
    for each task i in tasks:
        predictions_i = model(batch_inputs)[task_i]
        loss_i = task_loss_i(predictions_i, batch_targets_i)
        weighted_loss_i = loss_i / (2 * exp(2 * log_sigma_i))  # equals loss_i / (2 * sigma_i^2)
        total_loss += weighted_loss_i
    total_loss += sum(log_sigma_i over all tasks)  # penalize extreme uncertainties
    optimizer.step(total_loss)
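The uncertainty-weighted objective above can be evaluated as a plain function of per-task losses and learned log-variances. The helper below is an illustrative NumPy sketch, not tied to any specific framework; it parameterizes $s_i = \log \sigma_i^2$ (a common numerically stable choice), so the per-task term $L_i/(2\sigma_i^2) + \log \sigma_i$ becomes $\tfrac{1}{2} e^{-s_i} L_i + \tfrac{1}{2} s_i$:

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum over tasks of L_i / (2 * sigma_i^2) + log(sigma_i),
    with s_i = log(sigma_i^2). Noisier tasks (larger s_i) get
    smaller effective weights on their raw losses."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    precision = np.exp(-log_vars)   # 1 / sigma_i^2
    return float(np.sum(0.5 * precision * task_losses + 0.5 * log_vars))

equal = uncertainty_weighted_loss([1.0, 1.0], [0.0, 0.0])  # both tasks weighted 1/2
noisy = uncertainty_weighted_loss([1.0, 1.0], [0.0, 2.0])  # second task down-weighted
```

The $+\tfrac{1}{2} s_i$ term prevents the trivial solution of inflating every task's uncertainty to zero out its loss.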
Mathematical Foundations
General Problem Formulation
Multi-task learning (MTL) extends the single-task learning paradigm by jointly optimizing multiple related tasks to leverage shared information, improving generalization across all tasks. In single-task learning, the objective is to minimize a loss function $L(\theta) + \lambda R(\theta)$, where $\theta$ represents the model parameters, $L(\theta)$ is the empirical loss on task-specific data, and $R(\theta)$ is a regularizer to prevent overfitting. MTL generalizes this to $T$ tasks by introducing shared parameters and task-specific components, formulating the problem as minimizing a composite loss $\sum_{t=1}^{T} w_t L_t(\theta) + \lambda R(\theta)$, where $L_t(\theta) = \frac{1}{n_t} \sum_{i=1}^{n_t} \ell_t(f_t(x_i^t; \theta), y_i^t)$ is the average loss for task $t$ over its dataset $\{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, $\ell_t$ is a task-specific loss (e.g., squared error or cross-entropy), $f_t$ maps inputs to outputs for task $t$, and $w_t \geq 0$ are weights balancing task contributions (often set to 1 for equal weighting). This joint optimization assumes tasks share a common parameter space, extending scalar-valued functions (single output) to vector-valued mappings across tasks without assuming kernel structures.[20][7] The tasks in MTL are assumed to be related through a shared latent structure, such as common input features or underlying representations that capture domain-specific patterns across the tasks. Formally, each task $t$ defines an input-output mapping from $\mathcal{X}_t$ to $\mathcal{Y}_t$, but homogeneity is often imposed where $\mathcal{X}_t = \mathcal{X}$ for all $t$ to enable parameter sharing; heterogeneous cases align features via transformations. This relatedness is crucial, as unrelated tasks can lead to interference rather than transfer, but the formulation exploits correlations in the joint data distribution to induce a beneficial bias in $\theta$.[20] Evaluation in MTL combines task-specific metrics, such as mean squared error for regression or accuracy for classification on held-out data per task, with MTL-specific measures like the avoidance of negative transfer, where performance on a target task degrades due to joint training with dissimilar tasks.
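The composite objective and its gradient can be written out directly for two linear regression tasks sharing a single parameter vector. The sketch below uses synthetic data, squared-error losses, and an L2 regularizer (all names and constants illustrative), and verifies that one gradient step on the weighted joint loss decreases it:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two tasks over a shared input space, sharing parameters theta
X1, y1 = rng.normal(size=(50, 3)), rng.normal(size=50)
X2, y2 = rng.normal(size=(80, 3)), rng.normal(size=80)
w = np.array([1.0, 1.0])   # task weights w_t
lam = 0.1                  # regularization strength lambda

def task_loss(X, y, theta):
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def composite_loss(theta):
    return (w[0] * task_loss(X1, y1, theta) + w[1] * task_loss(X2, y2, theta)
            + 0.5 * lam * theta @ theta)

def composite_grad(theta):
    g1 = X1.T @ (X1 @ theta - y1) / len(y1)   # gradient of task 1's average loss
    g2 = X2.T @ (X2 @ theta - y2) / len(y2)   # gradient of task 2's average loss
    return w[0] * g1 + w[1] * g2 + lam * theta

theta = np.zeros(3)
theta_next = theta - 0.1 * composite_grad(theta)  # one joint gradient step
```

Rescaling the entries of w changes which task's gradient dominates the shared update, which is precisely the imbalance that dynamic weighting schemes correct.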
To derive the role of weighting in optimization, consider the gradient of the composite loss, $\nabla_\theta \sum_{t=1}^{T} w_t L_t(\theta) = \sum_{t=1}^{T} w_t \nabla_\theta L_t(\theta)$, which aggregates task gradients scaled by $w_t$; unbalanced gradients can cause dominant tasks to overshadow others, leading to suboptimal convergence. Techniques like dynamic weighting adjust $w_t$ to normalize gradient magnitudes, ensuring equitable updates across tasks and mitigating negative transfer.[21]

Vector-Valued Reproducing Kernel Hilbert Spaces
Vector-valued reproducing kernel Hilbert spaces (RKHS) provide a functional analytic framework for multi-task learning (MTL) by extending scalar-valued RKHS to handle vector-valued outputs, enabling the modeling of multiple related tasks within a single Hilbert space of functions. Formally, a vector-valued RKHS $\mathcal{H}$ consists of functions $f: \mathcal{X} \to \mathbb{R}^T$, where $\mathcal{X}$ is the input space and $T$ denotes the number of tasks, equipped with a matrix-valued kernel $\Gamma: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$ that is positive semi-definite, meaning for any $n$, points $x_1, \ldots, x_n \in \mathcal{X}$, and vectors $c_1, \ldots, c_n \in \mathbb{R}^T$, the inequality $\sum_{i,j=1}^{n} c_i^\top \Gamma(x_i, x_j) c_j \geq 0$ holds. The kernel induces an inner product on $\mathcal{H}$ such that the space is complete, and the reproducing property states that for any $f \in \mathcal{H}$, $x \in \mathcal{X}$, and $c \in \mathbb{R}^T$, $\langle f, \Gamma(\cdot, x) c \rangle_{\mathcal{H}} = f(x)^\top c$, allowing point evaluations via inner products with kernel sections.[22][23] A common construction for vector-valued kernels in MTL is the separable kernel, particularly when tasks share identical input structures, given by $\Gamma(x, x') = k(x, x') I_T$, where $k$ is a positive definite scalar kernel (e.g., Gaussian or linear) and $I_T$ is the $T \times T$ identity matrix. This form assumes task independence in the output space while leveraging shared input representations, leading to an RKHS where functions decompose as $f = (f_1, \ldots, f_T)$ with each $f_t \in \mathcal{H}_k$, the scalar RKHS induced by $k$. The eigenvalue decomposition of the scalar kernel facilitates analysis; for instance, the Mercer decomposition extends to the vector-valued case, yielding an orthonormal basis for $\mathcal{H}$ built from the eigenfunctions and eigenvalues of $k$, which simplifies regularization and bounds on function norms. More general separable kernels incorporate task correlations via $\Gamma(x, x') = k(x, x') A$, where $A \in \mathbb{R}^{T \times T}$ is a fixed positive semi-definite task covariance matrix, capturing prior beliefs about task relatedness.[24] To incorporate known task structures, such as prior covariances between tasks, vector-valued kernels can be designed using sums of separable forms, $\Gamma(x, x') = \sum_{q=1}^{Q} k_q(x, x') A_q$, where each $A_q$ models a latent factor of task correlation and the $k_q$ are scalar kernels. This aligns with the linear model of coregionalization, where task outputs are linear combinations of shared latent functions.
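For a separable kernel $\Gamma(x, x') = k(x, x') A$, the representer-theorem solution reduces to linear algebra on a Kronecker-structured Gram matrix. The sketch below is illustrative, with an assumed Gaussian scalar kernel, a hand-chosen 2-task covariance matrix, and synthetic data; it solves vector-valued regularized least squares by direct inversion:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, lam = 20, 2, 0.1
X = rng.uniform(-1, 1, size=(n, 1))
A = np.array([[1.0, 0.8],    # assumed task covariance: strongly related tasks
              [0.8, 1.0]])

def k(x, xp):  # scalar Gaussian kernel
    return np.exp(-np.sum((x - xp) ** 2) / 0.5)

Kx = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
G = np.kron(A, Kx)   # (nT x nT) Gram matrix of the separable kernel

# Correlated targets for the two tasks, stacked task-major to match kron(A, Kx)
f0 = np.sin(3 * X[:, 0])
Y = np.concatenate([f0, f0 + 0.1 * rng.normal(size=n)])

# Regularized least squares: solve (G + lam * I) c = Y for the coefficients
c = np.linalg.solve(G + lam * np.eye(n * T), Y)

def predict(x_new):
    kx = np.array([k(x_new, X[j]) for j in range(n)])
    # sum_j Gamma(x_new, x_j) c_j, giving one prediction per task
    return np.kron(A, kx[None, :]) @ c
```

The Kronecker structure is what the eigendecomposition trick exploits: diagonalizing $A$ and $K_x$ separately avoids ever forming or inverting the full $nT \times nT$ matrix.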
Kronecker products further enable efficient kernel construction, particularly for the Gram matrix over $n$ training points, as $\mathbf{G} = A \otimes \mathbf{K}_x$ for a separable kernel, reducing inversion costs from $O(n^3 T^3)$ to $O(n^3 + T^3)$ via eigendecomposition of $A$ and $\mathbf{K}_x$. Such designs allow encoding domain knowledge, like hierarchical task relations, directly into the kernel operator.[24][23] Learning in vector-valued RKHS involves joint optimization over functions and task relations, typically minimizing empirical risk plus a regularizer, such as $\min_{f \in \mathcal{H}} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, where $V$ is a vector-valued loss (e.g., squared error) and $\lambda > 0$ controls complexity. The vector-valued representer theorem guarantees that the solution lies in the span of kernel sections: $f(x) = \sum_{i=1}^{n} \Gamma(x, x_i) c_i$ for coefficients $c_i \in \mathbb{R}^T$, reducing the infinite-dimensional problem to finite-dimensional linear algebra, solvable via kernel matrix inversion. This enables structured MTL by estimating task covariances alongside $f$, often through alternating optimization or Bayesian inference in Gaussian process views.[22][23] Despite these advantages, vector-valued RKHS face limitations in scalability, particularly for large $T$, as the kernel matrix scales as $nT \times nT$, leading to $O(n^3 T^3)$ time for regularized least squares, which becomes prohibitive beyond moderate task counts. Approximations developed in the 2010s, such as low-rank factorizations of the task covariance or Nyström methods for the input kernel, mitigate this by reducing effective dimensionality while preserving reproducing properties.[25][24]

Applications
Computer Vision and Image Processing
Multi-task learning (MTL) has been extensively applied in computer vision and image processing to jointly address interrelated tasks such as object detection and semantic segmentation, leveraging shared feature representations to enhance overall performance. A prominent use case is the integration of object detection and instance segmentation, as exemplified by Mask R-CNN, which extends Faster R-CNN by adding a mask prediction branch to perform both bounding box detection and pixel-level segmentation in a single forward pass.[26] Variants of Mask R-CNN, developed since 2017, have further optimized this joint learning for diverse scenarios, including real-time applications in autonomous driving and medical imaging, where simultaneous detection and segmentation reduce computational overhead compared to separate models.[27] Another key application involves semantic segmentation combined with depth estimation, enabling holistic scene understanding; for instance, QuadroNet employs MTL to jointly predict 2D object detection, semantic segmentation, depth estimation, and surface normals from monocular images, achieving real-time performance on edge hardware.[28] In terms of architectures, MTL in vision often relies on shared convolutional neural network (CNN) backbones followed by task-specific heads to extract common low-level features like edges and textures while allowing specialization for higher-level tasks. 
This hard-sharing approach, where the backbone parameters are jointly optimized across tasks, has been formalized in frameworks that demonstrate improved generalization over single-task models.[29] Adaptations of multi-task deep neural networks (MT-DNN) originally from natural language processing have evolved for vision tasks, incorporating transformer-based backbones since 2019 to handle multi-modal inputs; recent developments, such as vision transformer adapters, enable generalizable MTL by learning task affinities that transfer to unseen vision domains like medical and remote sensing imagery.[30] By 2025, these evolutions include weighted vision transformers that balance task contributions dynamically, supporting efficient joint training for segmentation and classification in resource-constrained environments.[31] The benefits of MTL in this domain are particularly evident in medical imaging, where shared representations lead to parameter efficiency and faster inference without sacrificing accuracy. 
For example, on the CheXpert dataset—a large collection of 224,316 chest radiographs with uncertainty labels for 14 pathologies—multi-label classification models trained via MTL outperform single-task baselines by exploiting hierarchical disease dependencies, achieving higher AUC scores while reducing the need for task-specific fine-tuning.[32] In multi-disease classification tasks, MTL has been demonstrated in semi-supervised approaches that leverage auxiliary tasks like segmentation to boost classification on limited labeled data.[33] Recent lightweight MTL models for edge devices, such as those presented at WACV 2025, further enable deployment on mobile hardware for real-time vision tasks; for instance, multi-task supervised compression models reduce computational requirements while maintaining or improving detection accuracy on benchmarks like COCO, facilitating applications in portable medical diagnostics and autonomous systems.[34] Despite these advantages, MTL in vision faces challenges like negative transfer, where optimizing for one task (e.g., high-resolution segmentation) degrades performance on another (e.g., low-level depth estimation) in diverse scenes such as varying lighting or occlusions. This issue arises from conflicting gradients in shared backbones.[35] Mitigation strategies include dynamic weighting of task losses during training, such as scaling by exponential moving averages of validation losses to prioritize beneficial tasks and suppress harmful ones, which has been shown to improve convergence and final performance in vision benchmarks like Cityscapes.[36] Lightweight transformer-based MTL models incorporate such dynamic re-weighting to adapt to scene variability, ensuring robust transfer across indoor-outdoor environments without extensive retraining.[37]

Natural Language Processing
Multi-task learning (MTL) in natural language processing (NLP) has prominently featured joint modeling of interrelated linguistic tasks, such as multi-label text classification, machine translation combined with summarization, and question answering paired with natural language inference. The GLUE benchmark, introduced in 2018, exemplifies this by aggregating nine diverse NLU tasks—including sentiment analysis for multi-label classification, natural language inference for entailment (e.g., MNLI and QQP datasets), and question answering (e.g., QNLI and SQuAD)—to evaluate models' ability to share linguistic knowledge across limited-data scenarios.[38] These tasks highlight MTL's role in capturing shared semantic and syntactic structures, enabling models to generalize better than single-task training on benchmarks like GLUE.[38] Modern MTL approaches in NLP leverage pretrained transformer models, reformulating tasks into unified formats for joint training. The T5 model, released in 2019, pioneered text-to-text transfer learning by treating all NLP tasks as text generation problems, allowing multi-task fine-tuning on datasets like translation (e.g., English-to-German) and summarization (e.g., CNN/Daily Mail), where task-specific prefixes guide the shared encoder-decoder architecture.[39] Its multilingual extension, mT5 from 2020, extends this to over 100 languages, supporting MTL for cross-lingual tasks such as translation and classification by pretraining on massively diverse corpora, achieving robust zero-shot transfer. 
From 2020 to 2025, these foundation models have integrated prompt-based alignment to enhance efficiency, as seen in frameworks like CrossPT (2025), which decomposes prompts into shared and task-specific components via attention mechanisms, improving cross-task transfer on GLUE-like benchmarks in low-resource settings.[40] The MTFPA framework (2025) further advances prompt-based MTL by hybrid alignment of task prompts, though primarily demonstrated in vision-language contexts adaptable to NLP.[41] Performance gains from MTL in NLP often stem from shared embeddings that capture common linguistic features, yielding improvements of 5-10% on downstream tasks like classification and inference by mitigating overfitting and leveraging auxiliary data.[42] Recent 2024-2025 studies on interleaved MTL, such as optimizing dataset combinations for large language models, report enhanced efficiency and up to 8% gains in biomedical NLP tasks (e.g., named entity recognition and relation extraction) through iterative selection of synergistic task mixtures, reducing training costs while boosting generalization.[43] MTL in NLP has evolved from sequence-based models like early transformers to post-2022 multimodal hybrids integrating text with vision, as in large multimodal models that jointly process textual inference and visual question answering for richer representations.[44] This shift enables hybrid tasks, such as captioning with entailment verification, building on shared encoders from T5-like architectures to handle interleaved text-image data streams.[44]

Other Domains and Emerging Uses
Multi-task learning (MTL) has found significant applications in scientific domains, particularly in bioinformatics for predicting protein-protein interactions (PPIs). A 2025 review highlights how deep learning advancements, including MTL frameworks, have improved PPI prediction accuracy by leveraging shared representations across interaction types from 2021 to 2025 data. For instance, the DeepPFP architecture employs MTL to simultaneously predict multiple protein functions, achieving superior performance over single-task baselines on benchmarks like CAFA3 by integrating evolutionary and structural features. Similarly, the MPBind model uses MTL to forecast binding sites for diverse partners such as proteins and DNA, demonstrating enhanced generalization in multi-partner interaction tasks.[45][46][47] In sensor-based anomaly detection, MTL addresses challenges in industrial monitoring by jointly modeling normal and anomalous patterns across heterogeneous sensors. The MLAD framework, introduced in 2025, clusters sensors via time-series analysis and applies cluster-constrained graph neural networks for representation learning, followed by multi-task anomaly scoring; it outperforms baselines like isolation forests by up to 15% in F1-score on datasets such as SWaT and WADI. This approach enables efficient detection in cyber-physical systems by sharing knowledge between clustering and detection tasks.[48] Industrial applications of MTL extend to recommendation systems, where it enhances user profiling by jointly optimizing personalization and preference modeling. A 2025 framework for joint group profiling and recommendation uses deep neural MTL to infer group behaviors from individual interactions, improving click-through rate predictions by 8-12% on real-world e-commerce data compared to independent models. 
In short-video platforms, user behavior-aware MTL integrates viewing history and engagement signals across tasks like click prediction and retention, yielding a 10% uplift in recommendation diversity.[49][50] In robotics, MTL facilitates perception-action integration for complex environments, particularly post-2019 advancements in autonomous systems. For vision-based driving, the FASNet model (2020) employs MTL with future state predictions to handle tasks like lane detection and trajectory forecasting, reducing collision risks by 20% in simulated urban scenarios over single-task networks. More recent work on robotic manipulators (2025) uses MTL in reinforcement learning to share policies across grasping and navigation, accelerating convergence by 30% in multi-task benchmarks like RLBench. These methods enable robots to transfer skills from perception to action, improving adaptability in dynamic settings.[51][52] Emerging trends in MTL involve its integration with foundation models for multimodal data processing, emphasizing interleaved paradigms since 2024. Multimodal task vectors (MTVs) enable many-shot learning in interleaved large multimodal models like QwenVL by aligning vision-language tasks, boosting zero-shot performance on benchmarks such as VQA by 15% through shared embeddings. A 2024 interfacing approach for foundation models creates interleaved shared spaces via multi-task multi-modal training, allowing seamless extension to new modalities with minimal fine-tuning. In sustainability, MTL enhances climate modeling by jointly predicting variables like precipitation and temperature; a 2022 MTL-NET model forecasts the Indian Ocean Dipole up to seven months ahead, surpassing dynamical models like CFSv2 in correlation scores by 0.1-0.2. 
Similarly, a 2023 MTL framework retrieves passive microwave precipitation and land surface temperature simultaneously, improving retrieval accuracy by 5-10% over univariate methods on GPM datasets.[53][54][55][56] Case studies in healthcare diagnostics illustrate MTL's efficiency gains for multi-disease prediction from imaging and text. A 2023 large image-text (LIT) model for CT scans uses MTL to jointly diagnose lung conditions by fusing radiological reports with images. In chronic disease prediction, a 2025 multimodal MTL network processes electronic health records and imaging for tasks like diabetes and cardiovascular risk assessment, achieving high AUC scores (e.g., 0.89 for diabetes) comparable to single-task models while leveraging multimodal data across nationwide cohorts. These applications highlight MTL's role in scalable diagnostics, with shared encoders enabling 20% faster inference in resource-constrained settings.[57][58]

Implementations
Software Libraries and Frameworks
Several open-source software libraries and frameworks facilitate the implementation of multi-task learning (MTL), providing tools for shared representations, task-specific heads, and joint optimization across classical machine learning and deep learning paradigms. These libraries emphasize modularity to support custom architectures while handling common MTL challenges like task imbalance through weighted losses and dynamic sampling.[59][60]

In the classical machine learning domain, scikit-multilearn offers a scikit-learn-compatible module for multi-label classification, which extends to MTL scenarios by treating tasks as interdependent labels. It supports algorithms like classifier chains and label powerset for joint prediction, leveraging sparse matrices for efficiency on large datasets. For instance, a basic setup involves wrapping a base estimator:

from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier

base_estimator = RandomForestClassifier()
model = ClassifierChain(base_estimator)
model.fit(X_train, y_train)  # y_train as a multi-label indicator matrix
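A comparable classifier-chain implementation also ships with scikit-learn itself (`sklearn.multioutput.ClassifierChain`); a minimal, runnable sketch on synthetic multi-label data, with dataset sizes and variable names chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label problem: 200 samples, 3 interdependent labels
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=3, random_state=0)

# Each label's classifier also sees the predictions for earlier labels
# in the chain, letting related "tasks" inform one another
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0)
chain.fit(X, Y)
preds = chain.predict(X)
print(preds.shape)  # (200, 3): one 0/1 prediction per label
```

The chain ordering matters: later labels benefit from earlier predictions, which is the mechanism by which the joint solution can outperform independent per-label classifiers.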
For deep learning, the PyTorch-based LibMTL library provides trainers, loss-weighting strategies, and sharing architectures for MTL. A shared encoder with task-specific heads can be defined as follows (the Trainer call is shown schematically; LibMTL's actual Trainer takes additional configuration arguments):

import torch
import torch.nn as nn
from LibMTL import Trainer

class SharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared feature extractor used by all tasks
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        # Task-specific output heads
        self.task_heads = nn.ModuleDict({
            'task1': nn.Linear(128, 10),
            'task2': nn.Linear(128, 2)
        })

    def forward(self, x):
        features = self.encoder(x)
        return {task: head(features) for task, head in self.task_heads.items()}

model = SharedEncoder()
trainer = Trainer(model, tasks=['task1', 'task2'], weight='uw')  # uncertainty weighting
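The 'uw' option above refers to uncertainty weighting (Kendall et al., 2018), which scales each task loss by a learned homoscedastic-uncertainty term rather than a fixed weight. A framework-agnostic sketch of the combined objective, with the function name invented here for illustration:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a per-task log-variance.
    In a real trainer, each s_i is a learnable parameter."""
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

# With s_i = 0 (sigma_i = 1), the combined loss is just the plain sum
total = uncertainty_weighted_loss([1.2, 0.4], [0.0, 0.0])
print(total)  # 1.6 (up to floating-point rounding)
```

The exp(-s_i) factor down-weights noisy (high-variance) tasks automatically, while the additive s_i term penalizes the trivial solution of inflating every variance.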
In TensorFlow, the Keras functional API expresses MTL as a multi-output model with per-task losses and loss weights:

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(784,))
shared = keras.layers.Dense(128, activation='relu')(inputs)
task1 = keras.layers.Dense(10, name='task1')(shared)
task2 = keras.layers.Dense(2, name='task2')(shared)
model = keras.Model(inputs=inputs, outputs=[task1, task2])
model.compile(optimizer='adam',
              loss={'task1': 'sparse_categorical_crossentropy', 'task2': 'binary_crossentropy'},
              loss_weights={'task1': 1.0, 'task2': 0.5})
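Keras reduces the dictionary of per-task losses to a single training objective by summing them scaled by their loss_weights. The arithmetic, sketched with illustrative loss values:

```python
def combined_loss(task_losses, loss_weights):
    # Total objective = sum over tasks of weight * per-task loss
    return sum(loss_weights[name] * loss for name, loss in task_losses.items())

total = combined_loss({'task1': 2.0, 'task2': 1.0},
                      {'task1': 1.0, 'task2': 0.5})
print(total)  # 2.5
```

Tuning these static weights is the simplest remedy for task imbalance; dynamic schemes such as the uncertainty weighting used above learn them during training instead.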

