Virtual screening

Virtual screening (VS) is a computational technique used in drug discovery to search libraries of small molecules in order to identify those structures which are most likely to bind to a drug target, typically a protein receptor or enzyme.[2][3]
Virtual screening has been defined as "automatically evaluating very large libraries of compounds" using computer programs.[4] As this definition suggests, VS has largely been a numbers game focusing on how the enormous chemical space of over $10^{60}$ conceivable compounds[5] can be filtered to a manageable number that can be synthesized, purchased, and tested. Although searching the entire chemical universe may be a theoretically interesting problem, more practical VS scenarios focus on designing and optimizing targeted combinatorial libraries and enriching libraries of available compounds from in-house compound repositories or vendor offerings. As the accuracy of the method has increased, virtual screening has become an integral part of the drug discovery process.[6][1] Virtual screening can be used to select in-house database compounds for screening, to choose compounds that can be purchased externally, and to choose which compound should be synthesized next.
Methods
There are two broad categories of screening techniques: ligand-based and structure-based.[7] The remainder of this page will reflect Figure 1, the flow chart of virtual screening.
Ligand-based methods
Given a set of structurally diverse ligands that bind to a receptor, a model of the receptor, known as a pharmacophore model, can be built by exploiting the collective information contained in that set of ligands. Different computational techniques explore the structural, electronic, molecular shape, and physicochemical similarities of different ligands that could imply their mode of action against a specific molecular receptor or cell line.[8] A candidate ligand can then be compared to the pharmacophore model to determine whether it is compatible with it and therefore likely to bind.[9] Different 2D chemical similarity analysis methods[10] have been used to scan databases to find active ligands. Another popular approach in ligand-based virtual screening consists of searching for molecules with a shape similar to that of known actives, as such molecules will fit the target's binding site and hence be likely to bind the target. There are a number of prospective applications of this class of techniques in the literature.[11][12][13] Pharmacophoric extensions of these 3D methods are also freely available as web servers.[14][15] Shape-based virtual screening has also gained significant popularity.[16]
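The 2D similarity searching mentioned above can be illustrated with a minimal sketch. It assumes that each molecule has already been reduced to a fingerprint, represented here as a set of "on" bit positions, and it uses the Tanimoto coefficient as the similarity measure. The fingerprints, compound names, and the 0.5 threshold are hypothetical values chosen for illustration, not taken from any real library or toolkit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints: each integer stands for one structural feature bit.
query_fp = frozenset({1, 4, 7, 9})
toy_library = {
    "cmpd_A": frozenset({1, 4, 7, 9}),   # identical bits
    "cmpd_B": frozenset({1, 4, 7, 12}),  # 3 shared bits out of 5 in total
    "cmpd_C": frozenset({2, 3, 5}),      # no shared bits
}
ranked = similarity_search(query_fp, toy_library, threshold=0.5)
```

In practice the fingerprints would come from a cheminformatics toolkit; the set-based representation is used here only to keep the example self-contained.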
Structure-based methods
Structure-based virtual screening includes different computational techniques that consider the structure of the receptor that is the molecular target of the investigated active ligands. Some of these techniques include molecular docking, structure-based pharmacophore prediction, and molecular dynamics simulations.[17][18][8] Molecular docking is the most widely used structure-based technique; it applies a scoring function to estimate the fitness of each ligand against the binding site of the macromolecular receptor, helping to select the ligands with the highest affinity.[19][20][21] Currently, there are some web servers oriented to prospective virtual screening.[22][23]
Hybrid methods
Hybrid methods that rely on structural and ligand similarity were also developed to overcome the limitations of traditional virtual ligand screening (VLS) approaches. These methodologies utilize evolution-based ligand-binding information to predict small-molecule binders[24][25] and can employ both global structural similarity and pocket similarity.[24] A global structural similarity-based approach employs either an experimental structure or a predicted protein model to find structural similarity with proteins in the PDB holo-template library. Upon detecting significant structural similarity, a Tanimoto coefficient metric based on 2D fingerprints is applied to screen for small molecules that are similar to ligands extracted from the selected holo PDB templates.[26][27] The predictions from this method have been experimentally assessed and show good enrichment in identifying active small molecules.
The above specified method depends on global structural similarity and is not capable of a priori selecting a particular ligand‐binding site in the protein of interest. Further, since the methods rely on 2D similarity assessment for ligands, they are not capable of recognizing stereochemical similarity of small-molecules that are substantially different but demonstrate geometric shape similarity. To address these concerns, a new pocket centric approach, PoLi, capable of targeting specific binding pockets in holo‐protein templates, was developed and experimentally assessed.
Computing infrastructure
The computation of pair-wise interactions between atoms, which is a prerequisite for the operation of many virtual screening programs, scales as $O(N^2)$, where $N$ is the number of atoms in the system. Because of this quadratic scaling, the computational costs increase quickly.
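The quadratic growth can be made concrete with a short sketch. It counts the unique atom pairs, $N(N-1)/2$, and computes all pairwise distances for a toy set of coordinates. The function names and the coordinates are hypothetical, and real programs typically reduce this cost with cutoffs or grids, which are not shown here.

```python
from itertools import combinations


def count_pairs(n_atoms: int) -> int:
    """Number of unique atom pairs, N * (N - 1) / 2."""
    return n_atoms * (n_atoms - 1) // 2


def pairwise_distances(coords):
    """Euclidean distance for every unique pair of 3D points."""
    distances = []
    for (x1, y1, z1), (x2, y2, z2) in combinations(coords, 2):
        distances.append(((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2) ** 0.5)
    return distances


# Doubling the number of atoms roughly quadruples the number of pairs.
pairs_1k = count_pairs(1000)  # 499500
pairs_2k = count_pairs(2000)  # 1999000

toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
toy_distances = pairwise_distances(toy_coords)
```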
Ligand-based approach
Ligand-based methods typically require a fraction of a second for a single structure comparison operation. Sometimes a single CPU is enough to perform a large screening within hours. However, several comparisons can be made in parallel in order to expedite the processing of a large database of compounds.
Structure-based approach
The size of the task requires a parallel computing infrastructure, such as a cluster of Linux systems, running a batch queue processor to handle the work, such as Sun Grid Engine or Torque PBS.
A means of handling the input from large compound libraries is needed. This requires a form of compound database that can be queried by the parallel cluster, delivering compounds in parallel to the various compute nodes. Commercial database engines may be too ponderous, and a high speed indexing engine, such as Berkeley DB, may be a better choice. Furthermore, it may not be efficient to run one comparison per job, because the ramp up time of the cluster nodes could easily outstrip the amount of useful work. To work around this, it is necessary to process batches of compounds in each cluster job, aggregating the results into some kind of log file. A secondary process, to mine the log files and extract high scoring candidates, can then be run after the whole experiment has been run.
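The batching strategy described above can be sketched as follows. This is a single-process illustration, not a cluster implementation: `run_job` stands in for one cluster job that scores a whole batch and emits log lines, and `top_candidates` plays the role of the secondary process that mines the logs. The scoring function `toy_score`, the tab-separated log format, the compound identifiers, and the convention that higher scores are better are all assumptions made for the example.

```python
import heapq
from itertools import islice


def batched(compounds, batch_size):
    """Yield successive lists of at most batch_size compounds."""
    iterator = iter(compounds)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_job(batch, score_fn):
    """Stand-in for one cluster job: score a batch, return log lines."""
    return [f"{cid}\t{score_fn(cid):.3f}" for cid in batch]


def top_candidates(log_lines, k):
    """Secondary pass: parse the aggregated log and keep the k best scores."""
    records = []
    for line in log_lines:
        cid, score = line.split("\t")
        records.append((float(score), cid))
    return [(cid, score) for score, cid in heapq.nlargest(k, records)]


def toy_score(cid):
    """Hypothetical placeholder for a real docking score."""
    return (int(cid[4:]) * 3) % 11


library = [f"cmpd{i}" for i in range(10)]
log = []
for batch in batched(library, batch_size=4):  # 3 jobs instead of 10
    log.extend(run_job(batch, toy_score))
best = top_candidates(log, k=3)
```

Grouping compounds this way amortizes the per-job start-up cost over the whole batch, which is the motivation given in the text.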
Accuracy
The aim of virtual screening is to identify molecules of novel chemical structure that bind to the macromolecular target of interest. Thus, success of a virtual screen is defined in terms of finding interesting new scaffolds rather than the total number of hits. Interpretations of virtual screening accuracy should, therefore, be considered with caution. Low hit rates of interesting scaffolds are clearly preferable over high hit rates of already known scaffolds.
Most tests of virtual screening studies in the literature are retrospective. In these studies, the performance of a VS technique is measured by its ability to retrieve a small set of previously known molecules with affinity to the target of interest (active molecules or just actives) from a library containing a much higher proportion of assumed inactives or decoys. There are several distinct ways to select decoys by matching the properties of the corresponding active molecule[28] and more recently decoys are also selected in a property-unmatched manner.[29] The actual impact of decoy selection, either for training or testing purposes, has also been discussed.[29][30]
By contrast, in prospective applications of virtual screening, the resulting hits are subjected to experimental confirmation (e.g., IC50 measurements). There is consensus that retrospective benchmarks are not good predictors of prospective performance and consequently only prospective studies constitute conclusive proof of the suitability of a technique for a particular target.[31][32][33][34][35]
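A retrospective benchmark of the kind described above is often summarized by an enrichment factor, which compares the fraction of actives found in the top-ranked portion of the library with the fraction expected by chance. The text does not name this metric, so its use here is an assumption, and the ranked labels below are invented for illustration.

```python
def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor for the top `fraction` of a ranked list.

    ranked_labels: 1 for an active, 0 for a decoy, ordered best score first.
    """
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, int(round(n_total * fraction)))
    hits_top = sum(ranked_labels[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Toy ranking: 20 compounds, 4 actives, 3 of them ranked in the top 4.
toy_ranking = [1, 1, 0, 1] + [0] * 6 + [1] + [0] * 9
ef_20 = enrichment_factor(toy_ranking, fraction=0.2)
```

A value of 1 corresponds to random selection. As the paragraph above notes, a high retrospective value is not conclusive evidence of prospective performance.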
Application to drug discovery
Virtual screening is a useful tool for identifying hit molecules as a starting point for medicinal chemistry. As virtual screening has become a more vital and substantial technique within the medicinal chemistry industry, its use has increased rapidly.[36]
Ligand-based methods
Ligand-based methods attempt to predict how ligands will bind to the receptor without knowledge of the receptor structure. Pharmacophore features, such as hydrogen-bond donors and acceptors, are identified for each ligand. Equivalent features are then overlaid, although it is unlikely that there is a single correct solution.[1]
Pharmacophore models
This technique is used when merging the results of searches that use different reference compounds with the same descriptors and similarity coefficient. It is beneficial because it is more effective than using a single reference structure and gives the most accurate performance when the actives are diverse.[1]
A pharmacophore is an ensemble of steric and electronic features that are needed to ensure optimal supramolecular interactions with a biological target structure in order to trigger its biological response. A representative set of actives is chosen, and most methods then look for similar binding features among them.[37] It is preferable to use multiple rigid molecules, and the ligands should be diverse; in other words, they should differ in the features that are not involved in binding.[1]
Shape-based virtual screening
Shape-based molecular similarity approaches have been established as important and popular virtual screening techniques. At present, the highly optimized screening platform ROCS (Rapid Overlay of Chemical Structures) is considered the de facto industry standard for shape-based, ligand-centric virtual screening.[38][39][40] It uses a Gaussian function to define the molecular volumes of small organic molecules. The selection of the query conformation is less important, which makes shape-based screening well suited to ligand-based modeling: the availability of a bioactive conformation for the query is not the limiting factor for screening; rather, the selection of query compound(s) is decisive for screening performance.[16] Other shape-based molecular similarity methods, such as AutoDock-SS, have also been developed.[41]
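The idea of describing molecular volume with Gaussian functions can be sketched as follows. Each atom is modeled as a spherical Gaussian, the overlap of two molecules is approximated by summing the analytic overlap integrals of all atom pairs, and a shape Tanimoto score is derived from these overlaps. This is a simplified illustration and not the ROCS implementation: it keeps only pairwise (first-order) terms, uses one hypothetical width `alpha` and amplitude `p` for every atom, and assumes the two molecules are already aligned, whereas a real tool would also optimize the overlay.

```python
import math


def overlap_volume(mol_a, mol_b, alpha=0.8, p=1.0):
    """Sum of pairwise overlap integrals of identical spherical Gaussians.

    For two Gaussians p * exp(-alpha * r^2) whose centers are d apart, the
    overlap integral is p^2 * (pi / (2 * alpha))^1.5 * exp(-alpha * d^2 / 2).
    """
    prefactor = p * p * (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for ax, ay, az in mol_a:
        for bx, by, bz in mol_b:
            d2 = (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
            total += prefactor * math.exp(-0.5 * alpha * d2)
    return total


def shape_tanimoto(mol_a, mol_b):
    """Shape Tanimoto: O_AB / (O_AA + O_BB - O_AB)."""
    o_ab = overlap_volume(mol_a, mol_b)
    o_aa = overlap_volume(mol_a, mol_a)
    o_bb = overlap_volume(mol_b, mol_b)
    return o_ab / (o_aa + o_bb - o_ab)


# Toy three-atom "molecules" given as atom-center coordinates in angstroms.
query = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
same_pose = list(query)
far_away = [(x, y + 10.0, z) for x, y, z in query]

score_same = shape_tanimoto(query, same_pose)  # identical shape and pose
score_far = shape_tanimoto(query, far_away)    # essentially no overlap
```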
Field-based virtual screening
As an improvement to shape-based similarity methods, field-based methods try to take into account all the fields that influence a ligand-receptor interaction while being agnostic of the chemical structure used as a query. Various fields are used in these methods, such as electrostatic or hydrophobic fields.[42][43]
Quantitative structure-activity relationship
Quantitative structure-activity relationship (QSAR) models are predictive models based on information extracted from a set of known active and known inactive compounds.[44] In structure-activity relationships (SARs), by contrast, the data are treated qualitatively, and they can be used with structural classes and more than one binding mode. The models prioritize compounds for lead discovery.[1]
Machine learning algorithms
Machine learning algorithms have been widely used in virtual screening approaches. Supervised learning techniques use training and test datasets composed of known active and known inactive compounds. Different ML algorithms have been applied with success in virtual screening strategies, such as recursive partitioning, support vector machines, random forest, k-nearest neighbors and neural networks.[45][46][47] These models estimate the probability that a compound is active and then rank each compound based on its probability.[1]
Substructural analysis in machine learning
The first machine learning model used on large datasets was substructural analysis, which was created in 1973. Each fragment substructure makes a continuous contribution to an activity of a specific type.[1] Substructure analysis is a method that overcomes the difficulty of massive dimensionality when analyzing structures in drug design. An efficient substructure analysis is used for structures that have similarities to a multi-level building or tower. Geometry is used for numbering the boundary joints of a given structure from the onset towards the climax. When special static condensation and substitution routines are developed, this method proves to be more productive than previous substructure analysis models.[48]
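The idea that each fragment contributes to activity can be illustrated with a small sketch. It assumes one simple weighting scheme, in which a fragment's weight is the fraction of the training molecules containing it that are active, and it scores a new molecule by summing the weights of its fragments. The weighting formula, the fragment names, and the training data are assumptions made for illustration and are not taken from the cited sources.

```python
from collections import Counter


def fragment_weights(training):
    """Weight of each fragment = actives containing it / molecules containing it.

    training: iterable of (fragments, is_active) pairs, where fragments is a
    set of fragment names.
    """
    seen = Counter()
    active = Counter()
    for fragments, is_active in training:
        for fragment in fragments:
            seen[fragment] += 1
            if is_active:
                active[fragment] += 1
    return {fragment: active[fragment] / seen[fragment] for fragment in seen}


def score_molecule(fragments, weights):
    """Sum the weights of known fragments; unseen fragments contribute 0."""
    return sum(weights.get(fragment, 0.0) for fragment in fragments)


# Hypothetical training set of fragment sets with activity labels.
toy_training = [
    ({"amide", "phenyl"}, True),
    ({"amide", "nitro"}, True),
    ({"phenyl", "nitro"}, False),
    ({"nitro"}, False),
]
weights = fragment_weights(toy_training)
score_1 = score_molecule({"amide", "phenyl"}, weights)
score_2 = score_molecule({"nitro", "ester"}, weights)
```

Ranking a library by such scores places molecules rich in fragments associated with activity at the top of the list.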
Recursive partitioning
Recursive partitioning is a method that creates a decision tree using qualitative data. It looks for rules that split the classes with a low misclassification error, repeating each step until no sensible splits can be found. However, recursive partitioning can have poor predictive ability, even while producing models that appear to fit well.[1]
Structure-based methods: protein-ligand docking
The binding of a ligand in an active site within a protein can be predicted by using a docking search algorithm and a scoring function, in order to identify the most likely pose for each individual ligand while assigning a priority order.[1][49]
References
- ^ a b c d e f g h i j Gillet V (2013). "Ligand-Based and Structure-Based Virtual Screening" (PDF). The University of Sheffield.
- ^ Rester U (July 2008). "From virtuality to reality - Virtual screening in lead discovery and lead optimization: a medicinal chemistry perspective". Current Opinion in Drug Discovery & Development. 11 (4): 559–68. PMID 18600572.
- ^ Rollinger JM, Stuppner H, Langer T (2008). "Virtual screening for the discovery of bioactive natural products". Natural Compounds as Drugs Volume I. Progress in Drug Research. Fortschritte der Arzneimittelforschung. Progrès des Recherches Pharmaceutiques. Vol. 65. pp. 211, 213–49. doi:10.1007/978-3-7643-8117-2_6. ISBN 978-3-7643-8098-4. PMC 7124045. PMID 18084917.
- ^ Walters WP, Stahl MT, Murcko MA (1998). "Virtual screening – an overview". Drug Discov. Today. 3 (4): 160–178. doi:10.1016/S1359-6446(97)01163-X.
- ^ Bohacek RS, McMartin C, Guida WC (1996). "The art and practice of structure-based drug design: a molecular modeling perspective". Med. Res. Rev. 16 (1): 3–50. doi:10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6. PMID 8788213.
- ^ McGregor MJ, Luo Z, Jiang X (June 11, 2007). "Chapter 3: Virtual screening in drug discovery". In Huang Z (ed.). Drug Discovery Research. New Frontiers in the Post-Genomic Era. Wiley-VCH: Weinheim, Germany. pp. 63–88. ISBN 978-0-471-67200-5.
- ^ McInnes C (October 2007). "Virtual screening strategies in drug discovery". Current Opinion in Chemical Biology. 11 (5): 494–502. doi:10.1016/j.cbpa.2007.08.033. PMID 17936059.
- ^ a b Santana K, do Nascimento LD, Lima e Lima A, Damasceno V, Nahum C, Braga RC, et al. (2021-04-29). "Applications of Virtual Screening in Bioprospecting: Facts, Shifts, and Perspectives to Explore the Chemo-Structural Diversity of Natural Products". Frontiers in Chemistry. 9 662688. Bibcode:2021FrCh....9..155S. doi:10.3389/fchem.2021.662688. ISSN 2296-2646. PMC 8117418. PMID 33996755.
- ^ Sun H (2008). "Pharmacophore-based virtual screening". Current Medicinal Chemistry. 15 (10): 1018–24. doi:10.2174/092986708784049630. PMID 18393859.
- ^ Willet P, Barnard JM, Downs GM (1998). "Chemical similarity searching". Journal of Chemical Information and Computer Sciences. 38 (6): 983–996. CiteSeerX 10.1.1.453.1788. doi:10.1021/ci9800211.
- ^ Rush TS, Grant JA, Mosyak L, Nicholls A (March 2005). "A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction". Journal of Medicinal Chemistry. 48 (5): 1489–95. CiteSeerX 10.1.1.455.4728. doi:10.1021/jm040163o. PMID 15743191.
- ^ Ballester PJ, Westwood I, Laurieri N, Sim E, Richards WG (February 2010). "Prospective virtual screening with Ultrafast Shape Recognition: the identification of novel inhibitors of arylamine N-acetyltransferases". Journal of the Royal Society, Interface. 7 (43): 335–42. doi:10.1098/rsif.2009.0170. PMC 2842611. PMID 19586957.
- ^ Kumar A, Zhang KY (2018). "Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery". Frontiers in Chemistry. 6 315. Bibcode:2018FrCh....6..315K. doi:10.3389/fchem.2018.00315. PMC 6068280. PMID 30090808.
- ^ Li H, Leung KS, Wong MH, Ballester PJ (July 2016). "USR-VS: a web server for large-scale prospective virtual screening using ultrafast shape recognition techniques". Nucleic Acids Research. 44 (W1): W436–41. doi:10.1093/nar/gkw320. PMC 4987897. PMID 27106057.
- ^ Sperandio O, Petitjean M, Tuffery P (July 2009). "wwLigCSRre: a 3D ligand-based server for hit identification and optimization". Nucleic Acids Research. 37 (Web Server issue): W504–9. doi:10.1093/nar/gkp324. PMC 2703967. PMID 19429687.
- ^ a b Kirchmair J, Distinto S, Markt P, Schuster D, Spitzer GM, Liedl KR, et al. (2009). "How To Optimize Shape-Based Virtual Screening: Choosing the Right Query and Including Chemical Information". Journal of Chemical Information and Modeling. 49 (3): 678–692. doi:10.1021/ci8004226. PMID 19434901.
- ^ Toledo Warshaviak D, Golan G, Borrelli KW, Zhu K, Kalid O (July 2014). "Structure-based virtual screening approach for discovery of covalently bound ligands". Journal of Chemical Information and Modeling. 54 (7): 1941–50. doi:10.1021/ci500175r. PMID 24932913.
- ^ Maia EH, Assis LC, de Oliveira TA, da Silva AM, Taranto AG (2020-04-28). "Structure-Based Virtual Screening: From Classical to Artificial Intelligence". Frontiers in Chemistry. 8 343. Bibcode:2020FrCh....8..343M. doi:10.3389/fchem.2020.00343. PMC 7200080. PMID 32411671.
- ^ Kroemer RT (August 2007). "Structure-based drug design: docking and scoring". Current Protein & Peptide Science. 8 (4): 312–28. CiteSeerX 10.1.1.225.959. doi:10.2174/138920307781369382. PMID 17696866.
- ^ Cavasotto CN, Orry AJ (2007). "Ligand docking and structure-based virtual screening in drug discovery". Current Topics in Medicinal Chemistry. 7 (10): 1006–14. doi:10.2174/156802607780906753. PMID 17508934.
- ^ Kooistra AJ, Vischer HF, McNaught-Flores D, Leurs R, de Esch IJ, de Graaf C (June 2016). "Function-specific virtual screening for GPCR ligands using a combined scoring method". Scientific Reports. 6 28288. Bibcode:2016NatSR...628288K. doi:10.1038/srep28288. PMC 4919634. PMID 27339552.
- ^ Irwin JJ, Shoichet BK, Mysinger MM, Huang N, Colizzi F, Wassam P, et al. (September 2009). "Automated docking screens: a feasibility study". Journal of Medicinal Chemistry. 52 (18): 5712–20. doi:10.1021/jm9006966. PMC 2745826. PMID 19719084.
- ^ Li H, Leung KS, Ballester PJ, Wong MH (2014-01-24). "istar: a web platform for large-scale protein-ligand docking". PLOS ONE. 9 (1) e85678. Bibcode:2014PLoSO...985678L. doi:10.1371/journal.pone.0085678. PMC 3901662. PMID 24475049.
- ^ a b Zhou H, Skolnick J (January 2013). "FINDSITE(comb): a threading/structure-based, proteomic-scale virtual ligand screening approach". Journal of Chemical Information and Modeling. 53 (1): 230–40. doi:10.1021/ci300510n. PMC 3557555. PMID 23240691.
- ^ Roy A, Skolnick J (February 2015). "LIGSIFT: an open-source tool for ligand structural alignment and virtual screening". Bioinformatics. 31 (4): 539–44. doi:10.1093/bioinformatics/btu692. PMC 4325547. PMID 25336501.
- ^ Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, et al. (January 2012). "ChEMBL: a large-scale bioactivity database for drug discovery". Nucleic Acids Research. 40 (Database issue): D1100–7. doi:10.1093/nar/gkr777. PMC 3245175. PMID 21948594.
- ^ Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, et al. (January 2006). "DrugBank: a comprehensive resource for in silico drug discovery and exploration". Nucleic Acids Research. 34 (Database issue): D668–72. doi:10.1093/nar/gkj067. PMC 1347430. PMID 16381955.
- ^ Réau M, Langenfeld F, Zagury JF, Lagarde N, Montes M (2018). "Decoys Selection in Benchmarking Datasets: Overview and Perspectives". Frontiers in Pharmacology. 9 11. doi:10.3389/fphar.2018.00011. PMC 5787549. PMID 29416509.
- ^ a b Ballester PJ (December 2019). "Selecting machine-learning scoring functions for structure-based virtual screening". Drug Discovery Today: Technologies. 32–33: 81–87. doi:10.1016/j.ddtec.2020.09.001. PMID 33386098. S2CID 224968364.
- ^ Li H, Sze KH, Lu G, Ballester PJ (2021). "Machine-learning scoring functions for structure-based virtual screening". WIREs Computational Molecular Science. 11 (1) e1478. doi:10.1002/wcms.1478. ISSN 1759-0884. S2CID 219089637.
- ^ Wallach I, Heifets A (2018). "Most Ligand-based classification benchmarks reward memorization rather than generalization". Journal of Chemical Information and Modeling. 58 (5): 916–932. arXiv:1706.06619. doi:10.1021/acs.jcim.7b00403. PMID 29698607. S2CID 195345933.
- ^ Irwin JJ (2008). "Community benchmarks for virtual screening". Journal of Computer-Aided Molecular Design. 22 (3–4): 193–9. Bibcode:2008JCAMD..22..193I. doi:10.1007/s10822-008-9189-4. PMID 18273555. S2CID 26260725.
- ^ Good AC, Oprea TI (2008). "Optimization of CAMD techniques 3. Virtual screening enrichment studies: a help or hindrance in tool selection?". Journal of Computer-Aided Molecular Design. 22 (3–4): 169–78. Bibcode:2008JCAMD..22..169G. doi:10.1007/s10822-007-9167-2. PMID 18188508. S2CID 7738182.
- ^ Schneider G (April 2010). "Virtual screening: an endless staircase?". Nature Reviews. Drug Discovery. 9 (4): 273–6. doi:10.1038/nrd3139. PMID 20357802. S2CID 205477076.
- ^ Ballester PJ (January 2011). "Ultrafast shape recognition: method and applications". Future Medicinal Chemistry. 3 (1): 65–78. doi:10.4155/fmc.10.280. PMID 21428826.
- ^ Lavecchia A, Di Giovanni C (2013). "Virtual screening strategies in drug discovery: a critical review". Current Medicinal Chemistry. 20 (23): 2839–60. doi:10.2174/09298673113209990001. PMID 23651302.
- ^ Spitzer GM, Heiss M, Mangold M, Markt P, Kirchmair J, Wolber G, et al. (2010). "One concept, three implementations of 3D pharmacophore-based virtual screening: distinct coverage of chemical search space". Journal of Chemical Information and Modeling. 50 (7): 1241–1247. doi:10.1021/ci100136b. PMID 20583761.
- ^ Grant JA, Gallard MA, Pickup BT (1996). "A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape". Journal of Computational Chemistry. 17 (14): 1653–1666. doi:10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.
- ^ Nicholls A, Grant JA (2005). "Molecular shape and electrostatics in the encoding of relevant chemical information". Journal of Computer-Aided Molecular Design. 19 (9–10): 661–686. Bibcode:2005JCAMD..19..661N. doi:10.1007/s10822-005-9019-x. PMID 16328855.
- ^ Rush TS, Grant JA, Mosyak L, Nicholls A (2005). "A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction". Journal of Medicinal Chemistry. 48 (5): 1489–1495. doi:10.1021/jm040163o. PMID 15743191.
- ^ Ni B, Wang H, Khalaf HK, Blay V, Houston DR (May 2024). "AutoDock-SS: AutoDock for Multiconformational Ligand-Based Virtual Screening". Journal of Chemical Information and Modeling. 64 (9): 3779–3789. doi:10.1021/acs.jcim.4c00136. PMC 11094722. PMID 38624083.
- ^ Cheeseright TJ, Mackey MD, Melville JL, Vinter JG (November 2008). "FieldScreen: virtual screening using molecular fields. Application to the DUD data set". Journal of Chemical Information and Modeling. 48 (11): 2108–2117. doi:10.1021/ci800110p. PMID 18991371.
- ^ Lang S, Slater MJ (May 2024). "Virtual Screening Strategies for Identifying Novel Chemotypes". Journal of Medicinal Chemistry. 67 (9): 6897–6898. doi:10.1021/acs.jmedchem.4c00906. PMID 38654500.
- ^ Neves BJ, Braga RC, Melo-Filho CC, Moreira-Filho JT, Muratov EN, Andrade CH (2018-11-13). "QSAR-Based Virtual Screening: Advances and Applications in Drug Discovery". Frontiers in Pharmacology. 9 1275. doi:10.3389/fphar.2018.01275. PMC 6262347. PMID 30524275.
- ^ Alsenan S, Al-Turaiki I, Hafez A (December 2020). "A Recurrent Neural Network model to predict blood-brain barrier permeability". Computational Biology and Chemistry. 89 107377. doi:10.1016/j.compbiolchem.2020.107377. PMID 33010784.
- ^ Dimitri GM, Lió P (June 2017). "DrugClust: A machine learning approach for drugs side effects prediction". Computational Biology and Chemistry. 68: 204–210. doi:10.1016/j.compbiolchem.2017.03.008. PMID 28391063.
- ^ Shoombuatong W, Schaduangrat N, Pratiwi R, Nantasenamat C (June 2019). "THPep: A machine learning-based approach for predicting tumor homing peptides". Computational Biology and Chemistry. 80: 441–451. doi:10.1016/j.compbiolchem.2019.05.008. PMID 31151025.
- ^ Gurujee CS, Deshpande VL (February 1978). "An improved method of substructure analysis". Computers & Structures. 8 (1): 147–152. doi:10.1016/0045-7949(78)90171-2.
- ^ Pradeepkiran JA, Reddy PH (March 2019). "Structure Based Design and Molecular Docking Studies for Phosphorylated Tau Inhibitors in Alzheimer's Disease". Cells. 8 (3): 260. doi:10.3390/cells8030260. PMC 6468864. PMID 30893872.
Further reading
- Melagraki G, Afantitis A, Sarimveis H, Koutentis PA, Markopoulos J, Igglessi-Markopoulou O (May 2007). "Optimization of biaryl piperidine and 4-amino-2-biarylurea MCH1 receptor antagonists using QSAR modeling, classification techniques and virtual screening". Journal of Computer-Aided Molecular Design. 21 (5): 251–67. Bibcode:2007JCAMD..21..251M. doi:10.1007/s10822-007-9112-4. PMID 17377847. S2CID 19563229.
- Afantitis A, Melagraki G, Sarimveis H, Koutentis PA, Markopoulos J, Igglessi-Markopoulou O (February 2006). "Investigation of substituent effect of 1-(3,3-diphenylpropyl)-piperidinyl phenylacetamides on CCR5 binding affinity using QSAR and virtual screening techniques". Journal of Computer-Aided Molecular Design. 20 (2): 83–95. Bibcode:2006JCAMD..20...83A. CiteSeerX 10.1.1.716.8148. doi:10.1007/s10822-006-9038-2. PMID 16783600. S2CID 21523436.
- Eckert H, Bajorath J (March 2007). "Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches". Drug Discovery Today. 12 (5–6): 225–33. doi:10.1016/j.drudis.2007.01.011. PMID 17331887.
- Willett P (December 2006). "Similarity-based virtual screening using 2D fingerprints" (PDF). Drug Discovery Today (Submitted manuscript). 11 (23–24): 1046–53. doi:10.1016/j.drudis.2006.10.005. PMID 17129822.
- Fara DC, Oprea TI, Prossnitz ER, Bologa CG, Edwards BS, Sklar LA (2006). "Integration of virtual and physical screening". Drug Discovery Today: Technologies. 3 (4): 377–385. doi:10.1016/j.ddtec.2006.11.003. PMC 7105924. PMID 38620118.
- Muegge I, Oloffa S (2006). "Advances in virtual screening". Drug Discovery Today: Technologies. 3 (4): 405–411. doi:10.1016/j.ddtec.2006.12.002. PMC 7105922. PMID 38620182.
- Schneider G (April 2010). "Virtual screening: an endless staircase?". Nature Reviews. Drug Discovery. 9 (4): 273–6. doi:10.1038/nrd3139. PMID 20357802. S2CID 205477076.
External links
- VLS3D: list of over 2000 databases, online and standalone in silico tools
Virtual screening
Fundamentals
Definition and Principles
Virtual screening (VS) is an in silico computational technique employed in drug discovery to identify potential bioactive compounds by evaluating large libraries of small molecules, or ligands, against biological targets such as proteins, predicting their ability to form favorable interactions. These libraries can encompass millions to billions of compounds, enabling the rapid assessment of chemical space far beyond what is feasible experimentally.[4][5] The foundational principles of VS revolve around predicting binding affinity, the strength of non-covalent interactions between a ligand and its target, to identify hits—compounds with a high likelihood of binding effectively—and facilitate subsequent lead optimization, where promising hits are refined into more potent drug candidates. Central to this process are molecular interactions such as hydrogen bonding, which involves the sharing of hydrogen atoms between electronegative atoms, and hydrophobic effects, where non-polar regions cluster to minimize exposure to water, stabilizing the ligand-target complex. Unlike high-throughput screening (HTS), which relies on physical assays to test compounds experimentally, VS is purely computational, offering significant reductions in time, cost, and resource demands while prioritizing targets with available structural data or known ligands.[4][5] A typical VS workflow begins with library preparation, where compound databases are curated for drug-likeness and converted into suitable formats for computation. This is followed by screening via predictive models to generate scores reflecting binding potential, ranking the compounds based on these scores to prioritize top candidates, and final hit selection through post-processing to ensure chemical diversity and synthetic feasibility before experimental validation. 
Analogous to molecular docking, which simulates ligand placement in a target's binding site, these steps provide a high-level framework for hit identification without requiring physical synthesis.[4]
Historical Development
The roots of virtual screening trace back to the foundations of computational chemistry in the mid-20th century, with quantitative structure-activity relationship (QSAR) models serving as an early precursor to ligand-based approaches. In 1964, Corwin Hansch and Toshio Fujita introduced the first systematic QSAR framework, correlating chemical structure with biological activity through linear free-energy relationships, which laid the groundwork for predicting ligand potency without direct experimental testing. This methodology evolved through the 1970s and 1980s amid advances in molecular modeling and database management, enabling initial computational searches of small compound libraries for potential drug candidates. By the late 1980s, these efforts had matured into rudimentary ligand-based screening techniques, focusing on similarity searches and basic pharmacophore mapping to identify compounds with desired structural features. The term "virtual screening" emerged in the late 1990s to describe these in silico approaches as analogs to experimental high-throughput screening.[6][7] A pivotal milestone occurred in the 1980s with the advent of structure-based methods, exemplified by the development of the DOCK program in 1982 by Irwin D. Kuntz and colleagues at the University of California, San Francisco. This algorithm pioneered automated docking by geometrically matching ligand atoms to receptor binding sites, allowing the virtual evaluation of thousands of molecules against protein structures derived from X-ray crystallography. The 1990s saw the rise of ligand-based virtual screening, driven by pharmacophore modeling software that identified common spatial arrangements of molecular features essential for activity, such as hydrogen bond donors and hydrophobic regions. 
Tools like Catalyst (introduced in 1990) facilitated 3D database searches, complementing emerging high-throughput experimental screening and accelerating hit identification in pharmaceutical research.[8]
Post-2000, virtual screening became integrated into industrial drug discovery pipelines, bolstered by high-performance computing that enabled screening of millions of compounds in days rather than years. The completion of the Human Genome Project in 2003 dramatically expanded the pool of viable drug targets, from fewer than 500 known proteins to thousands, fueling demand for efficient virtual tools to prioritize candidates.[9] Open-source contributions further democratized access, including AutoDock (first released in 1990 by Arthur Olson's group at Scripps Research Institute), which automated the docking of flexible ligands and adopted genetic-algorithm search in later versions, and RDKit (open-sourced in 2006 after development in the early 2000s), a cheminformatics toolkit supporting fingerprint-based similarity searches and descriptor generation for large-scale ligand-based screening.[10]
Around the 2010s, virtual screening underwent a paradigm shift from primarily rule-based and physics-driven methods to data-driven approaches, leveraging machine learning to refine predictions from vast datasets of binding affinities and structural information. This transition enhanced accuracy in handling diverse chemical spaces and reduced false positives, solidifying virtual screening as a standard, cost-effective complement to wet-lab experiments in pharma workflows.[11]
Methods
Ligand-Based Methods
Ligand-based methods in virtual screening leverage information from known active compounds to identify potential hits from large chemical databases through assessments of chemical similarity, pharmacophoric features, or predicted physicochemical properties, without necessitating the target's three-dimensional structure. These approaches are particularly valuable when structural data for the biological target is unavailable or unreliable, enabling the prioritization of compounds likely to exhibit similar binding behaviors based on the assumption that structurally or functionally analogous ligands share common interaction profiles. Early implementations focused on simple 2D similarity searching using fingerprints, but evolved to incorporate three-dimensional aspects for more accurate predictions of bioactivity.
Pharmacophore models form a cornerstone of ligand-based screening, defined as the three-dimensional arrangement of molecular features (such as hydrogen bond donors and acceptors, hydrophobic centers, aromatic rings, and positively or negatively ionizable groups) that are essential for ligand-target recognition and activity. These models are typically constructed by superimposing a set of known active ligands using techniques like least-squares fitting or clique detection algorithms to identify shared features, followed by validation against inactive compounds to refine specificity. A seminal example is the HipHop algorithm, introduced in the mid-1990s within the Catalyst software suite, which employs a hypothesis-driven approach to generate common-feature pharmacophores from multiple flexible ligand conformations, facilitating database querying for novel scaffolds that match the geometric and chemical constraints.
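The database-querying step described above can be illustrated with a deliberately simplified sketch. It assumes each compound has already been reduced to a single rigid conformation with typed feature points, and it tests only whether inter-feature distances agree with the model within a tolerance. The function names, feature labels, coordinates, and the 1 Å tolerance are all hypothetical choices for illustration; this is not the HipHop or Catalyst procedure.

```python
from itertools import permutations
from math import dist


def pair_distances(points):
    """Distances between all point pairs, in a fixed index order."""
    return [dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]


def matches_pharmacophore(model, candidate, tolerance=1.0):
    """Return True if some ordered subset of candidate features fits the model.

    model, candidate: lists of (feature_type, (x, y, z)) tuples.
    A fit requires identical feature types and every inter-feature distance
    within `tolerance` (angstroms) of the corresponding model distance.
    """
    model_types = [ftype for ftype, _ in model]
    model_d = pair_distances([xyz for _, xyz in model])
    # Brute force over ordered subsets; fine for a handful of features only.
    for subset in permutations(candidate, len(model)):
        if [ftype for ftype, _ in subset] != model_types:
            continue
        cand_d = pair_distances([xyz for _, xyz in subset])
        if all(abs(m - c) <= tolerance for m, c in zip(model_d, cand_d)):
            return True
    return False


# Hypothetical three-point model and two candidate compounds.
model = [("donor", (0, 0, 0)), ("acceptor", (3, 0, 0)), ("hydrophobe", (0, 4, 0))]
hit = [("hydrophobe", (1, 5, 1)), ("donor", (1, 1, 1)),
       ("acceptor", (4, 1, 1)), ("aromatic", (9, 9, 9))]
miss = [("donor", (0, 0, 0)), ("acceptor", (8, 0, 0)), ("hydrophobe", (0, 4, 0))]

print(matches_pharmacophore(model, hit))
print(matches_pharmacophore(model, miss))
```

Because only distances are compared, the check is independent of how a compound is positioned in space, but it also cannot tell a feature arrangement from its mirror image. Production tools additionally enumerate conformers and use far more efficient search strategies than this brute-force loop.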
Shape-based virtual screening emphasizes the geometric complementarity of molecular volumes, comparing query and database compounds via overlap metrics that approximate shapes with Gaussian functions or polyhedral representations to account for van der Waals surfaces. This method excels in identifying flexible ligands by generating conformational ensembles and optimizing alignments through combinatorial search algorithms, often outperforming 2D methods in scaffold-hopping scenarios where functional groups vary but overall topology is conserved. The ROCS (Rapid Overlay of Chemical Structures) software exemplifies this paradigm, utilizing Gaussian-based volumetric similarity scoring to rapidly screen millions of compounds, with demonstrated significant enrichment in prospective studies against diverse targets.[12]
Field-based virtual screening extends shape considerations by incorporating molecular interaction fields, aligning compounds based on similarities in electrostatic potentials, steric hindrance, and hydrophobic distributions, often represented as graphs or bitstring fingerprints for efficient matching. Field-graph matching techniques discretize these fields into nodes and edges to capture qualitative interaction patterns, enabling the detection of bioisosteric replacements. Similarity between aligned fields is quantified using the Tanimoto coefficient on binary fingerprints, given by
$$T(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
where $A$ and $B$ denote the bitsets of query and candidate fields, respectively; values approaching 1 indicate high congruence. Tools like FieldScreen apply this to prioritize diverse chemotypes with analogous field profiles.
Quantitative structure-activity relationship (QSAR) models support ligand-based screening by predicting binding affinities or activities from molecular descriptors, serving as filters to rank pharmacophore or shape matches. Two-dimensional QSAR employs topological indices, while three-dimensional variants like Comparative Molecular Field Analysis (CoMFA) probe steric and electrostatic fields at lattice points around aligned ligands, relating them to experimental potencies via partial least squares regression. A prototypical CoMFA equation might take the form
$$\text{Activity} = a\,S + b\,E + c$$
where $S$ and $E$ are steric and electrostatic descriptors, and $a$, $b$, $c$ are fitted coefficients; this approach has been instrumental in optimizing leads for potency, as validated in numerous kinase inhibitor series.
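The Tanimoto comparison of binary fingerprints used earlier in this section is simple enough to sketch without a cheminformatics library. The example below stores each fingerprint as a Python integer whose set bits mark the features present. The helper names, bit positions, and compound labels are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here, not a standard.

```python
def fingerprint(on_bits):
    """Pack an iterable of bit positions into a single integer bitset."""
    fp = 0
    for bit in on_bits:
        fp |= 1 << bit
    return fp


def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| of two integer bitsets."""
    common = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return common / union if union else 0.0


def rank_by_similarity(query, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical query and three-compound library.
query = fingerprint([1, 4, 7, 9])
library = {
    "cand_a": fingerprint([1, 4, 7, 12]),  # shares 3 of 5 distinct bits
    "cand_b": fingerprint([2, 3]),         # shares nothing with the query
    "cand_c": fingerprint([1, 4, 7, 9]),   # identical to the query
}

for name, score in rank_by_similarity(query, library):
    print(f"{name}: {score:.2f}")
```

Bitwise AND and OR followed by a population count are what make this comparison cheap enough to run against very large libraries. Real fingerprints typically have hundreds to thousands of bits rather than the handful used here.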
Structure-Based Methods
Structure-based methods in virtual screening leverage the three-dimensional atomic coordinates of the target biomolecule, typically a protein, to predict and evaluate potential ligand binding interactions. These coordinates are obtained from experimental techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, or computational approaches like homology modeling, which construct models based on sequence similarity to known structures. By incorporating the target's geometry and physicochemical properties, these methods enable the simulation of ligand placement within binding pockets, accounting for intermolecular forces like van der Waals, electrostatic, and hydrogen bonding interactions. This contrasts with ligand-based approaches by explicitly modeling target-ligand complementarity rather than relying solely on ligand properties.
Protein-ligand docking forms the cornerstone of structure-based virtual screening, involving the prediction of ligand orientations (poses) and binding affinities within the target's active site. In rigid docking, both the protein and ligand are treated as inflexible, which is computationally efficient but less accurate for dynamic systems; flexible docking, however, allows conformational adjustments in the ligand (and sometimes side chains in the protein) to better mimic physiological conditions. Scoring functions assess the quality of docked poses by estimating binding free energy, categorized as force-field-based (physics-derived, e.g., using AMBER or CHARMM parameters), empirical (fitted to experimental data), or knowledge-based (derived from statistical potentials). For instance, AutoDock employs an empirical scoring function that approximates the total binding energy as $\Delta G_{\text{bind}} = \Delta G_{\text{vdW}} + \Delta G_{\text{elec}} + \Delta G_{\text{hbond}} + \Delta G_{\text{desolv}}$, where the terms represent van der Waals, electrostatic, hydrogen bonding, and desolvation contributions, respectively, enabling rapid evaluation of thousands of compounds.
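The additive form of such a scoring function can be sketched as follows. The example assumes the four energy contributions for each pose have already been computed elsewhere and only shows how they are summed and how poses are ranked. The `PoseTerms` container, the pose names, and the numerical values (nominally in kcal/mol) are hypothetical and are not output from AutoDock or any other program.

```python
from dataclasses import dataclass


@dataclass
class PoseTerms:
    """Precomputed energy contributions for one docked pose (kcal/mol)."""
    vdw: float     # van der Waals
    elec: float    # electrostatic
    hbond: float   # hydrogen bonding
    desolv: float  # desolvation penalty or gain


def binding_score(terms):
    """Sum the four contributions into one estimated binding energy."""
    return terms.vdw + terms.elec + terms.hbond + terms.desolv


def best_pose(poses):
    """Return the name of the pose with the lowest (most favorable) score."""
    return min(poses, key=lambda name: binding_score(poses[name]))


# Hypothetical poses of one ligand in one binding site.
poses = {
    "pose_1": PoseTerms(vdw=-6.2, elec=-1.1, hbond=-0.8, desolv=1.5),
    "pose_2": PoseTerms(vdw=-4.0, elec=-2.5, hbond=-1.2, desolv=0.9),
}

for name, terms in poses.items():
    print(f"{name}: {binding_score(terms):.1f} kcal/mol")
print("best:", best_pose(poses))
```

In practice each term carries a fitted weight and is itself a sum over atom pairs, so the expensive part of docking is computing the terms for many poses, not combining them.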
Key algorithms in docking employ stochastic search techniques to explore the vast conformational space efficiently. Genetic algorithms (GAs), inspired by evolutionary processes, iteratively evolve populations of ligand poses through selection, crossover, and mutation to optimize scoring; Monte Carlo simulations, conversely, use random sampling with Metropolis criteria to escape local minima. Prominent software implementations include Glide, which uses a hierarchical filtering approach with an OPLS force field for high-throughput screening, achieving success rates above 70% in pose prediction for diverse targets, and GOLD, which applies GAs with multiple scoring functions like GoldScore (force-field-based) or ChemScore (empirical) to handle ligand flexibility. Binding site identification precedes docking, often via geometric algorithms that detect cavities or pockets using tools like fpocket or CASTp, prioritizing sites with druggability scores based on enclosure and hydrophobicity.
Post-docking analysis refines initial results to improve hit identification. Consensus scoring combines ranks or scores from multiple functions (e.g., averaging AutoDock and Glide outputs) to reduce false positives, improving enrichment factors by 2- to 5-fold over single scoring functions in benchmarks. Rescoring with more rigorous methods, such as molecular mechanics Poisson-Boltzmann surface area (MM-PBSA), further evaluates top poses for energetic accuracy. Finally, hits are filtered for absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties using predictive models, ensuring viable leads for experimental validation.
Hybrid Methods
Hybrid methods in virtual screening integrate ligand-based and structure-based techniques to leverage their complementary strengths, thereby enhancing prediction robustness and minimizing false positives. A typical workflow begins with ligand-based filtering, such as pharmacophore matching or shape similarity searches, to rapidly triage large compound libraries, followed by structure-based refinement via molecular docking to assess binding poses and affinities more precisely. This sequential synergy allows for efficient enrichment of potential hits while compensating for the limitations of individual paradigms, such as the lack of structural context in ligand-based methods alone.[14]
Pharmacophore-constrained docking exemplifies a specific hybrid approach, where pharmacophore models, derived from known ligands or receptor sites, guide pose generation and scoring during docking to enforce critical interactions like hydrogen bonds and hydrophobic contacts. In this method, docking programs generate multiple poses per compound without initial scoring, which are then filtered using receptor-based pharmacophores, achieving up to 95% reduction in decoys while retaining approximately 80% of actives in benchmarks on targets like neuraminidase and CDK2.[14] The PharmDock program implements this by optimizing protein-derived pharmacophores for both sampling and ranking, demonstrating improved bioactive pose identification in virtual screening applications.[15] Similarly, multi-objective scoring functions combine ligand-based metrics, such as shape similarity, with structure-based binding energy estimates to provide a holistic evaluation, as seen in hybrid workflows that yield high enrichment factors on diverse targets.[16]
Receptor-based pharmacophore modeling further illustrates hybrid integration by extracting pharmacophore features directly from the protein binding pocket, capturing key interaction sites for subsequent virtual screening. Workflows like Apo2ph4 generate these models from apo or holo protein structures, enabling the rapid identification of pocket-compatible compounds that can then be refined through docking.[17] Ensemble docking hybrids address target flexibility by simulating ligands against multiple protein conformations, often incorporating ligand-based biases; for example, the LigBEnD method uses atomic property fields from known ligands to weight docking scores, achieving over 80% accuracy in pose prediction within 2 Å RMSD for nuclear receptor targets.[18]
These hybrid strategies offer enhanced coverage for targets with incomplete ligand or structural data, facilitating more reliable hit identification across challenging systems. In the context of HIV-1 protease inhibitors, a multistage hybrid pipeline combining pharmacophore modeling, shape similarity, and docking screened 260,000 compounds from the NCI database, yielding two novel micromolar inhibitors (IC50 values of 62 μM and 162 μM) with an enrichment factor exceeding 465.[19]
Computational Infrastructure
Ligand-Based Approaches
Ligand-based virtual screening relies on computational resources optimized for rapid processing of molecular descriptors and similarity computations, rather than intensive simulations. Hardware requirements emphasize multi-core CPUs for fingerprint generation and similarity searches, with GPUs accelerating matrix operations in large-scale fingerprint comparisons. For instance, tools like PyRMD operate efficiently on modern workstations with at least 4 GB RAM for basic tasks, but screening extensive libraries necessitates higher memory to handle descriptor storage without frequent disk I/O.[20] When processing PubChem-scale databases exceeding 100 million compounds, memory demands typically reach tens of GB of RAM, depending on fingerprint dimensionality and database indexing strategies, to enable in-memory similarity matching and avoid performance bottlenecks.[21]
Software infrastructure for ligand-based approaches centers on cheminformatics libraries that facilitate descriptor computation and database querying. Open-source tools such as RDKit provide robust capabilities for generating molecular fingerprints and performing Tanimoto similarity searches, forming the backbone of many screening pipelines.[22] OpenBabel complements these by handling diverse file formats and preprocessing structures for input into similarity algorithms.[23] Commercial platforms, including Schrödinger's Virtual Screening Web Service, offer integrated environments for ligand scouting with advanced pharmacophore and shape-based filtering, enabling seamless workflow automation.[24]
Scalability in ligand-based virtual screening is achieved through parallelization techniques tailored to distributed environments. Message Passing Interface (MPI) enables high-level parallelization for similarity matching across clusters, distributing database subsets to multiple nodes for concurrent querying and achieving near-linear speedups on thousands of cores.[25] Cloud computing platforms like AWS support batch processing of millions of compounds, leveraging elastic resources for cost-effective ultra-large library exploration.
Optimization strategies focus on reducing computational overhead while preserving chemical information. Extended-connectivity fingerprints (ECFP), such as ECFP4 with 2,048 bits, balance descriptor richness and efficiency by encoding topological features circularly, allowing rapid similarity calculations via bitwise operations.[26] Dimensionality reduction techniques, including feature selection or hashing, further accelerate searches by minimizing vector comparisons, particularly for diverse libraries where subsampling ensures representation of chemical space without exhaustive enumeration.[27]
Structure-Based Approaches
Structure-based virtual screening imposes significantly higher computational demands than ligand-based approaches due to its reliance on physics-based simulations, such as molecular docking and dynamics, which require detailed modeling of protein-ligand interactions.[28] High-end graphics processing units (GPUs) are essential for accelerating these calculations, particularly through NVIDIA CUDA-enabled frameworks that parallelize the exhaustive search of conformational spaces during docking.[29] For instance, GPU-optimized docking can reduce computation times for large libraries by up to 10-fold compared to CPU-only systems, enabling the processing of millions of compounds in feasible timeframes. Additionally, substantial storage resources are necessary, often at the terabyte scale for ultra-large libraries, to handle protein structure models, ligand databases, and output trajectories from ensemble-based runs that account for protein flexibility.
Key software tools for structure-based virtual screening include docking suites like AutoDock Vina and DOCK, which employ scoring functions to predict binding affinities and poses. AutoDock Vina, for example, leverages multithreading and empirical scoring to achieve up to 60-fold speed improvements over earlier versions, making it suitable for high-throughput applications.[30] DOCK facilitates flexible ligand docking within receptor binding sites, supporting anchor-and-grow strategies for efficient exploration of chemical space. These docking tools are often integrated with molecular dynamics software such as AMBER for post-docking refinement, where simulations stabilize predicted complexes and assess binding stability over time.
To achieve scalability, structure-based virtual screening commonly employs grid computing or high-performance computing (HPC) clusters, distributing docking tasks across multiple nodes for parallel execution.[31] For exhaustive searches, such as docking one million compounds against a target, computations may require several days on a cluster of 100 cores, highlighting the need for optimized resource allocation in shared HPC environments.[32] Platforms like EXSCALATE demonstrate extreme-scale capabilities by scaling to full supercomputers, processing billions of compounds through distributed workflows.[33]
Optimization strategies mitigate the inherent complexity of these simulations, including incremental docking approaches that build ligand poses stepwise to reduce search space dimensionality.[4] Virtual screening cascades further enhance efficiency by applying sequential filters (such as initial pharmacophore matching followed by refined docking), prioritizing promising candidates and minimizing full computations on low-affinity molecules.[4] These techniques collectively manage the trade-off between accuracy and throughput in resource-intensive structure-based pipelines.[34]
Accuracy and Validation
Evaluation Metrics
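As a concrete reference point for two of the metrics discussed in this section, the enrichment factor and ROC-AUC, the following minimal sketch computes both from a ranked list of activity labels. It assumes the list is ordered best score first, that labels are 1 for actives and 0 for inactives or decoys, and that there are no tied scores. The function names and the example ranking are hypothetical.

```python
def enrichment_factor(ranked_labels, fraction):
    """EF for the top `fraction` of a ranked list of 0/1 activity labels."""
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(n_total * fraction))
    hits_top = sum(ranked_labels[:n_top])
    # (hits_top / n_top) / (n_actives / n_total), rearranged to one division
    return hits_top * n_total / (n_top * n_actives)


def roc_auc(ranked_labels):
    """ROC-AUC as the fraction of active/inactive pairs ranked correctly."""
    n_actives = sum(ranked_labels)
    n_inactives = len(ranked_labels) - n_actives
    if n_actives == 0 or n_inactives == 0:
        raise ValueError("need at least one active and one inactive")
    inactives_below = n_inactives
    correct_pairs = 0
    for label in ranked_labels:
        if label:
            correct_pairs += inactives_below  # inactives ranked below this active
        else:
            inactives_below -= 1
    return correct_pairs / (n_actives * n_inactives)


# Hypothetical screen: 100 compounds, 10 actives, 5 of them in the top 10.
ranked = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85

print(enrichment_factor(ranked, 0.10))
print(round(roc_auc(ranked), 3))
```

With half of the actives recovered in the top 10% of a library that is 10% active overall, the enrichment factor is 5, while the ROC-AUC also credits the five actives ranked just outside the top 10%. This difference is why early-recognition metrics are reported alongside ROC-AUC.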
The performance of virtual screening methods is assessed using quantitative metrics that evaluate their ability to prioritize active compounds over inactives, with a particular emphasis on early recognition given the vast scale of screened libraries.[35] These metrics provide standardized tools for validating computational outputs prior to experimental follow-up, enabling fair comparisons across methods.[36]
A primary metric is the enrichment factor (EF), which quantifies the degree to which actives are concentrated in the top-ranked fraction of results compared to random selection. The formula for EF at a given rank fraction (e.g., top 1% or 5%) is
$$\mathrm{EF}_{x\%} = \frac{n_{\text{actives},\,x\%} / N_{x\%}}{n_{\text{actives},\,\text{total}} / N_{\text{total}}}$$
where the numerator is the fraction of actives among the $N_{x\%}$ top-ranked compounds and the denominator is the fraction of actives in the whole library; values greater than 1 indicate successful enrichment.[35] Another key measure is the area under the receiver operating characteristic curve (ROC-AUC), which plots the true positive rate against the false positive rate across all thresholds and yields a value between 0 and 1, with 0.5 representing random performance and higher values indicating better overall discrimination.[35] To address limitations in ROC-AUC for prioritizing early hits, the Boltzmann-enhanced discrimination of ROC (BEDROC) applies exponential weighting to emphasize rankings at the list's beginning, producing a score bounded between 0 and 1 that balances statistical rigor with early recognition sensitivity.[35]
Additional classification-based measures include sensitivity (the proportion of true actives correctly identified), specificity (the proportion of true inactives correctly excluded), and the Matthews correlation coefficient (MCC), which provides a balanced score from -1 to 1 accounting for true and false positives/negatives, with 0 indicating random classification.[37] Hit rates (fraction of actives recovered) and false positive rates are commonly reported in benchmarks like the Directory of Useful Decoys, Enhanced (DUD-E), where they highlight method efficacy against challenging inactives.[38]
Validation protocols rely on decoy sets to simulate real screening scenarios, such as DUD-E's collection of 102 targets with 22,886 actives and over 1.4 million property-matched decoys generated via ZINC to ensure physicochemical similarity but topological dissimilarity (using ECFP4 fingerprints).[38] In ligand-based approaches like quantitative structure-activity relationship (QSAR) modeling, k-fold cross-validation divides data into training and test subsets iteratively to assess generalizability and prevent overfitting.[39] For benchmarking, standardized datasets such as DUD-E and DEKOIS 2.0 enable comparative evaluation of workflows, with DEKOIS 2.0 providing 81 benchmark sets for 80 protein targets, 18,197 actives, and 1,121,074 decoys optimized for docking tests through property matching and diversity filters.[38][40] These resources facilitate the application of metrics like EF and BEDROC to quantify performance across diverse protein families.[40]
Challenges and Limitations
Virtual screening (VS) encounters significant technical challenges, particularly in structure-based methods where conformational sampling errors during molecular docking can lead to inaccurate predictions of ligand binding poses. These errors arise from the limited exploration of ligand and protein conformational space, often resulting in suboptimal binding modes that deviate from experimental structures by more than 2 Å RMSD.[41] Target flexibility further complicates docking, as proteins can undergo induced-fit adaptations upon ligand binding, requiring advanced ensemble docking or molecular dynamics simulations to account for multiple receptor states, yet these approaches remain computationally demanding and imperfect.[42] Additionally, the effects of water molecules in the binding site are frequently underrepresented, leading to overestimated binding affinities since explicit solvation models are rarely feasible at scale.[41]
In ligand-based methods, descriptor inaccuracies pose a core limitation, as molecular descriptors used in QSAR models often fail to capture subtle electronic or steric features critical for activity prediction, with standard deviations in binding affinity estimates reaching 1-2 kcal/mol.[42] These inaccuracies stem from the empirical nature of many descriptors, which may not generalize across diverse chemical spaces.
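The 2 Å RMSD criterion mentioned above for judging docked poses can be made concrete with a short sketch. It assumes the predicted and reference ligand coordinates share the same atom ordering and the same receptor frame, so no superposition is performed, and it ignores symmetry-equivalent atoms, which dedicated tools correct for. The function names and coordinates are hypothetical.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return sqrt(squared / len(coords_a))


def pose_is_near_native(predicted, reference, cutoff=2.0):
    """Apply the conventional RMSD cutoff (angstroms) to a docked pose."""
    return rmsd(predicted, reference) <= cutoff


# Hypothetical three-atom ligand: crystal pose and two docked poses.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
good_pose = [(x + 1.0, y, z) for x, y, z in reference]  # shifted by 1 angstrom
bad_pose = [(x + 3.0, y, z) for x, y, z in reference]   # shifted by 3 angstroms

print(rmsd(good_pose, reference), pose_is_near_native(good_pose, reference))
print(rmsd(bad_pose, reference), pose_is_near_native(bad_pose, reference))
```

A rigid translation of every atom by the same distance gives an RMSD equal to that distance, which makes the two example poses easy to check by hand.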
Data-related issues undermine the reliability of VS models, including biases in training sets where certain chemotypes, such as benzodiazepines or kinase inhibitors, are overrepresented, skewing predictions toward familiar scaffolds and reducing novelty in hit identification.[43] Activity cliffs exacerbate this, occurring when structurally similar compounds exhibit large potency differences (e.g., >100-fold), challenging QSAR models to interpolate accurately and contributing to high prediction errors in cliff-rich regions of chemical space.[44]
Practical limitations include the generation of false positives due to approximations in scoring functions, which prioritize speed over precision and often rank non-binders highly, necessitating extensive experimental follow-up that can consume 20-50% of screening budgets.[42] Scalability versus accuracy trade-offs are inherent, as high-throughput docking of million-compound libraries requires simplified models that sacrifice detailed physics-based simulations. Regulatory hurdles in pharmaceutical validation also persist, complicating the acceptance of in silico hits without orthogonal experimental validation. Furthermore, post-2020 developments in covalent inhibitors highlight outdated aspects of traditional VS pipelines, which struggle with reactivity modeling and warhead positioning, as covalent docking tools lag behind the rising prominence of irreversible binders like those targeting SARS-CoV-2 proteases.[45]
As of 2025, ongoing advancements include the integration of machine learning for improved validation metrics, such as AI-driven enrichment assessments in ultra-large library screenings, enhancing overall accuracy in diverse targets.[3]
Applications
In Drug Discovery
Virtual screening plays a pivotal role in the early stages of drug discovery pipelines by enabling the rapid identification of potential hit compounds from vast chemical libraries, typically comprising millions to billions of molecules. In hit identification, computational methods such as docking or pharmacophore modeling are applied to screen libraries of 10^6 to 10^8 compounds, prioritizing those with favorable binding predictions for subsequent experimental validation, often yielding 50-200 hits for wet-lab testing.[46] This process significantly narrows the search space compared to traditional high-throughput screening, allowing researchers to focus resources on promising candidates. During lead optimization, iterative virtual screening refines these hits by incorporating structure-activity relationship data and molecular dynamics simulations, guiding the design of analogs with improved potency and selectivity.[47]
Notable case studies illustrate the practical impact of virtual screening in identifying therapeutic leads. In 2020, structure-based virtual screening targeted the SARS-CoV-2 main protease, screening a library of 235 million compounds to identify three initial inhibitors with micromolar IC₅₀ values, which were further optimized to nanomolar potency and demonstrated broad-spectrum activity against coronaviruses including SARS-CoV-2, SARS-CoV-1, and MERS-CoV.[46] Similarly, a historical ligand-based virtual screening effort in 2010 combined pharmacophore modeling with docking to discover novel glycogen synthase kinase-3β (GSK-3β) inhibitors, such as 2-anilino-5-phenyl-1,3,4-oxadiazole derivatives, exhibiting nanomolar affinity, selectivity over CDK2, and in vivo efficacy in increasing liver glycogen accumulation.[48]
The economic advantages of virtual screening stem from its ability to reduce the time and cost of drug discovery by minimizing reliance on resource-intensive wet-lab assays; for instance, it can significantly decrease the number of compounds requiring physical synthesis and testing, accelerating the path from hit to clinical candidate.[49] In drug repurposing, virtual screening has proven invaluable, as seen in the 2021 identification of repurposed inhibitors for SARS-CoV-2's main protease and RNA-dependent RNA polymerase from a library of 6,218 approved drugs, yielding seven cell-active hits including omipalisib, which showed 200-fold greater potency than remdesivir in human lung cells and synergistic effects in combinations.[50] Post-2020 applications have expanded to AI-assisted virtual screening for rare diseases, where machine learning models enhance hit prediction accuracy to 80-90%.[51]
In Other Scientific Fields
Virtual screening has been adapted to agrochemical discovery, where it facilitates the identification of novel pesticides and herbicides by targeting specific enzymes in target organisms. For instance, structure-based virtual screening combined with molecular docking has been employed to discover inhibitors of acetolactate synthase (ALS), a key enzyme in branched-chain amino acid biosynthesis in plants, leading to the development of novel non-sulfonylurea herbicides that effectively control weeds while minimizing off-target effects.[52] Similarly, machine learning-enhanced virtual screening platforms have been developed to predict herbicide-likeness and screen large chemical libraries for compounds inhibiting ALS, resulting in candidates with improved potency and reduced environmental persistence compared to traditional methods.[53] These applications demonstrate how virtual screening accelerates the discovery of mode-of-action-specific agrochemicals, addressing challenges like herbicide resistance.[54]
In materials science, virtual screening supports the rational design of ligands for catalysts and sensors by evaluating binding affinities and properties across vast chemical spaces. High-throughput computational screening has been used to identify optimal organic linkers for metal-organic frameworks (MOFs), enabling the discovery of structures with enhanced performance for gas storage and separation.[55] For sensors, computational approaches predict interactions between MOF pores and target analytes, facilitating the development of selective gas sensors.[56] Molecular docking simulations further refine these designs by assessing ligand-framework stability, as seen in screenings that prioritize ligands for robust, tunable MOF-based catalysts.[57]
Environmental applications leverage virtual screening to identify compounds or enzymes that degrade pollutants, promoting bioremediation strategies. In silico docking and pharmacophore modeling have been applied to screen potential substrates for laccase enzymes, which oxidize phenolic pollutants like dyes and pesticides, predicting degradation pathways and binding energies to guide enzyme engineering for wastewater treatment.[58] Structure-based virtual screening has also identified variants of cytochrome P450 enzymes (e.g., CYP120A1) with enhanced thermostability and activity against sulfonamide antibiotics, enabling more efficient microbial bioremediation of contaminated soils.[59] These approaches reduce experimental trial-and-error, focusing on inhibitors or activators that accelerate pollutant breakdown into non-toxic byproducts.[60]
Emerging uses of virtual screening extend to toxicology prediction and food safety. In toxicology, ensemble-based virtual screening models predict compound toxicity by integrating molecular descriptors and machine learning, filtering out hazardous candidates early in chemical design with improved predictive performance.[61] For food safety, computational screening has been applied to identify potential therapeutic peptides.[62]
Advances and Future Directions
Machine Learning Integration
Machine learning has been integrated into virtual screening to enhance the prediction of molecular activities by learning complex patterns from chemical datasets, surpassing traditional rule-based methods in handling high-dimensional data. Supervised learning approaches, such as random forests and neural networks applied to molecular graphs, enable accurate classification and regression of binding affinities and bioactivities. For instance, random forests aggregate multiple decision trees to predict compound efficacy, achieving enrichment factors up to 20-fold in hit identification compared to random selection. Unsupervised methods, like clustering on descriptor spaces, aid in exploring chemical space for novel leads.[63] Substructural analysis leverages fragment-based machine learning to pinpoint bioactive motifs within molecules, facilitating the identification of key pharmacophores. Techniques such as support vector machines trained on fragment descriptors have successfully isolated motifs responsible for target inhibition, as demonstrated in inhibitor discovery for calcium and integrin-binding protein 1 (CIB1), where ML-driven fragment screening yielded novel ligands with confirmed binding affinities in the micromolar range. Scaffold hopping, which replaces core structures while preserving activity, is advanced by graph neural networks (GNNs) that encode molecular topologies as graphs, propagating features across atoms to generate analogous scaffolds.[64] Recursive partitioning, a foundational ensemble technique in quantitative structure-activity relationship (QSAR) modeling, builds decision trees on molecular descriptors to classify compounds iteratively. Random forests extend this by averaging predictions from numerous trees, reducing overfitting and enhancing robustness in virtual screening. 
In 2023 and 2024, ensemble learning approaches, including boosting and stacking, have been applied in virtual screening for drug discovery, combining multiple machine learning models to improve prediction accuracy and outperform single models in identifying potential drug candidates from large chemical libraries, particularly in ligand-based and structure-based virtual screening. Node splitting in these trees minimizes impurity measures, such as the Gini index, defined as where represents the proportion of instances in class among classes; the optimal split selects the descriptor threshold that maximizes the reduction in weighted Gini impurity across child nodes.[63] Deep learning advances have transformed virtual screening through convolutional neural networks (CNNs) that process molecular fields as image-like representations, capturing spatial interactions for scoring functions. Models like Gnina employ CNNs for pose prediction and affinity estimation, outperforming traditional docking in success rates by 10-20% on diverse targets. Transformer-based models, such as ChemBERTa pretrained on over 77 million SMILES strings via self-supervised learning, excel in property prediction tasks relevant to screening, achieving ROC-AUC scores of 0.78-0.84 on MoleculeNet datasets like Tox21 and HIV, with performance scaling logarithmically with pretraining data size. To address imbalanced datasets common in virtual screening—where actives are rare—techniques like oversampling and focal loss have been integrated, boosting precision by up to 30% in hit enrichment.[65][66] Post-2020 developments include generative models for de novo design, which synthesize novel molecules conditioned on desired properties, expanding the screened chemical space beyond existing libraries. 
Variational autoencoders, generative adversarial networks (GANs), and reinforcement-learning-driven generators have produced drug-like candidates with optimized pharmacokinetics; REINVENT, for example, produced more synthesizable leads than random enumeration while maintaining target affinity. These models fit into virtual screening pipelines by prioritizing generated compounds for docking, which reduces experimental costs. As of 2025, diffusion models have further advanced this area, enabling high-fidelity 3D molecular generation conditioned on protein targets and improving lead optimization efficiency.[67][68]
Emerging Technologies and Trends
Quantum computing is emerging as a potentially transformative technology for virtual screening, particularly in enhancing the accuracy of energy calculations during molecular docking. Algorithms such as the variational quantum eigensolver (VQE) aim to enable precise computation of binding free energies by leveraging quantum superposition to model complex molecular interactions that classical computers struggle with due to exponential scaling.[69] This approach could improve structure-based virtual screening by providing quantum-accurate simulations of protein-ligand binding, potentially accelerating hit identification in drug discovery pipelines.[70] Early applications have demonstrated VQE's feasibility for small-molecule systems, with ongoing research focusing on scaling to larger biomolecular complexes.[71]
Advancements in artificial intelligence are further propelling virtual screening through specialized generative models and privacy-preserving frameworks. Generative adversarial networks (GANs) facilitate de novo library design by generating diverse, drug-like molecules that optimize desired properties, such as binding affinity, while exploring vast chemical spaces more efficiently than traditional enumeration methods.[72] For instance, GAN-based architectures have been optimized to produce chemically valid structures, addressing training challenges like mode collapse and enabling targeted lead optimization.[73] Complementing this, federated learning allows institutions to train on proprietary datasets without centralizing sensitive information, fostering collaborative virtual screening for drug discovery while maintaining data privacy through decentralized model updates.[74] Initiatives like the MELLODDY consortium exemplify this, integrating ADME-Tox predictions from multiple pharmaceutical partners to enhance screening accuracy.[75]
Key trends in virtual screening include deeper integration with experimental structural biology and efforts toward sustainable computing practices. The 2017 Nobel Prize in Chemistry for cryo-electron microscopy (cryo-EM) has catalyzed the technique's synergy with computational methods, providing high-resolution structures of challenging targets like membrane proteins to inform more reliable docking and screening campaigns.[76] The expansion of cryo-EM since then has improved structure quality for virtual screening, enabling better prediction of ligand poses in dynamic complexes.[77] Blockchain technology supports secure collaborations by enabling tamper-proof sharing of screening results and intellectual property in distributed networks, reducing risks in multi-party drug discovery efforts.[74] Additionally, sustainability initiatives in high-performance computing (HPC) address the environmental footprint of large-scale virtual screening, with green HPC strategies optimizing energy efficiency through workload-aware scheduling and renewable-powered data centers to minimize carbon emissions from intensive simulations.[78]
Looking toward the 2030s, virtual screening is poised for real-time applications in personalized medicine, where AI-driven platforms could dynamically tailor compound libraries to individual genomic profiles for rapid hit selection.[79] Post-2023 innovations, such as diffusion models for molecular generation, are steps in this direction, enabling the generation of 3D drug-like molecules conditioned on target structures and enhancing virtual screening's ability to explore novel chemical spaces with high fidelity.[68] These models, including target-aware variants, have shown promise in generating pharmacophore-aligned ligands, potentially streamlining lead optimization and supporting on-demand screening in clinical settings by the decade's end.[80] Overall, these trajectories emphasize hybrid quantum-AI systems and ethical data practices as cornerstones for scalable, impactful virtual screening.[81]
References
- https://doi.org/10.1002/(SICI)1096-987X(19981115)19:14<1639::AID-JCC10>3.0.CO;2-B
