Weighted correlation network analysis
Weighted correlation network analysis, also known as weighted gene co-expression network analysis (WGCNA), is a widely used data mining method, especially for studying biological networks, based on pairwise correlations between variables. While it can be applied to most high-dimensional data sets, it has been most widely used in genomic applications. It allows one to define modules (clusters), intramodular hubs, and network nodes with regard to module membership, to study the relationships between co-expression modules, and to compare the network topology of different networks (differential network analysis). WGCNA can be used as a data reduction technique (related to oblique factor analysis), as a clustering method (fuzzy clustering), as a feature selection method (e.g. as a gene screening method), as a framework for integrating complementary (genomic) data (based on weighted correlations between quantitative variables), and as a data exploratory technique.[1] Although WGCNA incorporates traditional data exploratory techniques, its network language and analysis framework go beyond any single standard analysis technique. Since it uses network methodology and is well suited for integrating complementary genomic data sets, it can be interpreted as a systems biologic or systems genetic data analysis method. By selecting intramodular hubs in consensus modules, WGCNA also gives rise to network-based meta-analysis techniques.[2]
History
The WGCNA method was developed by Steve Horvath, a professor of human genetics at the David Geffen School of Medicine at UCLA and of biostatistics at the UCLA Fielding School of Public Health, together with his colleagues at UCLA and (former) lab members, in particular Peter Langfelder, Bin Zhang, and Jun Dong. Much of the work arose from collaborations with applied researchers. In particular, weighted correlation networks were developed in joint discussions with cancer researchers Paul Mischel and Stanley F. Nelson and neuroscientists Daniel H. Geschwind and Michael C. Oldham, according to the acknowledgement section of [1].
Comparison between weighted and unweighted correlation networks
A weighted correlation network can be interpreted as a special case of a weighted network, dependency network or correlation network. Weighted correlation network analysis can be attractive for the following reasons:
- The network construction (based on soft thresholding the correlation coefficient) preserves the continuous nature of the underlying correlation information. For example, weighted correlation networks that are constructed on the basis of correlations between numeric variables do not require the choice of a hard threshold. Dichotomizing information and (hard)-thresholding may lead to information loss.[3]
- The network construction gives highly robust results with respect to different choices of the soft threshold.[3] In contrast, results based on unweighted networks, constructed by thresholding a pairwise association measure, often strongly depend on the threshold.
- Weighted correlation networks facilitate a geometric interpretation based on the angular interpretation of the correlation (chapter 6 in [4]).
- Resulting network statistics can be used to enhance standard data-mining methods such as cluster analysis, since (dis)similarity measures can often be transformed into weighted networks[5] (see chapter 6 in [4]).
- WGCNA provides powerful module preservation statistics which can be used to quantify similarity to another condition. Module preservation statistics also allow one to study differences between the modular structure of networks.[6]
- Weighted networks and correlation networks can often be approximated by "factorizable" networks.[4][7] Such approximations are often difficult to achieve for sparse, unweighted networks. Therefore, weighted (correlation) networks allow for a parsimonious parametrization in terms of modules and module membership (chapters 2 and 6 in [1]; see also [8]).
Method
First, one defines a gene co-expression similarity measure which is used to define the network. We denote the gene co-expression similarity measure of a pair of genes $i$ and $j$ by $s_{ij}$. Many co-expression studies use the absolute value of the correlation as an unsigned co-expression similarity measure,

$$s_{ij}^{\text{unsigned}} = |\operatorname{cor}(x_i, x_j)|,$$

where the gene expression profiles $x_i$ and $x_j$ consist of the expression of genes $i$ and $j$ across multiple samples. However, using the absolute value of the correlation may obscure biologically relevant information, since no distinction is made between gene repression and activation. In contrast, in signed networks the similarity between genes reflects the sign of the correlation of their expression profiles. Various transformation (or scaling) approaches can be considered if a signed co-expression measure between gene expression profiles $x_i$ and $x_j$ is needed. For example, one can linearly scale the correlations to lie within the range $[0, 1]$ as follows:

$$s_{ij}^{\text{signed}} = 0.5 + 0.5 \operatorname{cor}(x_i, x_j).$$

Like the unsigned measure $s_{ij}^{\text{unsigned}}$, the signed similarity $s_{ij}^{\text{signed}}$ takes on a value between 0 and 1. Note that the unsigned similarity between two oppositely expressed genes ($\operatorname{cor}(x_i, x_j) = -1$) equals 1, while the signed similarity equals 0. Similarly, while the unsigned co-expression measure of two genes with zero correlation remains zero, the signed similarity equals 0.5.
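The two similarity measures can be illustrated with a minimal Python sketch using NumPy. The function name `coexpression_similarity`, the samples-by-genes layout, and the toy data are assumptions made for illustration only; they are not part of the WGCNA R package.

```python
import numpy as np

def coexpression_similarity(expr, signed=False):
    """Return a genes x genes similarity matrix with values in [0, 1].

    expr is assumed to be a samples x genes array (an assumption of this sketch).
    """
    cor = np.corrcoef(expr, rowvar=False)  # Pearson correlation between columns
    if signed:
        return 0.5 + 0.5 * cor  # linear rescaling of [-1, 1] onto [0, 1]
    return np.abs(cor)          # unsigned: the sign of the correlation is discarded

# Toy data: gene 1 mirrors gene 0, gene 2 is uncorrelated with gene 0.
x0 = np.array([1.0, 2.0, 3.0, 4.0])
expr = np.column_stack([x0, -x0, np.array([1.0, -1.0, -1.0, 1.0])])

s_unsigned = coexpression_similarity(expr)
s_signed = coexpression_similarity(expr, signed=True)
```

For the oppositely expressed pair (genes 0 and 1) the unsigned similarity is 1 and the signed similarity is 0; for the uncorrelated pair (genes 0 and 2) the values are 0 and 0.5, matching the cases discussed above.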
Next, an adjacency matrix (network), $A = [a_{ij}]$, is used to quantify how strongly genes are connected to one another. $A$ is defined by thresholding the co-expression similarity matrix $S = [s_{ij}]$. 'Hard' thresholding (dichotomizing) the similarity measure $S$ results in an unweighted gene co-expression network. Specifically, an unweighted network adjacency is defined to be 1 if $s_{ij} > \tau$ and 0 otherwise. Because hard thresholding encodes gene connections in a binary fashion, it can be sensitive to the choice of the threshold and result in the loss of co-expression information.[3] The continuous nature of the co-expression information can be preserved by employing soft thresholding, which results in a weighted network. Specifically, WGCNA uses the following power function to assess connection strength:

$$a_{ij} = (s_{ij})^{\beta},$$

where the power $\beta$ is the soft thresholding parameter. The default values $\beta = 6$ and $\beta = 12$ are used for unsigned and signed networks, respectively. Alternatively, $\beta$ can be chosen using the scale-free topology criterion, which amounts to choosing the smallest value of $\beta$ such that approximate scale-free topology is reached.[3]

Since $\log(a_{ij}) = \beta \log(s_{ij})$, the weighted network adjacency is linearly related to the co-expression similarity on a logarithmic scale. Note that a high power $\beta$ transforms high similarities into high adjacencies, while pushing low similarities towards 0. Since this soft-thresholding procedure applied to a pairwise correlation matrix leads to a weighted adjacency matrix, the ensuing analysis is referred to as weighted gene co-expression network analysis.
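To make the contrast between hard and soft thresholding concrete, the following Python sketch applies both to a small hand-written similarity matrix. The function names and the threshold value 0.6 are hypothetical choices for this example; only the power function and the default $\beta = 6$ come from the text.

```python
import numpy as np

def hard_threshold_adjacency(similarity, tau):
    """Unweighted network: 1 where the similarity exceeds tau, 0 otherwise."""
    return (similarity > tau).astype(float)

def soft_threshold_adjacency(similarity, beta=6):
    """Weighted network: similarity raised to the power beta."""
    return similarity ** beta

# Hand-written similarity matrix for three genes (illustrative values).
s = np.array([[1.0, 0.9, 0.5],
              [0.9, 1.0, 0.1],
              [0.5, 0.1, 1.0]])

a_hard = hard_threshold_adjacency(s, tau=0.6)  # tau = 0.6 is an arbitrary choice
a_soft = soft_threshold_adjacency(s, beta=6)
```

Here the hard threshold keeps only the 0.9 link and discards the others outright, while the soft threshold keeps every pair and shrinks 0.5 to $0.5^6 = 0.015625$, so weak similarities remain in the network with very small weights.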
A major step in the module-centric analysis is to cluster genes into network modules using a network proximity measure. Roughly speaking, a pair of genes has a high proximity if it is closely interconnected. By convention, the maximal proximity between two genes is 1 and the minimum proximity is 0. Typically, WGCNA uses the topological overlap measure (TOM) as proximity,[9][10] which can also be defined for weighted networks.[3] The TOM combines the adjacency of two genes and the connection strengths these two genes share with other "third party" genes. The TOM is a highly robust measure of network interconnectedness (proximity). This proximity is used as input of average linkage hierarchical clustering. Modules are defined as branches of the resulting cluster tree using the dynamic branch cutting approach.[11]

Next, the genes inside a given module are summarized with the module eigengene, which can be considered the best summary of the standardized module expression data.[4] The module eigengene of a given module is defined as the first principal component of the standardized expression profiles. Eigengenes define robust biomarkers[12] and can be used as features in more complex predictive models, including decision trees and Bayesian networks.[12][13] To find modules that relate to a clinical trait of interest, module eigengenes are correlated with that trait, which gives rise to an eigengene significance measure. One can also construct co-expression networks between module eigengenes (eigengene networks), i.e. networks whose nodes are modules.[14]

To identify intramodular hub genes inside a given module, one can use two types of connectivity measures. The first, referred to as $kME_i = \operatorname{cor}(x_i, ME)$, is defined based on correlating each gene with the respective module eigengene. The second, referred to as kIN, is defined as a sum of adjacencies with respect to the module genes. In practice, these two measures are equivalent.[4] To test whether a module is preserved in another data set, one can use various network statistics, e.g. $Z_{summary}$.[6]
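The eigengene and the two connectivity measures can be sketched in Python as follows. This is a simplified illustration, not the WGCNA package's implementation: the function names, the samples-by-genes layout, the simulated module, the sign convention for the eigengene, and the exclusion of self-adjacency from kIN are all assumptions of this sketch.

```python
import numpy as np

def module_eigengene(expr_module):
    """First principal component of the standardized module expression.

    expr_module is assumed to be a samples x genes array.
    """
    z = (expr_module - expr_module.mean(axis=0)) / expr_module.std(axis=0, ddof=1)
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    eigengene = u[:, 0]
    # The sign of a principal component is arbitrary; here it is oriented
    # along the average standardized expression (a convention of this sketch).
    if np.corrcoef(eigengene, z.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

def k_me(expr_module, eigengene):
    """kME: correlation of each gene with the module eigengene."""
    n_genes = expr_module.shape[1]
    return np.array([np.corrcoef(expr_module[:, g], eigengene)[0, 1]
                     for g in range(n_genes)])

def k_in(adjacency_module):
    """kIN: sum of adjacencies to the other genes of the module."""
    return adjacency_module.sum(axis=1) - np.diag(adjacency_module)

# Simulated module: four genes sharing one signal with increasing noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=30)
noise_sd = np.array([0.1, 0.5, 1.0, 2.0])
expr_module = signal[:, None] + noise_sd * rng.normal(size=(30, 4))

me = module_eigengene(expr_module)
kme = k_me(expr_module, me)
adjacency = np.abs(np.corrcoef(expr_module, rowvar=False)) ** 6
kin = k_in(adjacency)
```

Genes with the largest `kme` or `kin` values would be the hub candidates for this module; ranking by either measure is a screening step, not a statistical test.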
Applications
WGCNA has been widely used for analyzing gene expression data (i.e. transcriptional data), e.g. to find intramodular hub genes.[2][15] For example, a WGCNA study identified novel transcription factors associated with the bisphenol A (BPA) dose-response.[16]
It is often used as a data reduction step in systems genetic applications where modules are represented by "module eigengenes".[17][18] Module eigengenes can be used to correlate modules with clinical traits. Eigengene networks are co-expression networks between module eigengenes (i.e. networks whose nodes are modules). WGCNA is widely used in neuroscientific applications[19][20] and for analyzing genomic data including microarray data,[21] single-cell RNA-Seq data,[22][23] DNA methylation data,[24] miRNA data, peptide counts[25] and microbiota data (16S rRNA gene sequencing).[26] Other applications include brain imaging data, e.g. functional MRI data.[27]
R software package
The WGCNA R software package[28] provides functions for carrying out all aspects of weighted network analysis (module construction, hub gene selection, module preservation statistics, differential network analysis, network statistics). The WGCNA package is available from the Comprehensive R Archive Network (CRAN), the standard repository for R add-on packages.
References
- Horvath S (2011). Weighted Network Analysis: Application in Genomics and Systems Biology. New York, NY: Springer. ISBN 978-1-4419-8818-8.
- Langfelder P, Mischel PS, Horvath S, Ravasi T (17 April 2013). "When Is Hub Gene Selection Better than Standard Meta-Analysis?". PLOS ONE. 8 (4) e61505. Bibcode:2013PLoSO...861505L. doi:10.1371/journal.pone.0061505. PMC 3629234. PMID 23613865.
- Zhang B, Horvath S (2005). "A general framework for weighted gene co-expression network analysis" (PDF). Statistical Applications in Genetics and Molecular Biology. 4: 17. CiteSeerX 10.1.1.471.9599. doi:10.2202/1544-6115.1128. PMID 16646834. S2CID 7756201. Archived from the original (PDF) on 2020-09-28. Retrieved 2013-11-29.
- Horvath S, Dong J (2008). "Geometric Interpretation of Gene Coexpression Network Analysis". PLOS Computational Biology. 4 (8) e1000117. Bibcode:2008PLSCB...4E0117H. doi:10.1371/journal.pcbi.1000117. PMC 2446438. PMID 18704157.
- Oldham MC, Langfelder P, Horvath S (12 June 2012). "Network methods for describing sample relationships in genomic datasets: application to Huntington's disease". BMC Systems Biology. 6: 63. doi:10.1186/1752-0509-6-63. PMC 3441531. PMID 22691535.
- Langfelder P, Luo R, Oldham MC, Horvath S (20 January 2011). "Is my network module preserved and reproducible?". PLOS Computational Biology. 7 (1) e1001057. Bibcode:2011PLSCB...7E1057L. doi:10.1371/journal.pcbi.1001057. PMC 3024255. PMID 21283776.
- Dong J, Horvath S (4 June 2007). "Understanding network concepts in modules". BMC Systems Biology. 1: 24. doi:10.1186/1752-0509-1-24. PMC 3238286. PMID 17547772.
- Ranola JM, Langfelder P, Lange K, Horvath S (14 March 2013). "Cluster and propensity based approximation of a network". BMC Systems Biology. 7: 21. doi:10.1186/1752-0509-7-21. PMC 3663730. PMID 23497424.
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL (2002). "Hierarchical organization of modularity in metabolic networks". Science. 297 (5586): 1551–1555. arXiv:cond-mat/0209244. Bibcode:2002Sci...297.1551R. doi:10.1126/science.1073374. PMID 12202830. S2CID 14452443.
- Yip AM, Horvath S (24 January 2007). "Gene network interconnectedness and the generalized topological overlap measure". BMC Bioinformatics. 8: 22. doi:10.1186/1471-2105-8-22. PMC 1797055. PMID 17250769.
- Langfelder P, Zhang B, Horvath S (2007). "Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R". Bioinformatics. 24 (5): 719–20. doi:10.1093/bioinformatics/btm563. PMID 18024473. S2CID 1095190.
- Foroushani A, Agrahari R, Docking R, Chang L, Duns G, Hudoba M, Karsan A, Zare H (16 March 2017). "Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: an introduction to the Pigengene package and its applications". BMC Medical Genomics. 10 (1): 16. doi:10.1186/s12920-017-0253-6. PMC 5353782. PMID 28298217.
- Agrahari R, Foroushani A, Docking TR, Chang L, Duns G, Hudoba M, Karsan A, Zare H (3 May 2018). "Applications of Bayesian network models in predicting types of hematological malignancies". Scientific Reports. 8 (1): 6951. Bibcode:2018NatSR...8.6951A. doi:10.1038/s41598-018-24758-5. ISSN 2045-2322. PMC 5934387. PMID 29725024.
- Langfelder P, Horvath S (2007). "Eigengene networks for studying the relationships between co-expression modules". BMC Systems Biology. 2007 (1): 54. doi:10.1186/1752-0509-1-54. PMC 2267703. PMID 18031580.
- Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu Q, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006). "Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target". PNAS. 103 (46): 17402–17407. Bibcode:2006PNAS..10317402H. doi:10.1073/pnas.0608396103. PMC 1635024. PMID 17090670.
- Hartung T, Kleensang A, Tran V, Maertens A (2018). "Weighted Gene Correlation Network Analysis (WGCNA) Reveals Novel Transcription Factors Associated With Bisphenol A Dose-Response". Frontiers in Genetics. 9: 508. doi:10.3389/fgene.2018.00508. ISSN 1664-8021. PMC 6240694. PMID 30483308.
- Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK, Leonardson A, Castellini LW, Wang S, Champy MF, Zhang B, Emilsson V, Doss S, Ghazalpour A, Horvath S, Drake TA, Lusis AJ, Schadt EE (27 March 2008). "Variations in DNA elucidate molecular networks that cause disease". Nature. 452 (7186): 429–35. Bibcode:2008Natur.452..429C. doi:10.1038/nature06757. PMC 2841398. PMID 18344982.
- Plaisier CL, Horvath S, Huertas-Vazquez A, Cruz-Bautista I, Herrera MF, Tusie-Luna T, Aguilar-Salinas C, Pajukanta P, Storey JD (11 September 2009). "A Systems Genetics Approach Implicates USF1, FADS3, and Other Causal Candidate Genes for Familial Combined Hyperlipidemia". PLOS Genetics. 5 (9) e1000642. doi:10.1371/journal.pgen.1000642. PMC 2730565. PMID 19750004.
- Voineagu I, Wang X, Johnston P, Lowe JK, Tian Y, Horvath S, Mill J, Cantor RM, Blencowe BJ, Geschwind DH (25 May 2011). "Transcriptomic analysis of autistic brain reveals convergent molecular pathology". Nature. 474 (7351): 380–4. doi:10.1038/nature10110. PMC 3607626. PMID 21614001.
- Hawrylycz MJ, Lein ES, Guillozet-Bongaarts AL, Shen EH, Ng L, Miller JA, van de Lagemaat LN, Smith KA, Ebbert A, Riley ZL, Abajian C, Beckmann CF, Bernard A, Bertagnolli D, Boe AF, Cartagena PM, Chakravarty MM, Chapin M, Chong J, Dalley RA, David Daly B, Dang C, Datta S, Dee N, Dolbeare TA, Faber V, Feng D, Fowler DR, Goldy J, Gregor BW, Haradon Z, Haynor DR, Hohmann JG, Horvath S, Howard RE, Jeromin A, Jochim JM, Kinnunen M, Lau C, Lazarz ET, Lee C, Lemon TA, Li L, Li Y, Morris JA, Overly CC, Parker PD, Parry SE, Reding M, Royall JJ, Schulkin J, Sequeira PA, Slaughterbeck CR, Smith SC, Sodt AJ, Sunkin SM, Swanson BE, Vawter MP, Williams D, Wohnoutka P, Zielke HR, Geschwind DH, Hof PR, Smith SM, Koch C, Grant S, Jones AR (20 September 2012). "An anatomically comprehensive atlas of the adult human brain transcriptome". Nature. 489 (7416): 391–399. Bibcode:2012Natur.489..391H. doi:10.1038/nature11405. PMC 4243026. PMID 22996553.
- Kadarmideen HN, Watson-Haigh NS, Andronicos NM (2011). "Systems biology of ovine intestinal parasite resistance: disease gene modules and biomarkers". Molecular BioSystems. 7 (1): 235–246. doi:10.1039/C0MB00190B. PMID 21072409.
- Kogelman LJ, Cirera S, Zhernakova DV, Fredholm M, Franke L, Kadarmideen HN (30 September 2014). "Identification of co-expression gene networks, regulatory genes and pathways for obesity based on adipose tissue RNA Sequencing in a porcine model". BMC Medical Genomics. 7 (1): 57. doi:10.1186/1755-8794-7-57. PMC 4183073. PMID 25270054.
- Xue Z, Huang K, Cai C, Cai L, Jiang CY, Feng Y, Liu Z, Zeng Q, Cheng L, Sun YE, Liu JY, Horvath S, Fan G (29 August 2013). "Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing". Nature. 500 (7464): 593–7. Bibcode:2013Natur.500..593X. doi:10.1038/nature12364. PMC 4950944. PMID 23892778.
- Horvath S, Zhang Y, Langfelder P, Kahn RS, Boks MP, van Eijk K, van den Berg LH, Ophoff RA (3 October 2012). "Aging effects on DNA methylation modules in human brain and blood tissue". Genome Biology. 13 (10): R97. doi:10.1186/gb-2012-13-10-r97. PMC 4053733. PMID 23034122.
- Shirasaki DI, Greiner ER, Al-Ramahi I, Gray M, Boontheung P, Geschwind DH, Botas J, Coppola G, Horvath S, Loo JA, Yang XW (12 July 2012). "Network organization of the huntingtin proteomic interactome in mammalian brain". Neuron. 75 (1): 41–57. doi:10.1016/j.neuron.2012.05.024. PMC 3432264. PMID 22794259.
- Tong M, Li X, Wegener Parfrey L, Roth B, Ippoliti A, Wei B, Borneman J, McGovern DPB, Frank DN, Li E, Horvath S, Knight R, Braun J (2013). "A Modular Organization of the Human Intestinal Mucosal Microbiota and Its Association with Inflammatory Bowel Disease". PLOS ONE. 8 (11) e80702. doi:10.1371/journal.pone.0080702. PMC 3834335. PMID 24260458.
- Mumford JA, Horvath S, Oldham MC, Langfelder P, Geschwind DH, Poldrack RA (1 October 2010). "Detecting network modules in fMRI time series: a weighted network analysis approach". NeuroImage. 52 (4): 1465–76. doi:10.1016/j.neuroimage.2010.05.047. PMC 3632300. PMID 20553896.
- Langfelder P, Horvath S (29 December 2008). "WGCNA: an R package for weighted correlation network analysis". BMC Bioinformatics. 9: 559. doi:10.1186/1471-2105-9-559. PMC 2631488. PMID 19114008.
Weighted correlation network analysis
Overview
Definition and Principles
Weighted correlation network analysis (WGCNA) is a systems biology method that constructs weighted networks from high-dimensional data, such as gene expression profiles, by modeling pairwise correlations between variables (e.g., genes) to identify patterns of co-expression and functional modules. Unlike traditional unweighted networks that use hard thresholding to create binary connections, WGCNA employs soft thresholding to assign continuous connection weights ranging from 0 to 1, preserving the full spectrum of correlation strengths. The network is treated as a graph whose nodes represent variables and whose edges carry weights derived from the correlations, which facilitates the detection of biologically relevant clusters.[4][1]

A core principle of WGCNA is the approximation of scale-free topology in the resulting network, which mimics the structure observed in many biological systems, where a small number of highly connected hubs (high-degree nodes) interact with numerous low-degree nodes. The scale-free fit is quantified by the coefficient of determination $R^2$ between the observed connectivity distribution and a power-law model, with the soft thresholding parameter $\beta$ selected to achieve $R^2 > 0.8$ (often targeting $0.9$). By emphasizing strong correlations while retaining weaker ones through continuous weighting, WGCNA makes module detection less sensitive to noise and improves the identification of coherent functional groups.[4][1]

The basic workflow of WGCNA begins with a data matrix of expression values across samples, followed by computation of pairwise correlations to form a similarity matrix. Correlations are then transformed into an adjacency matrix using soft thresholding, after which modules are detected through hierarchical clustering of the network's topological structure. This process prioritizes scale-free properties so that the network captures essential biological organization without overemphasizing outliers.[1]

Mathematically, WGCNA relies on the Pearson correlation coefficient $r_{ij}$ between variables $i$ and $j$, which measures their co-expression similarity. The adjacency is defined as

$$a_{ij} = |r_{ij}|^{\beta},$$

where $\beta \geq 1$ is the soft thresholding power that amplifies strong correlations and diminishes weak ones while maintaining continuity; $\beta$ is chosen empirically to satisfy the scale-free topology criterion. This formulation allows the network to approximate scale-free properties, with higher $\beta$ values yielding sparser connections.[4][1]
Key Advantages
One key advantage of weighted correlation network analysis (WGCNA) lies in its robustness to noise inherent in high-dimensional biological data, such as gene expression profiles. By employing soft thresholding through the parameter $\beta$, which raises correlation similarities to a power ($a_{ij} = |r_{ij}|^{\beta}$, where $r_{ij}$ is the Pearson correlation and $\beta \geq 1$), WGCNA down-weights weak or spurious connections while preserving the continuous nature of co-expression relationships. This approach reduces false positives compared to binary thresholding methods, as it avoids abrupt cutoffs that can amplify noise in datasets with thousands of variables.[1][4]

WGCNA also aims for biological realism by favoring an approximately scale-free topology in the constructed network, mimicking the power-law degree distributions observed in natural systems like protein interaction networks. The soft threshold $\beta$ is selected such that the network's degree distribution fits a scale-free model, assessed via the linear relationship in a log-log plot of connectivity $k$ versus its probability $P(k)$, i.e., $\log(k)$ vs. $\log(P(k))$, with a high $R^2$ value (typically $> 0.8$). This criterion guides parameter choice and enhances the network's stability and interpretability, distinguishing it from arbitrary thresholding in unweighted approaches.[1][4]

The module-based framework of WGCNA further streamlines analysis by identifying clusters of co-expressed genes as functional units, often representing pathways or biological processes. These modules are detected using topological overlap measures on the weighted adjacency matrix, allowing dimensionality reduction from thousands of individual genes to a handful of module eigengenes, the first principal components capturing module expression patterns. This summarization facilitates downstream tasks like visualization and hypothesis testing without losing key network structure.[1][4]

Integration with external traits represents another strength, enabling the correlation of module eigengenes with phenotypic data, such as disease status or clinical outcomes. This eigengene-trait correlation identifies modules associated with specific biology, prioritizes hubs or entire clusters for further investigation, and supports gene screening for biomarkers.[1]

Empirical studies support these advantages, reporting that WGCNA detects more biologically coherent modules than hard-thresholding methods, with improved functional enrichment in gene ontology terms across microarray datasets from cancer and yeast genetics. For instance, weighted networks yield higher module cohesion and better preservation of co-expression signals, leading to enhanced identification of trait-related pathways compared to unweighted alternatives.[4][5]
Background
Historical Development
Weighted correlation network analysis (WGCNA) originated in the mid-2000s at the University of California, Los Angeles (UCLA), developed by Steve Horvath, a professor of human genetics and biostatistics, along with colleagues including Bin Zhang and Peter Langfelder. The foundational framework was introduced in 2005 by Zhang and Horvath, who proposed a general method for constructing weighted gene co-expression networks to model complex relationships in high-dimensional biological data, emphasizing scale-free topology criteria to mimic real-world biological networks.[2] This work built on earlier efforts in systems biology to move beyond binary correlations, allowing for continuous connection strengths that better capture subtle co-expression patterns.

An early application appeared in 2006, where Oldham et al. applied WGCNA to compare gene co-expression modules across human and chimpanzee brain tissues, demonstrating its utility in evolutionary analyses.[6] By 2007, the method was further refined and applied to quantitative genetics, as in the study by Ghazalpour et al. on mouse weight traits, integrating WGCNA with linkage analysis to identify trait-associated modules.[7]

A pivotal milestone came in 2008 with the release of the WGCNA R package by Langfelder and Horvath, published in BMC Bioinformatics, which formalized the approach for gene expression data analysis and incorporated the topological overlap measure to enhance module detection robustness.[1] This package, distributed through the Comprehensive R Archive Network (CRAN), facilitated widespread adoption by providing accessible tools for network construction, module identification, and eigengene analysis. Subsequent updates refined the topological overlap calculation, building on the support for signed networks and intramodular connectivity measures already present in the initial package release.[1] Post-2015, WGCNA evolved to support multi-omics integration; for instance, methods like multi-WGCNA in 2021 enabled dimensionality reduction across RNA-seq, proteomics, and metabolomics datasets to uncover shared modules.[8]

In the 2020s, adaptations addressed emerging data types, with the initial focus on bulk gene expression shifting toward single-cell RNA sequencing (scRNA-seq) and cross-species comparisons. Tools like hdWGCNA, published in 2023, extended WGCNA to high-dimensional single-cell data, identifying cell-type-specific modules in complex tissues such as the brain.[9] Recent advancements include Python implementations intended to overcome R's scalability limits for large datasets; the pyWGCNA package, released in 2023 and published in Bioinformatics, offers faster computation for RNA-seq module detection using optimized algorithms.[10] In 2024, the CWGCNA R package was introduced to perform causal inference within the WGCNA framework.[11] These developments underscore WGCNA's growth from a gene-centric tool to a versatile framework in systems biology.
Comparison to Unweighted Networks
Traditional unweighted correlation networks construct a binary adjacency matrix where the connection strength between genes $i$ and $j$ is set to 1 if the absolute Pearson correlation coefficient $|r_{ij}|$ exceeds a predefined threshold $\tau$, and 0 otherwise.[1] This approach results in discrete, all-or-nothing connections that can produce cliquey structures, where modules appear as tightly knit groups isolated from the rest of the network, particularly when the threshold is high.[4] Additionally, unweighted networks are highly sensitive to the choice of $\tau$, as varying this parameter drastically alters network topology and connectivity patterns.[1]

Key limitations of unweighted networks include the loss of information from weak but consistent correlations, which may represent biologically relevant interactions in noisy genomic data.[4] They often fail to produce the scale-free topologies characteristic of real biological networks, instead exhibiting degree distributions with exponential tails rather than power-law decay.[1] In noisy datasets, unweighted methods can overestimate hub gene connectivity by including spurious strong correlations while discarding subtler ones.[4]

In contrast, weighted correlation networks address these issues by defining a continuous adjacency $a_{ij} = |r_{ij}|^{\beta}$ (with $\beta \geq 1$), which preserves the hierarchical structure of correlations and incorporates weak connections proportionally to their strength.[1] This soft thresholding enhances module preservation across datasets, as measured by the topological overlap matrix (TOM) dissimilarity, which better captures shared network neighborhoods for clustering.[4] Quantitatively, unweighted networks typically show degree distributions following an exponential form, with fewer hubs and less robustness to perturbations, whereas weighted networks achieve power-law degree distributions $P(k) \propto k^{-\gamma}$, aligning more closely with scale-free properties observed in biological systems.[1]

Empirical studies demonstrate that weighted networks identify more biologically meaningful modules; for instance, in mouse liver gene expression data, weighted approaches detected modules with significantly enriched Gene Ontology (GO) terms, such as glycoprotein biosynthesis ($p = 2 \times 10^{-24}$), outperforming unweighted methods in robustness and functional coherence.[1]
Methodology
Adjacency Matrix Construction
The construction of the adjacency matrix is the foundational step in weighted correlation network analysis (WGCNA), transforming pairwise correlations between network nodes into connection weights that emphasize biologically relevant relationships. The input data typically consist of an expression matrix $X = [x_{ik}]$, where rows correspond to nodes (e.g., genes) and columns to samples (e.g., tissue measurements), with entries $x_{ik}$ representing expression levels. Pairwise correlations are computed between the profiles of nodes $i$ and $j$, most commonly using the Pearson correlation coefficient

$$r_{ij} = \frac{\sum_{k} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k} (x_{ik} - \bar{x}_i)^2}\,\sqrt{\sum_{k} (x_{jk} - \bar{x}_j)^2}},$$

where $x_{ik}$ is the expression of node $i$ in sample $k$, and $\bar{x}_i$ is its mean across samples.[4][1]

For unsigned networks, which treat positive and negative correlations by magnitude regardless of sign, the adjacency matrix $A = [a_{ij}]$ is defined by the soft-thresholding function $a_{ij} = |r_{ij}|^{\beta}$, where $\beta \geq 1$ is a power parameter that amplifies strong correlations while suppressing weak ones, resulting in a continuous weight between 0 and 1. In signed networks, designed to focus on co-activation (positive correlations) in co-expression studies, the adjacency can be defined as $a_{ij} = \left(\frac{1 + r_{ij}}{2}\right)^{\beta}$, which maps correlations to $[0, 1]$ while preserving the influence of the sign; alternatively, the "signed hybrid" form sets $a_{ij} = r_{ij}^{\beta}$ if $r_{ij} > 0$, and $a_{ij} = 0$ otherwise. The choice between unsigned and signed networks depends on the biological question, with signed variants better suited for detecting co-activation modules in processes like gene regulation.[4][1][12]

The soft-thresholding parameter $\beta$ is selected so that the resulting network approximates a scale-free topology, a hallmark of biological networks where a few nodes (hubs) have many connections and most have few. This is achieved by evaluating the scale-free fit index $R^2$ across a range of $\beta$ values (typically tested from 1 to 20): for each $\beta$, the logarithm of the frequency $p(k)$ is regressed on the logarithm of the node connectivity $k$ (the row sums of $A$), and the smallest $\beta$ for which the coefficient of determination $R^2$ reaches a high value, typically about 0.8 to 0.9, is chosen. The metric is computed as

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}},$$

where $SS_{\text{res}}$ is the residual sum of squares from the linear regression on the log-log plot, and $SS_{\text{tot}}$ is the total sum of squares. For gene expression data, $\beta$ often falls in the range 6 to 12, balancing network interconnectedness and biological realism.[4][1][12]

Missing data in the expression matrix can bias correlations and must be addressed prior to adjacency construction, typically through imputation methods such as k-nearest neighbors (e.g., via the impute R package) to estimate absent values based on similar samples or genes. Correlations can otherwise be computed from pairwise complete observations; either way, the preprocessing should prevent missing values from artificially disconnecting nodes in the network.[1][12]
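The scale-free fit index and the choice of the soft-thresholding power described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names, the equal-width binning of connectivities into 10 bins, the 0.8 cutoff, and the random toy data are choices made for this sketch and need not match the behaviour of the WGCNA R package.

```python
import numpy as np

def scale_free_fit(adjacency, n_bins=10):
    """Return (R^2, slope) of the regression of log10 p(k) on log10 k."""
    a = np.asarray(adjacency, dtype=float)
    k = a.sum(axis=1) - np.diag(a)             # connectivity, excluding self-links
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)        # the logarithm needs positive values
    if keep.sum() < 3:                         # too few points for a meaningful fit
        return float("nan"), float("nan")
    log_k = np.log10(centers[keep])
    log_p = np.log10(counts[keep] / counts.sum())
    slope, intercept = np.polyfit(log_k, log_p, 1)
    ss_res = float(np.sum((log_p - (slope * log_k + intercept)) ** 2))
    ss_tot = float(np.sum((log_p - log_p.mean()) ** 2))
    if ss_tot == 0.0:
        return float("nan"), float(slope)
    return 1.0 - ss_res / ss_tot, float(slope)

def pick_soft_threshold(cor, powers=range(1, 21), r2_cut=0.8):
    """Smallest power whose unsigned network has a decreasing fit with R^2 >= r2_cut."""
    for beta in powers:
        r2, slope = scale_free_fit(np.abs(cor) ** beta)
        if slope < 0 and r2 >= r2_cut:
            return beta
    return None  # no tested power met the criterion

# Toy data: 20 samples x 60 genes of pure noise (illustrative only).
rng = np.random.default_rng(1)
expr = rng.normal(size=(20, 60))
cor = np.corrcoef(expr, rowvar=False)
r2_at_6, slope_at_6 = scale_free_fit(np.abs(cor) ** 6)
chosen_beta = pick_soft_threshold(cor)
```

The requirement of a negative slope reflects that a scale-free network has many weakly connected nodes and few hubs. On unstructured noise such as this toy matrix, no power is guaranteed to meet the cutoff, in which case the sketch returns `None` rather than forcing a choice.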
Topological Overlap and Module Detection
In weighted correlation network analysis (WGCNA), the topological overlap measure (TOM) provides a robust quantification of the interconnectivity between pairs of nodes, such as genes, by assessing the extent to which they share connections in the network. Unlike simple adjacency, TOM captures higher-order similarities, making it particularly suitable for weighted networks where edge strengths vary continuously. The measure for nodes $i$ and $j$ is defined as

$$\mathrm{TOM}_{ij} = \frac{\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}},$$

where $a_{ij}$ is the adjacency between $i$ and $j$, the sum is over all nodes $u \neq i, j$, and $k_i = \sum_{u \neq i} a_{iu}$ is the weighted degree (connectivity) of node $i$. This formulation generalizes the unweighted topological overlap to accommodate soft-thresholded correlations, enhancing sensitivity to indirect connections while reducing noise from spurious links.[13]

To identify modules, i.e., clusters of highly interconnected nodes, the dissimilarity matrix is computed as $d_{ij} = 1 - \mathrm{TOM}_{ij}$, serving as a distance metric for hierarchical clustering. Average linkage clustering is typically applied to this dissimilarity, producing a dendrogram that visualizes the hierarchical structure of node similarities. This approach leverages the scale-free topology inherent in many biological networks, allowing for the detection of cohesive groups without assuming binary connections.[1]

Module boundaries are determined using dynamic tree-cutting algorithms, such as the cutreeDynamic function, which partitions the dendrogram based on branch shape and height to automatically identify clusters of varying sizes and densities. This method outperforms static cuts by adapting to the dendrogram's topology, capturing both tight and loose modules while minimizing over- or under-clustering. Subsequently, similar modules are merged if their eigengenes (module summaries) exhibit high correlation, typically above 0.75, to refine the partition and reduce redundancy.[1]
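A minimal Python sketch of the topological overlap computation and the subsequent clustering is given below, using NumPy and SciPy. It is an illustration under stated assumptions: the function names and the toy adjacency are hypothetical, and a static cut into a fixed number of clusters (`fcluster` with `maxclust`) stands in for the dynamic tree cut, which is not implemented here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def topological_overlap(adjacency):
    """Weighted TOM for a symmetric adjacency matrix with entries in [0, 1]."""
    a = np.array(adjacency, dtype=float)
    np.fill_diagonal(a, 0.0)                  # ignore self-connections
    shared = a @ a                            # sum over u of a_iu * a_uj, u != i, j
    k = a.sum(axis=1)                         # weighted degree of each node
    tom = (shared + a) / (np.minimum.outer(k, k) + 1.0 - a)
    np.fill_diagonal(tom, 1.0)                # a node fully overlaps with itself
    return tom

def detect_modules(adjacency, n_modules):
    """Average-linkage clustering of 1 - TOM, cut into n_modules clusters.

    The static cut is a stand-in for dynamic tree cutting (assumption of this sketch).
    """
    dissimilarity = 1.0 - topological_overlap(adjacency)
    condensed = squareform(dissimilarity, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_modules, criterion="maxclust")

# Toy adjacency: two groups of three nodes, strong within, weak between.
adjacency = np.full((6, 6), 0.01)
adjacency[:3, :3] = 0.9
adjacency[3:, 3:] = 0.9
np.fill_diagonal(adjacency, 1.0)

tom = topological_overlap(adjacency)
labels = detect_modules(adjacency, n_modules=2)
```

Because the diagonal is zeroed before the matrix product, the product `a @ a` automatically excludes the terms with $u = i$ or $u = j$. In the toy network the within-group overlap is far larger than the between-group overlap, so the two groups come out as separate modules.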
The module eigengene (ME) represents the primary expression pattern within a module and is calculated as the first principal component of the expression profiles of its constituent nodes, effectively summarizing the module's collective behavior. This low-dimensional summary facilitates downstream analyses by condensing high-dimensional data into interpretable profiles.[1]
Module quality is assessed through intramodular connectivity, which measures how strongly individual nodes correlate with their module eigengene (e.g., via Pearson correlation or weighted variants), with higher average connectivity indicating tighter cohesion. Robustness is further evaluated by varying the soft-thresholding parameter $\beta$ (used in adjacency construction) and re-running module detection; consistent module assignments across $\beta$ values indicate stability against parameter sensitivity.[1]
