Commonsense knowledge (artificial intelligence)

from Wikipedia

In artificial intelligence research, commonsense knowledge consists of facts about the everyday world, such as "Lemons are sour" or "Cows say moo", that all humans are expected to know. Equipping machines with this knowledge is currently an unsolved problem in artificial general intelligence. The first AI program to address commonsense knowledge was Advice Taker, proposed in 1959 by John McCarthy.[1]

Commonsense knowledge can underpin a commonsense reasoning process that attempts inferences such as "You might bake a cake because you want people to eat the cake." A natural language processing component can be attached to the commonsense knowledge base so that it can attempt to answer questions about the world.[2]

Commonsense knowledge also helps to solve problems in the face of incomplete information. Using widely held beliefs about everyday objects, AI systems make commonsense (or default) assumptions about the unknown, much as people do. In an AI system or in English, this is expressed as "Normally P holds", "Usually P" or "Typically P, so assume P". For example, if we know the fact "Tweety is a bird", then because we know the commonly held belief "typically birds fly", we may reasonably assume, without knowing anything else about Tweety, that "Tweety can fly." As more knowledge of the world is discovered or learned over time, the AI system can revise its assumptions about Tweety using a truth maintenance process. If we later learn that "Tweety is a penguin", truth maintenance revises this assumption because we also know that "penguins do not fly".
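The Tweety example can be made concrete with a small sketch. The class below is a hypothetical illustration, not the design of any particular system: it stores "typically" rules per category and lets the most specific category win. It recomputes beliefs on each query rather than keeping the justification records a real truth maintenance system would maintain, so it only imitates the revision behaviour described above.

```python
from dataclasses import dataclass, field


@dataclass
class DefaultReasoner:
    """Toy default reasoner (illustrative only, not a real truth maintenance system)."""
    facts: set = field(default_factory=set)          # (entity, category) pairs
    defaults: dict = field(default_factory=dict)     # (category, property) -> bool
    subclass_of: dict = field(default_factory=dict)  # category -> more general category

    def tell(self, entity, category):
        """Record a newly learned fact such as ("Tweety", "bird")."""
        self.facts.add((entity, category))

    def _depth(self, category):
        """Number of superclass links above a category; deeper means more specific."""
        depth = 0
        while category in self.subclass_of:
            category = self.subclass_of[category]
            depth += 1
        return depth

    def believes(self, entity, prop):
        """Return the default from the most specific known category, or None if unknown."""
        categories = set()
        for ent, cat in self.facts:
            if ent != entity:
                continue
            while cat is not None:          # include all superclasses
                categories.add(cat)
                cat = self.subclass_of.get(cat)
        for cat in sorted(categories, key=self._depth, reverse=True):
            if (cat, prop) in self.defaults:
                return self.defaults[(cat, prop)]
        return None


kb = DefaultReasoner()
kb.subclass_of["penguin"] = "bird"
kb.defaults[("bird", "can_fly")] = True       # "typically birds fly"
kb.defaults[("penguin", "can_fly")] = False   # "penguins do not fly"

kb.tell("Tweety", "bird")
before = kb.believes("Tweety", "can_fly")     # default assumption from "bird"

kb.tell("Tweety", "penguin")
after = kb.believes("Tweety", "can_fly")      # revised by the more specific fact
```

Here `before` is `True` and `after` is `False`: the earlier conclusion is withdrawn once the more specific fact arrives, which is the non-monotonic behaviour the paragraph describes.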

Commonsense reasoning

Commonsense reasoning simulates the human ability to use commonsense knowledge to make presumptions about the type and essence of ordinary situations encountered every day, and to change one's "mind" should new information come to light. This includes reasoning about time, missing or incomplete information, and cause and effect. The ability to explain cause and effect is an important aspect of explainable AI. Truth maintenance algorithms automatically provide an explanation facility because they create elaborate records of presumptions. Compared with humans, all existing computer programs that attempt human-level AI perform extremely poorly on modern "commonsense reasoning" benchmark tests such as the Winograd Schema Challenge.[3] The problem of attaining human-level competency at "commonsense knowledge" tasks is considered likely to be "AI complete" (that is, solving it would require the ability to synthesize a fully human-level intelligence),[4][5] although some oppose this notion and believe compassionate intelligence is also required for human-level AI.[6] Commonsense reasoning has been applied successfully in more limited domains such as natural language processing[7][8] and automated diagnosis[9] or analysis.[10]

Commonsense knowledge base construction

Compiling comprehensive knowledge bases of commonsense assertions (CSKBs) is a long-standing challenge in AI research. Following early expert-driven efforts like CYC and WordNet, significant advances were achieved via the crowdsourced OpenMind Commonsense project, which led to the ConceptNet KB. Several approaches have attempted to automate CSKB construction, most notably via text mining (WebChild, Quasimodo, TransOMCS, Ascent) and by harvesting assertions directly from pre-trained language models (AutoTOMIC). These resources are significantly larger than ConceptNet, though automated construction mostly leaves them of moderately lower quality. Challenges also remain in the representation of commonsense knowledge: most CSKB projects follow a triple data model, which is not necessarily best suited for breaking down more complex natural language assertions. A notable exception is GenericsKB, which applies no further normalization to sentences but retains them in full.

Applications

Around 2013, MIT researchers developed BullySpace, an extension of the commonsense knowledgebase ConceptNet, to catch taunting social media comments. BullySpace included over 200 semantic assertions based around stereotypes, to help the system infer that comments like "Put on a wig and lipstick and be who you really are" are more likely to be an insult if directed at a boy than a girl.[11][12][13]

ConceptNet has also been used by chatbots[14] and by computers that compose original fiction.[15] At Lawrence Livermore National Laboratory, common sense knowledge was used in an intelligent software agent to detect violations of a comprehensive nuclear test ban treaty.[16]

Data

As an example, as of 2012 ConceptNet included 21 language-independent relations,[17] among them:

  • IsA (An "RV" is a "vehicle" | X is an instance of a Y)
  • UsedFor (a "cake tin" is used for "making cakes" | X is used for the purpose Y)
  • HasA (A "rabbit" has a "tail" | X possesses Y element or feature)
  • CapableOf (a "cook" is capable of "making baked goods" | X is capable of doing Y)
  • Desires (a "child" desires "the aroma of baking" | X has a desire for Y)
  • CreatedBy ("cake" is created by a "baker" | X is created by Y)
  • PartOf (a "knife" is part of a "knife set" | X is a part of Y)
  • Causes ("Heat" causes "cooking" | X is what causes Y)
  • LocatedNear (the "oven" is located near the "refrigerator" | X is located near Y)
  • AtLocation (a "cook" can be at a "restaurant" | X is at the location of Y)
  • DefinedAs (a "Cupcake" is defined as a "cake" that also has the qualities of being "small", "baked within a wrapper", and "containing only one area of frosting or icing" | X is defined as Y that also has the properties A, B & C)
  • SymbolOf (a "heart" is a symbol of "affection" | X is a symbolic representation of Y)
  • ReceivesAction ("cake" can receive the action of being "eaten" | X is capable of receiving action Y)
  • HasPrerequisite ("baking" has the prerequisite of obtaining the "ingredients" | X cannot do Y unless A does B)
  • MotivatedByGoal ("baking" is motivated by the goal of "consumption"/"eating" | X has the motivation of Y goal)
  • CausesDesire ("baking" makes you want to "follow recipe" | X causes the desire to do Y)
  • MadeOf ("Cake" is made of "flour"/"eggs"/"sugar"/"oil"/etc | X is made of Y)
  • HasFirstSubevent ("baking" has first subevent "make batter" | To do X the first thing that needs to be done is Y)
  • HasSubevent ("eat" has subevent "swallow" | Doing X will lead to Y event following)
  • HasLastSubevent ("sleeping" has last subevent of "waking" | Doing X ends with the event Y)
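The relations listed above follow the triple data model (start concept, relation, end concept). As a rough sketch of how such assertions might be stored and queried, the snippet below uses a few of the examples from the list. The `Assertion` type, `build_index` and `query` are hypothetical names chosen for illustration and are not part of ConceptNet's own API or data format.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    """One commonsense triple: start concept, relation name, end concept."""
    start: str
    relation: str
    end: str


# A handful of assertions taken from the examples in the list above.
ASSERTIONS = [
    Assertion("RV", "IsA", "vehicle"),
    Assertion("cake tin", "UsedFor", "making cakes"),
    Assertion("rabbit", "HasA", "tail"),
    Assertion("cake", "CreatedBy", "baker"),
    Assertion("cake", "MadeOf", "flour"),
    Assertion("cake", "MadeOf", "eggs"),
    Assertion("cake", "ReceivesAction", "eaten"),
    Assertion("eat", "HasSubevent", "swallow"),
]


def build_index(assertions):
    """Index triples by (start, relation) so lookups do not scan the whole list."""
    index = defaultdict(list)
    for a in assertions:
        index[(a.start, a.relation)].append(a.end)
    return index


def query(index, start, relation):
    """Return every end concept asserted for (start, relation), or an empty list."""
    return index.get((start, relation), [])


index = build_index(ASSERTIONS)
cake_ingredients = query(index, "cake", "MadeOf")   # ["flour", "eggs"]
unknown = query(index, "cake", "IsA")               # [] because nothing is asserted
```

The empty result for `("cake", "IsA")` shows the incompleteness issue in miniature: a missing triple is indistinguishable from a false one unless the system adds some default or closed-world assumption.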

from Grokipedia
Commonsense knowledge in artificial intelligence refers to the extensive, largely tacit repository of facts, causal relations, physical intuitions, social conventions, and probabilistic expectations about everyday phenomena that humans draw upon intuitively for reasoning, decision-making, and interpretation of events, such as recognizing that unsupported objects fall due to gravity or that breaking a promise typically erodes trust.[1] This knowledge, often acquired incrementally through lived experience rather than explicit instruction, underpins human-level intelligence but eludes comprehensive capture in AI systems, which rely predominantly on statistical patterns from training data rather than grounded understanding.[2] Despite advances in scale, such as large language models exhibiting superficial fluency in commonsense tasks, persistent failures on targeted benchmarks reveal fundamental gaps in robust, adaptable reasoning, including handling novel scenarios or counterfactuals without rote memorization.[3][4]

The pursuit of commonsense in AI dates to the field's origins, where it was identified as a core bottleneck for achieving general intelligence, contrasting with narrow successes in domains like chess or image classification that bypass everyday world modeling.[1] Key challenges include the sheer volume and vagueness of required knowledge (estimated in billions of interlinked facts), the absence of clear formal ontologies for non-monotonic and defeasible reasoning, and the causal disconnect in data-driven methods that prioritize correlation over mechanistic insight.[5] Empirical evaluations via benchmarks like CommonsenseQA, which probe multiple-choice questions demanding implicit world knowledge, underscore these limits: top models score around 60-70% accuracy, far below human baselines exceeding 90%, highlighting brittleness in edge cases involving temporal dynamics or physical plausibility.[3][4]

Notable initiatives span symbolic and neural paradigms, with Cyc representing a decades-long effort to manually curate over a million axioms into a vast knowledge base for inference, enabling applications in planning and verification but criticized for scalability constraints.[6] More recent hybrid approaches, such as COMET, leverage transformer models pretrained on text corpora to infer relational triples (e.g., "person eats food" implies "person feels full"), generating millions of commonsense edges for downstream tasks like question answering, though reliant on the quality of underlying linguistic data.[7] These developments have spurred progress in specific subdomains, like event causality or social inference, yet controversies persist over whether emergent behaviors in massive models truly encode causal realism or merely mimic patterns, as failures in controlled physical simulations reveal underlying simulation gaps rather than genuine comprehension.[4] Ongoing research emphasizes integrating structured knowledge graphs with probabilistic reasoning to bridge these divides, prioritizing empirical validation over unsubstantiated claims of "human-like" capabilities.[5]

Definition and Scope

Core Concepts and Distinctions

Commonsense knowledge in artificial intelligence denotes the vast array of implicit, experiential understandings that enable humans to interpret and interact with the world intuitively, encompassing intuitive grasp of physical dynamics, social expectations, and causal sequences without reliance on explicit formalization.[8][1] This knowledge is probabilistic and contextually adaptive, grounded in causal mechanisms observable in everyday phenomena; for example, humans presume that an unsupported object will descend due to gravitational pull, or that knocking precedes entering a closed room to honor interpersonal boundaries, inferences drawn from repeated environmental interactions rather than encoded axioms.[9][2]

In distinction from domain-specific knowledge, which comprises narrow, explicitly codified facts and procedures tailored to constrained arenas (such as the algebraic notations and legal maneuvers governing chess), commonsense integrates overarching causal principles that preclude implausible outcomes, like the physical impossibility of a pawn achieving flight absent propulsion or lift.[10] Domain expertise permits mastery within bounded rules but falters in cross-context application, whereas commonsense affords seamless extrapolation, revealing AI's limitations when systems excel at pattern-matching trivia yet stumble on holistic causal denial, as in failing to reject aerial pawn motion purely from rule adherence.[1]

Central to commonsense is its tacit dimension, whereby much of this capability resists full articulation or systematic enumeration, residing in subsymbolic heuristics shaped by embodiment and trial. Michael Polanyi formalized this in 1966, positing that human knowing inherently involves unverbalizable commitments and integrations, where "we know more than we can tell," complicating efforts to distill it into exhaustive symbolic representations.[11]

This foundational knowledge underpins human cognitive resilience, permitting inference amid ambiguity and novelty, in contrast to AI's proneness to failure in unscripted variances. Winograd schemas exemplify the requisite depth, deploying minimal textual shifts to hinge pronoun disambiguation on latent world models; consider "The city councilmen refused the demonstrators a permit because they feared violence," where "they" denotes the councilmen, hinging on the causal implausibility of officials endorsing disruption, a resolution opaque to syntax alone but evident through commonsense priors on institutional conduct.[12][13]

Importance for General Intelligence

The lack of commonsense knowledge remains a critical barrier to artificial general intelligence (AGI). Such knowledge enables systems to extrapolate from sparse data to novel contexts through intuitive causal understanding rather than rote statistical associations, thereby approximating the versatile human-like adaptability Alan Turing outlined in his 1950 imitation game framework for assessing machine intelligence.[14] Absent this capability, AI remains narrow, excelling in data-rich domains but faltering in open-ended reasoning that demands implicit world models, such as predicting object interactions or social contingencies without explicit training examples.[15]

Empirical assessments underscore this limitation: while GPT-4, released in March 2023, achieved leading scores on knowledge-intensive benchmarks like MMLU (around 86% accuracy), it demonstrated pronounced weaknesses in physical reasoning tasks, including multimodal scenarios involving dynamics like collisions or stability, where zero-shot performance revealed exploitable failure modes in novel configurations.[16] Similarly, on abstraction-heavy tests such as the ARC benchmark, GPT-4's scores hovered below 50%, far short of human baselines exceeding 80%, highlighting brittleness against adversarial perturbations that commonsense would intuitively mitigate.[17]

These gaps manifest in phenomena like hallucinations, where large language models generate plausible but causally incoherent outputs (such as fabricated event sequences that violate physical laws) due to overreliance on surface-level correlations absent underlying causal mechanisms, with evidence from targeted interventions showing that explicit causal modeling reduces such errors by enforcing logical consistency.[18] Cognitive scientist Gary Marcus has critiqued this persistence, noting in early 2025 analyses that despite scaling, contemporary systems remain deficient in core commonsense elements like compositionality and factuality, perpetuating vulnerabilities in reasoning over untrained scenarios.[19]

For AGI-aligned safety, commonsense is indispensable in unstructured environments, where AI must navigate unscripted physical or social interactions; without it, systems risk hazardous misjudgments, such as disregarding intuitive priors like object permanence or gravitational effects, as observed in robotic deployments lacking integrated world knowledge, thereby compromising reliability in real-world autonomy.[20] This causal grounding not only curbs erroneous actions but also fosters robust generalization, distinguishing superficial benchmark proficiency from the resilient intelligence requisite for AGI.

Historical Development

Early Foundations (1950s-1980s)

In 1959, John McCarthy introduced the concept of the "Advice Taker," a proposed program intended to enable machines to solve problems through deductive reasoning from advice expressed in formal logical sentences, thereby incorporating elements of commonsense knowledge to handle goals like navigation or object manipulation.[21] This framework highlighted the necessity for AI systems to possess not just algorithmic computation but an ability to internalize and apply human-like intuitive rules, distinguishing it from theorem-proving approaches by allowing dynamic advice integration without exhaustive pre-programming.[21]

McCarthy and Patrick Hayes, in their 1969 analysis, critiqued prevailing logic-based AI methodologies for overlooking "naive physics" (the everyday, non-formalized understanding of physical interactions that humans rely on implicitly) and formally posed the frame problem, which involves specifying, in a computationally tractable way, the aspects of a situation that remain unaffected by an action amid vast possible inferences.[22] Their argument underscored that pure logical deduction, without mechanisms to prune irrelevant considerations, leads to intractable reasoning explosions, as systems must otherwise evaluate endless frames of unchanged conditions.[22]

Terry Winograd's SHRDLU system, published in 1972, represented an early empirical advance by enabling natural language processing and action planning within a constrained "blocks world" micro-domain, where it could interpret commands like "pick up a big red block" and execute them via a virtual robot arm, achieving coherent interactions through procedural and semantic representations.[23] However, SHRDLU's confinement to this toy environment exposed fundamental scalability issues, as extending its hand-crafted knowledge to open-ended real-world tasks demanded encoding prohibitive volumes of commonsense priors about gravity, support relations, and unintended consequences.[23]

McCarthy's 1977 elaboration on these epistemological hurdles emphasized that AI reasoning requires modal operators and situational primitives to manage non-monotonic updates, in which beliefs change with new evidence, preventing the frame problem from overwhelming systems with combinatorial irrelevancies absent human-like filtering.[24]

Concurrently, rule-based expert systems such as MYCIN, operationalized by 1976 for bacterial infection diagnosis, demonstrated domain-specific efficacy with approximately 65% accuracy in therapy recommendations through 450+ heuristic rules derived from medical experts.[25] Yet these systems exhibited brittleness beyond their narrow scopes, frequently erring on anomalous cases or requiring manual overrides for unarticulated assumptions about patient context and physical causality, illustrating the inadequacy of explicit rule sets without embedded commonsense to handle edge conditions or knowledge gaps.[25][26]

Symbolic Era Initiatives (1990s-2000s)

The Cyc project, directed by Douglas Lenat and transitioning to Cycorp in 1994, exemplified symbolic efforts to hand-code commonsense knowledge during the 1990s, compiling a formal ontology of approximately 100,000 general concepts augmented by over one million logical axioms by the mid-decade.[27] These axioms encoded implicit rules for inference, such as temporal relations and basic causality (e.g., "if an event causes a change, it precedes the change"), facilitating limited automated reasoning in controlled scenarios like taxonomic classification and simple planning tasks.[27] By the early 2000s, the base had expanded to several million assertions, yet required ongoing expert input at a rate of dozens per day per ontologist, underscoring the labor-intensive nature of achieving even modest coverage.[28]

Complementing Cyc's top-down approach, the Open Mind Common Sense (OMCS) initiative, started in 1999 at MIT's Media Lab, crowdsourced free-form statements of everyday knowledge from volunteers online, collecting over 700,000 sentences by 2002 that captured intuitive relations like object affordances and social norms.[29] This bottom-up method yielded diverse, real-world insights unattainable through isolated expert encoding, serving as a foundation for later structured graphs like ConceptNet, but demanded substantial post-processing to extract usable triples amid redundancy and vagueness (e.g., filtering subjective assertions via redundancy heuristics).[29]

Symbolic initiatives achieved formal rigor in domains such as ontological hierarchies, where Cyc's inference engine resolved ambiguities in semantic networks, but exhibited brittleness in handling exceptions, context shifts, or novel causal chains. These issues persisted despite millions of rules, as systems faltered on everyday scenarios requiring intuitive physical or intentional modeling.[1] Critics, including analyses of Cyc's performance, highlighted that hand-crafted rules failed to scale combinatorially, covering only fragments of the vast, tacit knowledge humans deploy effortlessly, with inference often grinding to a halt on under-specified inputs.[1] These gaps in causal realism and generalization prompted recognition that pure symbolic encoding, while enabling prototype applications in expert systems, could not autonomously approximate human commonsense breadth without infeasible expansion.[1]

Data-Driven Transition (2010s-2020s)

The 2010s marked a pivotal shift in commonsense knowledge research toward data-driven paradigms, propelled by advances in machine learning and the scalability of crowdsourced datasets. Traditional symbolic methods gave way to statistical approaches that prioritized learning patterns from vast corpora, enabling systems to approximate commonsense through probabilistic inference rather than hand-engineered rules. This transition was facilitated by the explosion of web-scale data and distributed annotation platforms like Amazon Mechanical Turk, which allowed for the rapid assembly of knowledge bases capturing everyday associations and inferences.[30]

Crowdsourced knowledge bases exemplified this era's emphasis on scale. ConceptNet, building on its foundational structure, underwent major expansions through versions like 5.5 (circa 2016) and 5.7 (released April 2019), incorporating crowdsourced contributions across 1,300 languages and growing to encompass over 36 million assertions via relational edges derived from linguistic patterns and open collaborations.[31] Similarly, the ATOMIC dataset, introduced in 2019, compiled 877,000 textual if-then relations for event-centric commonsense, focusing on social and physical inferences such as causes, effects, and prerequisites, crowdsourced from annotators to cover everyday scenarios beyond taxonomic knowledge.[32] These resources shifted focus from expert-curated ontologies to empirically derived graphs, enabling machine learning models to extract relational priors at unprecedented volumes.

Benchmarks further catalyzed progress by quantifying gaps in commonsense capabilities. The GLUE benchmark, launched in 2018, aggregated tasks involving inference and entailment that implicitly tested commonsense elements, such as recognizing textual implications requiring world knowledge.[33] SuperGLUE, introduced in 2019, intensified this with harder tasks like COPA (choice of plausible alternatives) and the Winograd Schema Challenge, which demand causal and contextual reasoning, exposing how models faltered on adversarial examples despite training on massive data.[34] HellaSwag, also from 2019, targeted next-sentence prediction with 70,000 adversarially generated examples to minimize dataset biases; humans reach about 95% accuracy on it while early models scored under 50%, highlighting reliance on superficial statistical cues over true predictive understanding.[35]

However, this correlational focus invited critiques regarding causal deficits and brittleness. Data-driven systems often memorized dataset artifacts, such as positional heuristics or co-occurrence biases, rather than internalizing causal mechanisms, leading to poor generalization across domains, as evidenced by fine-tuned models failing to transfer commonsense across benchmarks in 2020s analyses.[36] Empirical studies underscored how spurious correlations in training data propagated errors, with models exploiting non-causal shortcuts in tasks like those in SuperGLUE, prioritizing pattern matching over the underlying physical or social realities essential for robust inference.[37] This paradigm's successes in narrow metrics thus revealed foundational limits, prompting calls for integrating causal structures to transcend mere statistical approximation.

Core Challenges

Nature and Acquisition of Commonsense

Commonsense knowledge consists of a broad, largely tacit body of implicit facts, assumptions, and causal expectations about the physical, social, and psychological world that humans intuitively apply without explicit instruction. This knowledge is fuzzy and open-ended, involving probabilistic defaults (such as the typical fragility of glass under impact or the social impropriety of interrupting a conversation) rather than rigid axioms, making it resistant to complete formalization.[38] Unlike explicit rules in narrow domains, much of commonsense remains unarticulated, relying on innate priors and experiential accumulation that enable efficient inference in novel situations.

Humans acquire commonsense through embodied interaction with their environment, beginning in infancy via sensorimotor exploration that grounds abstract concepts in physical reality.[39] This process leverages few-shot learning, where children infer causal structures, like object permanence or basic intentionality, from sparse observations, supported by core cognitive biases toward intuitive physics and psychology. AI systems, deprived of such embodiment and reliant on textual data, cannot replicate this grounding, resulting in disconnected representations that falter on tasks requiring physical or causal intuition, such as predicting outcomes of unseen mechanical interactions.

Acquisition faces inherent hurdles from context-dependence, where interpretations shift dynamically (e.g., "green" denoting color in visual descriptions but metaphorical envy in emotional contexts), demanding situational disambiguation beyond lexical ambiguity resolution.[40] Commonsense's probabilistic essence further complicates matters, as it accommodates exceptions and uncertainties (e.g., birds typically fly, but penguins do not), contrasting with deterministic logic's all-or-nothing inferences.[38] Data-driven AI methods expose voracious requirements: models need thousands to millions of examples for perceptual categories humans master in 1–4 trials, scaling to billions of tokens for rudimentary causal or social intuitions.

Large language models' purported emergent commonsense via scaling parameters and data has proven illusory, with performance collapsing in out-of-distribution settings that probe genuine causal grasp rather than pattern matching. These systems succeed on benchmark phrases mimicking training distributions but systematically err on relational logic, physical counterfactuals, or novel phrasings, indicating memorized heuristics over robust acquisition.[41] Such failures persist despite trillions of training tokens, highlighting scale's inadequacy without structured priors or grounding.

Representation and Encoding Difficulties

Common approaches to encoding commonsense knowledge in AI systems include knowledge graphs, where entities are represented as nodes connected by edges denoting relations such as "is-a" or "used-for"; frames, introduced by Marvin Minsky in 1974 as data structures capturing stereotyped situations with slots for expected attributes and defaults; and ontologies, which provide formal hierarchies of concepts and axioms to define semantic relationships.[42][43] These structures aim to formalize the implicit intuitions humans rely on for everyday reasoning, but they impose rigid schemas that contrast with the fluid, context-dependent nature of human cognition.[44]

A primary difficulty arises from linguistic and conceptual ambiguity, such as polysemy, where terms like "bank" (river edge or financial institution) require disambiguation through contextual cues that static encodings struggle to capture without exhaustive, branching representations. This leads to brittle mappings, as single-node or edge assignments fail to encode multiple valid interpretations inherent in natural language. Ontologies exacerbate this by enforcing predefined classes, limiting expressivity for nuanced, overlapping concepts prevalent in commonsense domains.

Incompleteness compounds these issues, as no finite graph or frame system can exhaustively cover the vast, open-ended scope of commonsense relations; for instance, real-world dynamics demand continual updates for evolving contexts, yet most encodings remain static and prone to gaps in relational coverage. Knowledge graphs like ConceptNet, with over 36 million multilingual edges linking concepts via assertions such as "CapableOf" or "HasSubevent," illustrate shallow encodings that prioritize breadth over depth, often omitting causal hierarchies or probabilistic qualifiers essential for realistic scenarios.[45][46]

Critiques highlight how such simplifications ignore the need for hierarchical structures integrating causal dependencies, rendering representations vulnerable to real-world flux where human intuition implicitly models probabilistic and temporal variations. Surveys from 2023 report persistent incompleteness in commonsense knowledge graphs, with completion techniques revealing up to 90% missing links in benchmark subsets, which points to the brittleness of non-dynamic formats.[47][48]
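To illustrate the frame idea mentioned above (slots with expected attributes and defaults), here is a minimal sketch. The `Frame` class, its slot names and the room/cellar example are assumptions made for illustration; they are not Minsky's notation or any library's API.

```python
class Frame:
    """A stereotyped situation with named slots; missing slots fall back to the parent frame."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        """Look up a slot locally first, then walk up the parent chain for a default."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(f"no value or default for slot {slot!r}")


# Generic expectations about rooms act as defaults.
room = Frame("room", has_door=True, has_window=True)

# A more specific frame overrides one default and inherits the other.
cellar = Frame("cellar", parent=room, has_window=False)

cellar_has_door = cellar.get("has_door")      # inherited default from "room"
cellar_has_window = cellar.get("has_window")  # overridden locally
```

The sketch also shows the rigidity the section criticises: every expectation has to be anticipated as a named slot, and a slot nobody thought to define simply raises an error.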

Reasoning and Causal Inference Gaps

Artificial intelligence systems excel at pattern recognition from vast datasets but falter in causal inference, mistaking correlations for underlying mechanisms and thus struggling with counterfactual scenarios that require intervening on causes. For example, large language models (LLMs) often fail to reason that objects like apples would remain aloft in the absence of gravity, as their predictions derive from statistical associations in training data rather than comprehension of physical causation. This limitation manifests in an inability to simulate "what-if" interventions, where altering a causal variable should propagate effects mechanistically, yet AI outputs revert to memorized probabilistic outcomes.[49]

The Winograd Schema Challenge, proposed by Hector Levesque in 2011 and named after Terry Winograd, whose 1972 work supplied the original example sentence, tests such gaps through pronoun resolution tasks demanding causal commonsense, such as identifying whether a surface caused a cylinder to roll based on contextual mechanics versus mere co-occurrence.[12] Early systems and modern LLMs alike exhibit persistent failures here, with performance hovering below human levels even in the 2020s, as pronoun ambiguity resolution collapses without implicit causal modeling of events like support or motion. These schemas underscore how AI lacks the intuitive physics priors humans acquire developmentally, leading to overgeneralization from superficial linguistic cues.[2]

Benchmark evaluations amplify these inference shortcomings; BIG-bench's causal judgment tasks, introduced in 2022, assess LLMs' capacity to discern causation from correlation in narrative scenarios, yielding accuracies often below 60% for frontier models like PaLM, far short of robust mechanistic understanding. Subsets probing counterfactual validity reveal similar brittleness, where models confuse temporal precedence with causality or fail to negate effects under hypothetical interventions.

Gary Marcus has critiqued this as stemming from architecture-level deficits, arguing that without engineered priors for core knowledge, such as object permanence or intuitive causality, AI cannot achieve reliable inference beyond data echoes, a view reinforced by empirical regressions in novel perturbations.[50]

Post-2023 analyses of abstract reasoning further expose scalability barriers in causal tasks; evaluations adapting ConceptNet for relational inference show LLMs attaining superficial accuracies through memorization but collapsing on unseen causal chains, with qualitative probes indicating no emergent grasp of mechanisms like transitivity in physical interactions.[51] Studies on narrative causal reasoning confirm long-sequence failures, where accumulating events overwhelm correlative heuristics, yielding error rates exceeding 40% in multi-step counterfactuals and evidencing no path to human-like causal realism without fundamental redesign. These gaps persist despite parameter scaling, as inference remains associational, prone to adversarial exploits that invert causal directions without altering surface statistics.
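The difference between association and intervention discussed in this section can be shown with a toy simulation. The model below is entirely hypothetical: a hidden common cause `z` drives both `x` and `y`, while `x` has no effect on `y` at all. Observed data still shows a strong link between `x` and `y`, which disappears once `x` is set by intervention.

```python
import random


def sample(rng, do_x=None):
    """Draw one (x, y) pair from a toy causal model: z -> x and z -> y, no x -> y edge."""
    z = rng.random() < 0.5                        # hidden common cause
    if do_x is None:
        x = z if rng.random() < 0.9 else not z    # x usually copies z
    else:
        x = do_x                                  # intervention: x set from outside, ignoring z
    y = z if rng.random() < 0.9 else not z        # y usually copies z and never looks at x
    return x, y


def gap_in_y(rng, n, intervene):
    """Estimate P(y | x=True) - P(y | x=False), observationally or under intervention."""
    counts = {True: [0, 0], False: [0, 0]}        # x -> [samples seen, times y was True]
    for i in range(n):
        do_x = (i % 2 == 0) if intervene else None
        x, y = sample(rng, do_x)
        counts[x][0] += 1
        counts[x][1] += y
    p_true = counts[True][1] / counts[True][0]
    p_false = counts[False][1] / counts[False][0]
    return p_true - p_false


rng = random.Random(0)
observed_gap = gap_in_y(rng, 20000, intervene=False)    # large: x and y are correlated
intervened_gap = gap_in_y(rng, 20000, intervene=True)   # near zero: x does not cause y
```

In expectation the observational gap is 0.82 - 0.18 = 0.64 while the interventional gap is 0. A purely associational learner trained on the observed pairs would predict `y` from `x` quite well and still be wrong about what happens when `x` is changed.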

Primary Approaches

Symbolic and Knowledge-Based Methods

Symbolic and knowledge-based methods in artificial intelligence represent commonsense knowledge through explicit symbolic structures, such as logical predicates, rules, and ontologies, enabling rule-based inference mechanisms like deduction and abduction to simulate human-like reasoning. These approaches encode facts, relations, and heuristics in formal languages, contrasting with implicit pattern recognition by prioritizing verifiable logical chains over probabilistic approximations.[52]

Hand-coding techniques exemplify this paradigm, as seen in projects employing predicate logic to formalize millions of axioms capturing everyday assumptions, such as temporal persistence or physical constraints. For instance, systems built on first-order logic facilitate forward and backward chaining to derive conclusions from encoded premises, allowing transparent tracing of inferential steps. Development of such bases often spans decades due to the manual curation required, with efforts yielding large but incomplete repositories after extensive expert labor.[52]

The primary strength of these methods lies in their interpretability and capacity for causal transparency, where rules explicitly model dependencies and mechanisms, enabling verification of reasoning paths against first-principles causality rather than mere correlations. This rigor supports applications demanding accountability, such as safety-critical deductions, by avoiding the opacity inherent in data-driven alternatives.[53]

However, hand-coding imposes severe scalability limits through the "knowledge acquisition bottleneck," where human experts must articulate and verify vast domains of implicit commonsense, resulting in protracted timelines and persistent gaps in coverage for nuanced or context-dependent scenarios. Maintenance demands ongoing manual updates to incorporate new insights, exacerbating costs and risking obsolescence in dynamic environments.[54][55]

Graph-based representations complement hand-coding by structuring knowledge as interconnected nodes and edges denoting relations like "causes" or "partOf," facilitating relational queries and path-based inference for commonsense retrieval. These graphs support multilingual coverage through aggregated sources, enabling systems to traverse links for associative reasoning, such as inferring tool uses from functional ties.[45][56]

Despite efficiencies in querying, graph methods suffer from incompleteness in representing sequential or probabilistic events, often missing chained causal dynamics due to reliance on static, crowdsourced assertions that overlook rare or evolving patterns. This leads to brittle performance in inference engines handling real-world variability, underscoring the trade-off between structured explicitness and exhaustive empirical breadth.[45][53]
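
Forward chaining with a traceable derivation, as described above, can be sketched in a few lines. This is a minimal propositional illustration under stated assumptions: the rule set and fact names (such as `dropped(cup)`) are hypothetical, the atoms are plain strings rather than first-order terms with variables, and the code does not reflect the internals of any particular system.

```python
def forward_chain(facts, rules):
    """Apply rules until no new fact can be derived.

    facts: iterable of atom strings known to be true.
    rules: list of (premises, conclusion) pairs, premises being a tuple of atoms.
    Returns (known, trace), where trace records each derivation step in order.
    """
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                trace.append((tuple(premises), conclusion))
                changed = True
    return known, trace


# Hypothetical hand-coded rules about a dropped glass cup.
RULES = [
    (("dropped(cup)",), "hits_floor(cup)"),
    (("is_glass(cup)", "hits_floor(cup)"), "breaks(cup)"),
    (("breaks(cup)",), "unusable(cup)"),
]

if __name__ == "__main__":
    known, trace = forward_chain({"is_glass(cup)", "dropped(cup)"}, RULES)
    for premises, conclusion in trace:
        print(" & ".join(premises), "=>", conclusion)
```

The returned trace is what gives this style of inference its transparency: each conclusion is paired with the premises that produced it. The example also hints at the acquisition bottleneck, since every rule had to be written by hand.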

Statistical and Machine Learning Techniques

Statistical and machine learning techniques for commonsense knowledge in AI primarily rely on learning probabilistic associations from large-scale data to infer everyday relations, events, and inferences, contrasting with rule-based symbolic approaches by emphasizing pattern recognition over explicit logic. These methods leverage neural architectures, such as transformers, to embed concepts into vector spaces or complete knowledge graphs by predicting likely connections based on training distributions. For instance, COMET, introduced in 2019, employs transformer models fine-tuned on existing commonsense knowledge bases like ConceptNet and ATOMIC to automatically generate relational triples (e.g., "personX buys X → receives Y"), enabling scalable expansion of graphs with inferred everyday knowledge.[7]

A key strength lies in the volume and efficiency of generation: COMET produces inferences that achieve approximately 77% alignment with human judgments in quality assessments, demonstrating effectiveness for in-distribution completion tasks where training data patterns suffice.[7] However, these embedding and graph-based methods exhibit brittleness outside trained distributions, as evidenced by sharp performance declines in out-of-distribution evaluations; for example, transformer-derived inferences fail to generalize to novel scenarios or abstract variations, prioritizing memorized correlations over adaptable reasoning.

In large language models (LLMs), pre-training on web-scale corpora (often trillions of tokens from uncurated internet text) allows statistical approximation of commonsense through next-token prediction, enabling inferences on benchmarks like physical dynamics or social norms via fine-tuning on labeled tasks.[57] This data-driven paradigm scales with compute and volume, but inherits biases from source material, such as cultural skews or factual inconsistencies in web content, which amplify erroneous "commonsense" outputs in downstream applications.[58]

Critiques highlight fundamental causal weaknesses: these techniques capture surface-level statistical dependencies rather than underlying mechanisms, leading to failures in counterfactual reasoning or abstract generalization, as shown in 2025 benchmarks where LLMs underperform on causal event abstraction despite strong in-domain recall.[59] Empirical studies reveal persistent gaps in explicit causal inference, with models confabulating plausible but unverifiable chains due to over-reliance on frequency over verifiability, underscoring the need for causal structures beyond correlational learning.[60]
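
To illustrate what "embedding concepts into vector spaces" to complete a knowledge graph can look like, the sketch below uses a translation-based score in the style of TransE, where a triple is plausible if head + relation lands near tail. This is a swapped-in illustration, not COMET's method (COMET generates text with a transformer). The two-dimensional vectors are hand-picked assumptions; a real system would learn them from data.

```python
import math

# Hand-picked 2-D vectors, purely illustrative (a trained model would learn these).
ENTITY = {
    "lemon": (1.0, 0.0),
    "sugar": (0.0, 1.0),
    "sour": (1.0, 2.0),
    "sweet": (0.0, 3.0),
}
RELATION = {"HasProperty": (0.0, 2.0)}


def score(head, relation, tail):
    """TransE-style score: negative distance between head + relation and tail."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    translated = (h[0] + r[0], h[1] + r[1])
    return -math.dist(translated, t)


def complete(head, relation, candidates):
    """Rank candidate tails for the query (head, relation, ?), best first."""
    return sorted(candidates, key=lambda tail: score(head, relation, tail), reverse=True)


if __name__ == "__main__":
    print(complete("lemon", "HasProperty", ["sweet", "sour"]))
    print(complete("sugar", "HasProperty", ["sweet", "sour"]))
```

The ranking depends entirely on where the vectors sit, which is one way to see the brittleness noted above: a concept whose vector was never shaped by relevant training examples yields an arbitrary ranking rather than a reasoned one.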

Hybrid and Emerging Paradigms

Hybrid approaches in artificial intelligence seek to merge symbolic methods, which provide structured logical reasoning, with statistical techniques, such as neural networks, to address limitations in commonsense knowledge acquisition and application. Neuro-symbolic systems, for instance, embed logical rules into differentiable frameworks, allowing neural components to learn probabilistic patterns while enforcing symbolic constraints for interpretability and consistency.[61] This integration aims to enable scalable reasoning over incomplete or noisy data, where pure symbolic systems falter due to knowledge incompleteness and statistical models lack causal transparency.[62]

Logic Tensor Networks (LTN), formalized in 2020, exemplify this paradigm by representing logical formulas as tensor operations within neural architectures, facilitating joint optimization of data fitting and rule satisfaction through gradient-based learning.[63] LTN supports commonsense tasks by querying hybrid knowledge bases, where satisfaction degrees quantify adherence to axioms like transitivity or causality, achieving reported accuracies exceeding 90% in controlled semantic parsing benchmarks as of 2021 evaluations.[64] Extensions in neuro-symbolic theorem proving have applied these to conversational commonsense inference, parsing unstated assumptions via combined neural pattern recognition and symbolic deduction, with user studies validating plausibility in evoking implicit knowledge.[65]

Emerging grounded learning paradigms emphasize embodiment to anchor abstract commonsense in perceptual-motor experiences, often via simulated environments that model physical dynamics for robotics. Post-2023 advancements tie sensory inputs to knowledge graphs through reinforcement learning in virtual worlds, enabling agents to infer causal relations from action-outcome sequences rather than textual corpora alone.[66] These methods promote causal realism by requiring models to predict intervention effects in grounded scenarios, such as object manipulation chains, fostering robustness beyond correlation-based predictions.[67]

Preliminary empirical results from 2024 neuro-symbolic prototypes show gains in causal chain prediction, with hybrid models outperforming neural baselines by 15-20% in sequence consistency on synthetic reasoning datasets, attributed to explicit rule enforcement.[62] However, these remain confined to narrow domains and unproven at population-scale commonsense coverage, underscoring the necessity for benchmarks evaluating counterfactual reasoning and long-horizon causality to validate broader efficacy.[61]
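
The notion of a "satisfaction degree" for an axiom can be illustrated with a small real-valued logic sketch. The code below is not the LTN library and makes several simplifying assumptions: truth degrees are fixed numbers rather than outputs of trainable neural predicates, implication uses the Reichenbach operator (1 - a + a*b), and the universal quantifier is approximated by an arithmetic mean. The individuals and their truth degrees are hypothetical.

```python
def implies(a, b):
    """Reichenbach fuzzy implication on truth degrees in [0, 1]."""
    return 1.0 - a + a * b


def forall(degrees):
    """Universal quantifier approximated as the arithmetic mean of degrees."""
    degrees = list(degrees)
    return sum(degrees) / len(degrees)


# Hypothetical truth degrees, standing in for the outputs of neural predicates.
BIRD = {"tweety": 1.0, "pingu": 1.0, "rex": 0.0}
FLIES = {"tweety": 0.9, "pingu": 0.1, "rex": 0.0}


def sat_birds_fly():
    """Satisfaction degree of the axiom: for all x, Bird(x) -> Flies(x)."""
    return forall(implies(BIRD[x], FLIES[x]) for x in BIRD)


if __name__ == "__main__":
    print(round(sat_birds_fly(), 3))  # 0.667
```

Because the satisfaction degree is built from differentiable operations, a system in this style can, in principle, adjust its predicates by gradient descent to raise it. Here the penguin-like individual lowers the degree without making the axiom outright false, which is how such frameworks tolerate defaults with exceptions.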

Key Projects, Knowledge Bases, and Datasets

Major Knowledge Bases

The Cyc project, launched in 1984 by Douglas Lenat at Microelectronics and Computer Technology Corporation, represents an early axiomatic effort to manually encode millions of commonsense facts, rules, and ontological concepts into a machine-readable knowledge base for enabling deductive reasoning.[68] By 2025, the proprietary Cyc corpus includes approximately 30 million logical rules, derived from over 2,000 person-years of expert knowledge engineering, emphasizing formal inference over vast domains from physics to everyday activities.[69] Its strengths lie in the depth of logical structure, supporting complex theorem proving and disambiguation in controlled settings, yet it faces criticisms for incompleteness in dynamic, socially nuanced scenarios (such as implicit cultural norms or probabilistic human behaviors) despite investments surpassing $200 million, with limited evidence of scalable real-world reasoning maturity.[69][70]

OpenCyc, the open-source release of a subset of Cyc technology initiated in the early 2000s, provides public access to hundreds of thousands of hierarchically organized terms and microtheories, facilitating research into ontology-driven commonsense without proprietary constraints.[71] This version retains Cyc's emphasis on predicate logic but scales more modestly, serving as a foundation for hybrid AI extensions while inheriting critiques of manual curation's brittleness against evolving knowledge.[72]

ConceptNet, originating in 2004 from MIT's Media Lab, forms a crowdsourced, multilingual knowledge graph connecting concepts via relational triples (e.g., "isA," "usedFor") to capture associative commonsense across semantics, pragmatics, and world knowledge.[45] By 2025, it encompasses millions of edges spanning over 130 languages and 36 relation types, aggregated from sources like WordNet, Wiktionary, and public contributions, enabling broad coverage of linguistic and conceptual links.[73] Achievements include its utility in augmenting natural language processing for inference tasks, but its decentralized assembly yields shallower encodings, often probabilistic assertions lacking rigorous validation, resulting in noise, redundancy, and gaps in specialized or rare-domain depth compared to axiomatically rigorous alternatives.[74]

ATOMIC, introduced in 2019 by researchers at the Allen Institute for AI, constitutes an event-centric repository of 877,000 crowdsourced textual inferences modeling if-then relations for everyday events across nine categories, including causes, effects, and social preconditions.[75] This structure prioritizes temporal and causal dynamics over static facts, achieving dense coverage of plausible event outcomes through human annotation of verb-noun phrases.[76]

COMET, a 2019 transformer model trained on ATOMIC and similar graphs, extends it by generating novel inferences, with human evaluations rating up to 77.5% of outputs as high-quality for ATOMIC-style relations, though extensions like COMET-ATOMIC 2020 reveal variability in generated fidelity, including hallucinations or misalignment with verified commonsense (e.g., partial overlaps with ConceptNet but inconsistent novelty validation).[7][77] These neural-symbolic expansions enhance scalability but introduce dependencies on training data quality, limiting reliability in uncharted event chains.[78]
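
A knowledge graph of relational triples supports simple path-based queries. The sketch below is a minimal illustration, assuming a tiny in-memory triple list; the triples themselves are invented for the example and are not quoted from ConceptNet or any other resource, though the relation names follow the style of labels such as "UsedFor" and "IsA". It uses breadth-first search to find a chain of relations linking two concepts.

```python
from collections import deque

# Invented example triples in (head, relation, tail) form.
TRIPLES = [
    ("knife", "IsA", "tool"),
    ("knife", "UsedFor", "cutting"),
    ("cutting", "Causes", "pieces"),
    ("bread", "AtLocation", "kitchen"),
]


def find_path(triples, start, goal):
    """Return the shortest chain of triples leading from start to goal, or None."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, tail in adjacency.get(node, []):
            if tail not in seen:
                seen.add(tail)
                queue.append((tail, path + [(node, relation, tail)]))
    return None


if __name__ == "__main__":
    for step in find_path(TRIPLES, "knife", "pieces"):
        print(step)
```

A query that the stored edges do not cover, such as a path from "bread" to "pieces", returns nothing at all. That behavior mirrors the coverage gaps discussed above: the graph answers only what someone asserted.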

Benchmarks and Evaluation Datasets

The PIQA dataset, introduced in November 2019, evaluates physical commonsense reasoning by presenting multiple-choice questions that require selecting plausible next steps for goal-directed actions in everyday physical scenarios, such as manipulating objects.[79] Constructed from crowdsourced premises involving intuitive physical interactions, it reveals AI systems' difficulties in simulating causal chains of physical events without rote memorization, as models often select implausible alternatives that violate basic mechanics like object permanence or force dynamics.[79]

HellaSwag, released in May 2019, tests situated commonsense inference through adversarial multiple-choice tasks where models must choose the most plausible sentence continuation for activity video captions with endings removed and foil options generated via language models to mimic errors.[80] This design counters superficial pattern matching, exposing gaps in understanding temporal and social causalities, such as predicting outcomes in human-object interactions, where systems falter on non-stereotypical but logically coherent paths.[80]

Subsets of the GLUE and SuperGLUE benchmarks, including Winograd schema tasks dating to the 2012 Winograd Schema Challenge, probe pronoun resolution and basic physical/social reasoning through paired sentences with ambiguities resolvable only via external world knowledge, such as spatial relations or intentionality.[81] These reveal enduring challenges in disambiguating coreferences without syntactic hints, with adversarial extensions like WinoGrande amplifying drops in accuracy by introducing scale and novelty that test against overfitting to limited priors.[82]

Evaluations from 2023 to 2025 using the ConceptNet knowledge graph for abstract commonsense, such as systematic probing of relational inferences between concepts, highlight persistent causal gaps in linking abstract entities via mechanisms like causation or opposition, favoring objective graph traversal metrics over narrative-based assessments. These datasets underscore the need for benchmarks prioritizing verifiable relational consistency to discern genuine causal modeling from associative shortcuts.
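
Multiple-choice benchmarks of this kind are scored by simple accuracy, and adversarial items are designed so that surface heuristics fail. The sketch below is a hypothetical illustration: the two items are invented in the style of a physical-commonsense question set (they are not drawn from PIQA or HellaSwag), and the "model" is a deliberately shallow word-overlap baseline.

```python
def overlap_baseline(context, options):
    """Pick the option sharing the most words with the context (a shallow cue)."""
    context_words = set(context.lower().split())
    overlaps = [len(context_words & set(option.lower().split())) for option in options]
    return overlaps.index(max(overlaps))


def accuracy(items, choose):
    """Fraction of items for which choose(context, options) matches the answer."""
    correct = sum(
        1 for item in items
        if choose(item["context"], item["options"]) == item["answer"]
    )
    return correct / len(items)


# Invented items; "answer" is the index of the plausible option.
ITEMS = [
    {
        "context": "to open a jar with a tight lid",
        "options": ["run the lid under hot water", "put it in the freezer"],
        "answer": 0,
    },
    {
        "context": "to keep ice cream from melting",
        "options": ["leave the ice cream on the counter", "store it in the freezer"],
        "answer": 1,
    },
]

if __name__ == "__main__":
    print(accuracy(ITEMS, overlap_baseline))  # 0.5
```

The baseline gets the first item right because the correct option happens to repeat a context word, and gets the second wrong because the distractor repeats more of them. Adversarial filtering aims to remove exactly the first kind of item, so that only models with something beyond lexical overlap score above chance.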

Advances in Modern AI Systems

Integration with Large Language Models

Large language models (LLMs) incorporate approximations of commonsense knowledge primarily through massive scaling of parameters and pretraining on datasets exceeding trillions of tokens, enabling statistical inference of everyday patterns without dedicated commonsense modules. Models like GPT-3, released in June 2020 with 175 billion parameters, exhibited initial emergent capabilities in tasks requiring rudimentary commonsense, such as physical intuition or social norms, via in-context learning during inference. Subsequent models, including GPT-4 launched in March 2023, amplified these traits, displaying unpredictable jumps in performance on commonsense benchmarks as scale increased, though such emergences often reflect metric discontinuities rather than qualitative shifts in understanding.[83]

Fine-tuning and prompting strategies further integrate commonsense-like reasoning into LLMs post-2020. Techniques such as chain-of-thought (CoT) prompting, introduced in 2022, elicit step-by-step decomposition for causal and commonsense tasks, boosting accuracy on arithmetic, symbolic, and intuitive physics problems by simulating human-like deliberation without altering core weights.[84] For instance, CoT variants tailored for causal inference, like Causal Chain-of-Prompting in 2024 frameworks, guide LLMs to extract and chain causal graphs from prompts, improving judgments on event dependencies central to commonsense. These methods leverage the models' latent knowledge from pretraining but remain sensitive to prompt phrasing, highlighting reliance on surface-level pattern matching over innate causal mechanisms.

Retrieval-augmented generation (RAG), maturing in 2023 paradigms, augments LLMs with external commonsense knowledge bases like ConceptNet to address parametric gaps during generation. By retrieving relational triples (e.g., "person" enables "walk"), RAG setups dynamically inject structured priors into LLM outputs, enhancing factual consistency in commonsense queries over pure generation.[85] Advantages include scalable access to curated graphs beyond training data, yet drawbacks encompass added inference latency from retrieval overhead and inconsistencies from noisy or irrelevant matches, limiting reliability in real-time applications.[86]

Surveys from 2024–2025 underscore that these integrations yield incremental gains in commonsense handling but fail to deliver a fundamental paradigm shift, as LLMs continue deriving inferences from correlational data distributions rather than robust causal models of the world. Pretraining on vast corpora captures probabilistic associations mimicking commonsense, yet persistent failures in counterfactual or adversarial scenarios reveal shallow encoding without true generalization. Integration efforts thus prioritize practical augmentation over resolving core representational deficits in causal realism.
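
The retrieval step of a RAG pipeline over commonsense triples can be sketched without any model call. The example below is a minimal illustration under stated assumptions: the knowledge base is a small invented list of triples, retrieval is naive word overlap rather than the dense or graph-based retrieval a production system might use, and `build_prompt` is a hypothetical helper that only assembles the text an LLM would receive.

```python
# Invented triples in (head, relation, tail) form, for illustration only.
KB = [
    ("person", "CapableOf", "walk"),
    ("umbrella", "UsedFor", "staying dry"),
    ("rain", "Causes", "wet ground"),
]


def retrieve(question, kb, k=2):
    """Return up to k triples whose head or tail shares a word with the question."""
    words = set(question.lower().replace("?", "").split())
    scored = []
    for head, relation, tail in kb:
        terms = set(head.split()) | set(tail.split())
        overlap = len(words & terms)
        if overlap > 0:
            scored.append((overlap, (head, relation, tail)))
    # Stable sort on the overlap count keeps KB order among ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triple for _, triple in scored[:k]]


def build_prompt(question, kb):
    """Assemble a prompt that places retrieved triples ahead of the question."""
    facts = [f"- {head} {relation} {tail}" for head, relation, tail in retrieve(question, kb)]
    return "Known facts:\n" + "\n".join(facts) + f"\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("Why take an umbrella when rain is forecast?", KB))
```

Both drawbacks mentioned above are visible even at this scale: the retrieval pass is extra work before generation begins, and a lexical matcher would happily return an irrelevant triple that merely shares a word with the question.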

Empirical Performance and Limitations

Large language models demonstrate impressive performance on standardized commonsense benchmarks, often exceeding 85% accuracy on tasks like those in SuperGLUE, which include elements of commonsense inference such as causal and temporal reasoning in datasets like COPA and WSC.[87] For instance, GPT-4 achieved scores around 90-95% on commonsense-heavy subsets like HellaSwag and ARC-Challenge in 2023 evaluations, reflecting strong pattern-matching capabilities trained on vast corpora.[88] Comparable results appear in models like PaLM-2, which similarly topped SuperGLUE leaderboards with gains in commonsense reading comprehension, and recent iterations of Grok, which outperform baselines on analogous reasoning tasks.[89][90]

Despite these benchmark successes, performance reveals stark limitations when probing deeper causal and abstract reasoning, where accuracy often falls to around 50% or below on specialized tests.[91] In benchmarks like ACCESS for abstract causal event discovery, large language models struggle with counterfactual reasoning over causal graphs, even when augmented, highlighting reliance on statistical correlations rather than genuine causal understanding.[59] Similarly, CausalARC tasks, introduced in 2025, expose failures in generating interventions within specified world models, with models exhibiting brittleness beyond memorized patterns.[92]

A core limitation stems from the absence of embodied priors, as AI systems lack the grounded, physical interactions that humans use to build intuitive physics and causality; for example, models routinely err in simulating object permanence or material properties in novel scenarios, mistaking textual descriptions for real-world dynamics.[93] This manifests in systematic failures on physical counterfactuals, such as incorrectly predicting outcomes of hypothetical events like "a glass shattering differently if dropped on carpet versus concrete," where LLMs prioritize superficial linguistic cues over mechanical intuition.[19]

Gary Marcus and Ernest Davis emphasized in 2025 that commonsense reasoning remains unsolved after 70 years of AI research, with large models prone to confabulations in unscripted domains, debunking narratives of near-resolution despite benchmark hype.[19] These gaps underscore that high scores on static tests do not equate to robust, human-like commonsense, as models falter on compositional or out-of-distribution challenges requiring causal realism.[94]

Applications and Real-World Deployment

Current Uses Across Domains

In natural language processing, commonsense knowledge enhances question answering and story generation by providing contextual inferences beyond explicit text. The COMET model, introduced in 2019, generates diverse commonsense descriptions from input events, enabling automatic knowledge graph expansion for tasks like narrative completion.[95] Extensions such as TaCOMET, developed in 2024, incorporate temporal awareness to produce time-controlled inferences, improving dialogue systems and event sequencing in stories.[96] Similarly, GD-COMET adapts culturally nuanced knowledge for global NLP applications, boosting performance in cross-lingual QA.[97]

In robotics, commonsense priors guide navigation and object manipulation by integrating semantic expectations with sensory data. The SEEK framework, proposed in 2024, combines prior spatial knowledge with commonsense reasoning to enable probabilistic object goal navigation in real-world environments, prioritizing likely locations based on everyday object affordances. Commonsense-aware approaches, such as object value graphs, further refine high-level planning by scoring subgoals according to relational priors, enhancing exploration efficiency in cluttered scenes.[98]

Vision-language models leverage commonsense for scene grounding, aligning textual descriptions with visual elements through inferential knowledge. Multimodal systems like those in MAGIC-VQA, from 2025, use ATOMIC-derived if-then reasoning to perform grounded entailment in video QA, linking visual events to plausible outcomes.[99] SceneVerse, scaled in 2024 for 3D environments, incorporates commonsense alignments to improve object detection and spatial querying, facilitating robust scene understanding in embodied AI.

These integrations yield empirical gains, such as reduced inference errors in event prediction tasks by embedding relational priors, with studies showing up to 25% accuracy improvements in narrative why-question answering via external commonsense augmentation.[100]

Barriers to Effective Implementation

AI systems augmented with commonsense knowledge encounter scalability barriers in real-world deployment due to the substantial computational overhead of inference, which involves querying vast knowledge structures and simulating contextual reasoning in real time. This resource intensity restricts deployment on edge devices or in latency-sensitive applications, such as mobile robotics, where processing delays can render systems impractical.[30][101]

Brittleness poses a core implementation challenge, as these systems often fail in dynamic environments lacking comprehensive training coverage, particularly when physical grounding is absent, leading to errors in handling edge cases like unexpected obstacles or altered spatial configurations. In robotics, for instance, agents trained on conventional setups may navigate standard kitchens effectively but collapse when confronted with minor deviations, such as a displaced appliance, due to an inability to intuitively apply priors like object stability or interaction dynamics.[30][102][103]

Safety concerns arise from the potential for unsafe outputs stemming from incomplete commonsense, including hallucinated recommendations that overlook fundamental causal mechanisms, thereby risking user harm in advisory or control contexts. Deployment reports from 2025 indicate persistent vulnerabilities, with models exhibiting unpredictable responses in novel scenarios that demand innate world understanding.[104][101]

Case studies in autonomous systems illustrate these risks, where failures to incorporate physics priors (such as trajectory prediction under perturbations) have precipitated incidents, including misrecognition of altered traffic signage or pedestrian behaviors, underscoring the gap between simulated performance and robust field operation.[105][106][30]

Criticisms and Controversies

Technical and Philosophical Shortcomings

Current AI systems exhibit significant technical deficiencies in commonsense reasoning, primarily due to their reliance on statistical pattern matching rather than explicit causal models. In a February 2024 study by the Information Sciences Institute (ISI), large language models (LLMs) demonstrated persistent errors in everyday question-answering tasks, confidently preferring incorrect answers over plausible alternatives, indicating a lack of genuine probabilistic or causal inference.[107] This shortfall stems from the absence of built-in causal reasoning mechanisms; systems trained on vast corpora capture correlations but falter when required to infer unstated causes or effects in novel scenarios, as evidenced by failures on benchmarks like physical dynamics or social inference.[1]

Proponents of machine learning scalability posit that increasing model size, data volume, and compute will eventually yield robust emergent commonsense through denser pattern representations, drawing on observed gains in narrower tasks.[101] However, empirical evidence counters this optimism: as detailed in analyses by Ernest Davis and Gary Marcus, core commonsense deficits (such as intuitive physics, object persistence, and agent intentions) have endured across seven decades of AI research and multiple scaling epochs, with 2024-2025 benchmarks revealing no qualitative breakthroughs despite parameter counts exceeding trillions.[19] A 2025 survey of commonsense benchmarks underscores these "scalability walls," where performance plateaus on tasks demanding abstraction beyond training distributions, highlighting the limits of brute-force approaches without hybrid symbolic integration.[5]

Philosophically, John Searle's Chinese Room argument remains pertinent, illustrating that syntactic manipulation of symbols, as in LLMs processing tokens, does not equate to semantic understanding or intentionality, a critique amplified in modern contexts where AI generates coherent outputs without grasping referential meaning or contextual causality.[108] This aligns with embodied cognition theories, where George Lakoff's 1987 framework posits commonsense as rooted in sensorimotor experiences and metaphorical mappings from physical interactions, rendering disembodied AI inherently limited in acquiring grounded causal knowledge.[109] Davis and Marcus extend this by arguing that true commonsense necessitates innate priors and embodied constraints absent in current architectures, perpetuating failures in causal realism despite computational mimicry.[1]

Overhype and Systemic Failures in Progress Claims

Media and industry proponents frequently portrayed the release of GPT-4 in March 2023 as a breakthrough toward human-like intelligence, including claims of "sparks of artificial general intelligence" demonstrated through novel problem-solving and commonsense inference in early experiments.[110] However, these assertions overstated capabilities, as GPT-4 and subsequent large language models (LLMs) rely on statistical pattern matching from vast training data rather than genuine causal understanding or robust commonsense reasoning, leading to persistent failures on targeted benchmarks like HellaSwag, which tests everyday inference.[111] Critics such as Gary Marcus have highlighted that, despite scaling compute and data, LLMs exhibit an "illusion of intelligence," failing embarrassingly on newer evaluations of understanding that probe beyond memorized patterns, with core challenges in commonsense acquisition remaining unsolved after 70 years of AI research.[19][112]

Systemic pressures exacerbate these gaps, as venture funding and corporate investments disproportionately favor brute-force scaling of transformer architectures, exemplified by billions poured into models like the GPT series, over rigorous hybrid approaches integrating symbolic reasoning or causal models that could address commonsense deficits. This bias, often amplified by tech industry affiliations undisclosed in research, prioritizes superficial metric improvements on contaminated benchmarks over verifiable progress in reliability, resulting in models that score high on average but collapse on adversarial or real-world commonsense tasks due to spurious correlations.[113] Academic and media narratives, influenced by institutional optimism, normalize concepts like "emergent abilities" to justify continued scaling despite empirical evidence of brittleness, sidelining data showing no proportional gains in causal realism for commonsense domains.[114]

Debates over progress timelines underscore these failures, with optimists forecasting AGI (implying mastery of commonsense) within the decade based on benchmark trends, while skeptics cite stagnant performance on core reasoning tests and historical overpromises to argue for decades-long horizons or fundamental paradigm shifts.[115] Surveys of AI experts in 2025 place the median estimate of AGI arrival around 2040-2050 with high uncertainty, privileging data over hype as LLMs continue failing systematic evaluations of commonsense, such as distinguishing real-world physics from software artifacts.[116][19] This divide reflects not just technical disagreement but evidentiary caution, as recent analyses reveal benchmark pollution masking true stagnation in acquiring human-like intuitive knowledge.[117]

Future Directions

Promising Research Trajectories

Neuro-symbolic approaches integrate neural networks with symbolic structures, such as knowledge graphs (KGs), to enhance commonsense reasoning by combining probabilistic pattern recognition with explicit logical inference. In 2025, hybrid inference systems that augment LLMs with KGs have demonstrated improved factual accuracy and explainability, addressing limitations in pure transformer-based models that struggle with consistent rule application in commonsense scenarios. For instance, enhancing LLMs with KGs during pre-training facilitates structured knowledge acquisition, yielding up to 15-20% gains in logical entailment tasks over baseline LLMs, as evaluated in design pattern studies.[118] Gartner's 2025 AI Hype Cycle positions neuro-symbolic AI as a maturing technology for scalable symbolic-neural fusion, enabling systems to output interpretable causal chains relevant to everyday commonsense queries.[119]

Embodied AI trajectories leverage simulation environments and robotics datasets to ground abstract commonsense in physical interactions, fostering models that predict object affordances and spatial relations from real-world data. Post-2023 datasets, including those from mobile manipulators, have enabled world models that simulate embodied agents, achieving denser annotations for language-to-3D mapping with datasets like 3D-GRAND released in June 2025, which supports training on over 10 million annotated scenes for robotic grounding.[120] Comprehensive surveys highlight how these simulation worlds reduce sim-to-real gaps, with embodied models outperforming disembodied LLMs by 25-30% in physical commonsense benchmarks involving manipulation and navigation.[121]

Causal-focused methods, incorporating Judea Pearl's ladder of causation into KGs and transformers, promise to elevate commonsense from associative correlations to interventional understanding, such as predicting outcomes from hypothetical actions. Recent 2025 works integrate causal graphs into transformer layers for dynamic updates, enabling Level 2 (interventional) reasoning without full interventions, as shown in frameworks achieving 10-15% higher accuracy on counterfactual commonsense tasks compared to standard fine-tuning.[122] Axiomatic training on causal rules further embeds Pearl's hierarchy, allowing transformers to climb from observational to counterfactual rungs, with empirical results from ICML 2025 posters validating robustness in sparse-data commonsense domains like social inference.[123]
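
The top rung of Pearl's ladder, counterfactual reasoning, is usually computed in three steps: abduction (infer the unobserved background factors from what was seen), action (set the variable of interest to its hypothetical value), and prediction (recompute the outcome). The sketch below works through these steps for a single made-up structural equation; the equation, its coefficient, and the plant-growth story are illustrative assumptions and do not come from the cited frameworks.

```python
WATER_EFFECT = 2  # assumed coefficient: height gained per unit of water


def structural_height(water, soil):
    """Hypothetical structural equation: height = 2 * water + soil quality."""
    return WATER_EFFECT * water + soil


def counterfactual_height(observed_water, observed_height, new_water):
    """Height the same plant would have reached had it received new_water.

    Step 1 (abduction): recover the unobserved soil term from the observation.
    Step 2 (action): replace the water level with the hypothetical one.
    Step 3 (prediction): re-evaluate the structural equation.
    """
    soil = observed_height - WATER_EFFECT * observed_water
    return structural_height(new_water, soil)


if __name__ == "__main__":
    # Observed: 1 unit of water, height 3. Question: what if it had received 2 units?
    print(counterfactual_height(1, 3, 2))  # 5
```

The abduction step is what separates this from an interventional query: the answer is specific to the individual plant whose soil term was inferred, not an average over all plants. Doing this requires an explicit structural equation, which is the ingredient purely associational models lack.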

Fundamental Open Questions

A core unresolved question in AI commonsense knowledge is whether disembodied systems (those trained solely on textual data, without physical embodiment) can achieve genuine causal understanding of the world, as opposed to mere statistical correlation. Causal realism demands a grasp of interventions and counterfactuals, such as predicting the outcomes of hypothetical actions in physical environments. Current large language models (LLMs) often fail at this because they rely on patterns in observational data rather than on experienced causality.[124] For instance, LLMs readily associate "dropping a glass" with "breaking" from co-occurrence in training corpora but struggle with novel causal chains involving unseen interactions, which points to a disconnect from first-principles physical dynamics.[125] The limitation persists even in scaled models: empirical tests reveal recurring errors on tasks that require causal inference beyond memorized heuristics.[126]

Debate continues over whether further scaling of parameters and data suffices for commonsense acquisition or whether architectural shifts, such as embodiment through robotic interaction, are indispensable. Proponents of scaling argue that emergent abilities arise from vast compute, yet 2025 analyses indicate diminishing returns on abstract reasoning, with performance on certain causal tasks worsening as model size increases, a phenomenon termed inverse scaling.[101] Conversely, embodiment advocates posit that physical grounding is essential for intuitive causal realism, because disembodied training lacks the sensorimotor feedback humans use to build commonsense ontologies; experiments with embodied agents show improved physical prediction but remain narrow and data-hungry.[15] No consensus exists, and the lack of evidence for long-term causal generalization suggests that architecture may matter more than brute scale.[127]

Measuring true commonsense remains elusive, because prevailing benchmarks such as multiple-choice questionnaires (e.g., CommonsenseQA) assess superficial plausibility rather than robust, context-adaptive reasoning in dynamic settings. Real-world interaction tests, such as embodied navigation or multi-step causal planning, reveal brittleness: models falter on perturbations that require on-the-fly causal adjustment, unlike human commonsense, which derives from lifelong embodiment.[128] Proposed alternatives, including agentic simulations with verifiable outcomes, expose annotation noise and failures to capture tacit physical priors, yet no standardized metric connects laboratory results to causal fidelity.[129][5]

If disembodied AI cannot surmount these barriers, progress toward artificial general intelligence (AGI) may be capped, confining systems to pattern-matching that lacks genuine world-modeling. Gaps in causal reasoning, evident in 2025 benchmarks, suggest that commonsense may require hybrid bio-inspired paradigms or may prove inherently substrate-dependent, in which case pure computational scaling would be insufficient for AGI-level robustness.[101] Such uncertainties call for scrutiny of overreliance on proxy metrics and for prioritizing causal validation over benchmark saturation.[130]
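
The measurement concern above, that multiple-choice accuracy can mask brittleness under perturbation, can be illustrated with a simple scoring comparison. This is a minimal sketch under stated assumptions: the record format, the `group` field linking an item to its perturbed variants, and the example predictions are all hypothetical and do not correspond to any cited benchmark.

```python
from collections import defaultdict


def plain_accuracy(records):
    """Fraction of individual items answered correctly."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


def group_consistent_accuracy(records):
    """Fraction of groups in which every variant is answered correctly.

    A group holds one original question plus its perturbed variants, so a
    group only counts if the model is right on all of them.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["group"]].append(r["pred"] == r["gold"])
    return sum(all(hits) for hits in groups.values()) / len(groups)


# Invented predictions: each group pairs an original item with one perturbation.
RECORDS = [
    {"group": "glass", "variant": "original",  "gold": "B", "pred": "B"},
    {"group": "glass", "variant": "perturbed", "gold": "B", "pred": "B"},
    {"group": "door",  "variant": "original",  "gold": "A", "pred": "A"},
    {"group": "door",  "variant": "perturbed", "gold": "C", "pred": "A"},
    {"group": "cup",   "variant": "original",  "gold": "D", "pred": "D"},
    {"group": "cup",   "variant": "perturbed", "gold": "A", "pred": "D"},
]

if __name__ == "__main__":
    print(f"plain accuracy:            {plain_accuracy(RECORDS):.2f}")
    print(f"group-consistent accuracy: {group_consistent_accuracy(RECORDS):.2f}")
```

On these invented records the per-item score is 4 of 6 while only 1 of 3 groups is answered consistently, which is the kind of gap a per-item proxy metric would hide. Requiring correctness across all variants is one possible design choice among several; it is strict, and a partial-credit variant might suit noisier perturbation sets better.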

References
