Multimodality
from Wikipedia
Example of multimodality: A televised weather forecast (medium) involves understanding spoken language, written language, weather-specific language (such as temperature scales), geography, and symbols (clouds, sun, rain, etc.).

Multimodality is the application of multiple literacies within one medium. Multiple literacies or "modes" contribute to an audience's understanding of a composition.[1] Everything from the placement of images to the organization of the content to the method of delivery creates meaning. This reflects a shift away from isolated text as the primary source of communication and toward more frequent use of the image in the digital age.[2] Multimodality describes communication practices in terms of the textual, aural, linguistic, spatial, and visual resources used to compose messages.[3]

While all communication, literacy, and composing practices are and always have been multimodal,[4] academic and scientific attention to the phenomenon only started gaining momentum in the 1960s. Work by Roland Barthes and others has led to a broad range of disciplinarily distinct approaches. More recently, rhetoric and composition instructors have included multimodality in their coursework. In their position statement on Understanding and Teaching Writing: Guiding Principles, the National Council of Teachers of English state that "'writing' ranges broadly from written language (such as that used in this statement), to graphics, to mathematical notation."[5]

Definition

Although discussions of multimodality mention both medium and mode, these terms are not synonymous. However, they may overlap depending on how precisely (or not) individual authors and traditions use them.

Gunther Kress's scholarship on multimodality is canonical in social semiotic approaches and has considerable influence in many other approaches, such as writing studies. Kress defines 'mode' in two ways. One: a mode is a type of material resource that is socially or culturally shaped to make meaning. Images, writing, speech and gesture are all examples of modes.[6] Two: modes are semiotic, shaped by intrinsic characteristics and their potential within their medium, as well as by what is required of them by their culture or society.[7]

Thus, every mode has distinct historical and cultural potentials and limitations for its meaning.[8] For example, if we broke writing down into its modal resources, we would have grammar, vocabulary, and graphic "resources" as the acting modes. Graphic resources can be further broken down into font size, typeface, color, spacing within paragraphs, etc. However, these resources are not deterministic. Instead, modes shape and are shaped by the systems in which they participate. Modes may aggregate into multimodal ensembles and be shaped over time into familiar cultural forms. A good example of this is film, which combines visual modes (in setting and in attire), modes of dramatic action and speech, and modes of music or other sounds. Studies of multimodal work in this field include van Leeuwen;[9] Bateman and Schmidt;[10] and Burn and Parker's theory of the Kineikonic Mode.[11]

In social semiotic accounts, a medium is the substance in which meaning is realized and through which it becomes available to others. Mediums include video, image, text, audio, etc. Socially, a medium includes semiotic, sociocultural, and technological practices; examples include film, newspapers, billboards, radio, television, a classroom, etc. Multimodality also makes use of the electronic medium by creating digital modes through the interlacing of image, writing, layout, speech, and video. Mediums have become modes of delivery that take current and future contexts into account.

History

Multimodality (as a phenomenon) has received increasingly systematic theoretical characterization throughout the history of communication. Indeed, the phenomenon has been studied at least since the 4th century BC, when classical rhetoricians alluded to it with their emphasis on voice, gesture, and expression in public speaking.[12][13] However, the term was not defined with significance until the 20th century, when a rapid rise in technology created many new modes of presentation. Multimodality has since become standard in the 21st century, applying to various network-based forms such as art, literature, social media and advertising. The monomodality, or singular mode, which used to define the presentation of text on a page has been replaced with more complex and integrated layouts. John A. Bateman says in his book Multimodality and Genre, "Nowadays… text is just one strand in a complex presentational form that seamlessly incorporates visual aspect 'around,' and sometimes even instead of, the text itself."[14] Multimodality has quickly become "the normal state of human communication."[4]

Expressionism

During the 1960s and 1970s, many writers looked to photography, film, and audiotape recordings in order to discover new ideas about composing.[15] This led to a resurgence of focus on sensory, self-expressive composition known as expressionism. Expressionist ways of thinking encouraged writers to find their voice outside of language by placing it in a visual, oral, spatial, or temporal medium.[16] Donald Murray, who is often linked to expressionist methods of teaching writing, once said, "As writers it is important that we move out from that which is within us to what we see, feel, hear, smell, and taste of the world around us. A writer is always making use of experience." Murray instructed his writing students to "see themselves as cameras" by writing down every single visual observation they made for one hour.[16] Expressionist thought emphasized personal growth and linked the art of writing with all visual art by calling both a type of composition. By making writing the result of a sensory experience, expressionists also defined writing as a multisensory experience and asked that it be free to be composed across all modes, tailored for all five senses.

Cognitive developments

During the 1970s and 1980s, multimodality was further developed through cognitive research about learning. Jason Palmeri cites researchers such as James Berlin and Joseph Harris as being important to this development; Berlin and Harris studied alphabetic writing and how its composition compared to art, music, and other forms of creativity.[16] Their research took a cognitive approach, studying how writers thought about and planned their writing process. James Berlin declared that the process of composing writing could be directly compared to that of designing images and sound.[17] Furthermore, Joseph Harris pointed out that alphabetic writing is the result of multimodal cognition: writers often conceptualize their work by non-alphabetic means, through visual imagery, music, and kinesthetic feelings.[18] This idea was reflected in the popular research of Neil D. Fleming on what are commonly known as neuro-linguistic learning styles. Fleming's three styles of auditory, kinesthetic, and visual learning helped to explain the modes in which people were best able to learn, create, and interpret meaning. Other researchers such as Linda Flower and John R. Hayes theorized that alphabetic writing, though a principal modality, sometimes could not convey the non-alphabetic ideas a writer wished to express.[19]

Audience

Every text has its own defined audience, and its composer makes rhetorical decisions to improve the audience's reception of that text. In this same manner, multimodality has evolved into a sophisticated way to appeal to a text's audience. In 1984, Lisa Ede and Andrea Lunsford provided a framework for discussing audience that included addressed audiences and invoked, or imagined, audiences. While conversations surrounding audience have continued since 1984, it remains important to consider the role of audience (whether addressed or invoked) while engaging in multimodal composing.

Relying upon the canons of rhetoric in a different way than before, multimodal texts have the ability to address a larger, yet more focused, intended audience. Multimodality does more than solicit an audience; the effects of multimodality are embedded in an audience's semiotic, generic, and technological understanding.

Psychological effects

The appearance of multimodality, at its most basic level, can change the way an audience perceives information. The most basic understanding of language comes via semiotics – the association between words and symbols. A multimodal text changes its semiotic effect by placing words with preconceived meanings in a new context, whether that context is audio, visual, or digital. This in turn creates a new, foundationally different meaning for an audience. Bezemer and Kress, two scholars of multimodality and semiotics, argue that students understand information differently when text is delivered in conjunction with a secondary medium, such as image or sound, than when it is presented in alphanumeric format only, because the combination draws a viewer's attention to "both the originating site and the site of recontextualization".[20] Meaning is moved from one medium to the next, which requires the audience to redefine their semiotic connections. Recontextualizing an original text within other mediums creates a different sense of understanding for the audience, and this new type of learning can be controlled by the types of media used.

Multimodality also can be used to associate a text with a specific argumentative purpose, e.g., to state facts, make a definition, cause a value judgment, or make a policy decision. Jeanne Fahnestock and Marie Secor, professors at the University of Maryland and the Pennsylvania State University, labeled the fulfillment of these purposes stases.[21] A text's stasis can be altered by multimodality, especially when several mediums are juxtaposed to create an individualized experience or meaning. For example, an argument that mainly defines a concept is understood as arguing in the stasis of definition; however, it can also be assigned a stasis of value if the way the definition is delivered equips writers to evaluate a concept, or judge whether something is good or bad. If the text is interactive, the audience is able to create its own meaning from the perspective the multimodal text provides. By emphasizing different stases through the use of different modes, writers are able to further engage their audience in creating comprehension.

Genre effects

Multimodality also obscures an audience's concept of genre by creating gray areas out of what was once black and white. Carolyn R. Miller, a distinguished professor of rhetoric and technical communication at North Carolina State University, observed in her genre analysis of the weblog how genre shifted with the invention of blogs, stating that "there is strong agreement on the central features that make a blog a blog." Miller defines blogs on the basis of their reverse chronology, frequent updating, and combination of links with personal commentary.[22] However, the central features of blogs are obscured when considering multimodal texts. Some features are absent, such as the ability for posts to be independent of each other, while others are present. This creates a situation in which the genre of multimodal texts is impossible to define; rather, the genre is dynamic, evolutionary, and ever-changing.

The delivery of new texts has radically changed along with technological influence. Composition now consists of the anticipation of future remediation. Writers think about the type of audience a text will be written for, and anticipate how that text might be reformed in the future. Jim Ridolfo coined the term rhetorical velocity to explain a conscious concern for the distance, speed, time, and travel it will take for a third party to rewrite an original composition.[23] The use of recomposition allows for an audience to be involved in a public conversation, adding their own intentionality to the original product. This new method of editing and remediation is attributed to the evolution of digital text and publication, giving technology an important role in writing and composition.

Technological effects

Multimodality has evolved along with technology. This evolution has created a new concept of writing, a collaborative context keeping the reader and writer in relationship. The concept of reading is different with the influence of technology due to the desire for a quick transmission of information. In reference to the influence of multimodality on genre and technology, Professor Anne Frances Wysocki expands on how reading as an action has changed in part because of technology reform: "These various technologies offer perspectives for considering and changing approaches we have inherited to composing and interpreting pages....".[24] Along with the interconnectedness of media, computer-based technologies are designed to make new texts possible, influencing rhetorical delivery and audience.

Education

Multimodality in the 21st century has caused educational institutions to consider changing the traditional aspects of classroom education. With a rise in digital and Internet literacy, new modes of communication are needed in the classroom in addition to print, from visual texts to digital e-books. Rather than replacing traditional literacy values, multimodality augments and increases literacy for educational communities by introducing new forms. According to Miller and McVee, authors of Multimodal Composing in Classrooms, "These new literacies do not set aside traditional literacies. Students still need to know how to read and write, but new literacies are integrated."[25] The learning outcomes of the classroom stay the same, including – but not limited to – reading, writing, and language skills. However, these learning outcomes are now being presented in new forms, as multimodality in the classroom suggests a shift from traditional media, such as paper-based text, to more modern media, such as screen-based texts. The choice to integrate multimodal forms in the classroom is still controversial within educational communities. The idea of learning has changed over the years and now, some argue, must adapt to the personal and affective needs of new students. In order for classroom communities to be legitimately multimodal, all members of the community must share expectations about what can be done through integration, requiring a "shift in many educators' thinking about what constitutes literacy teaching and learning in a world no longer bound by print text."[26]

Multiliteracy

Multiliteracy is the concept of understanding information through various methods of communication and being proficient in those methods. With the growth of technology, there are more ways to communicate than ever before, making it necessary for our definition of literacy to change in order to better accommodate these new technologies. These new technologies include tools such as text messaging, social media, and blogs.[27] Such modes of communication often employ multiple mediums simultaneously, such as audio, video, pictures, and animation, thus making content multimodal.

The culmination of these different mediums is what is called content convergence, which has become a cornerstone of multimodal theory.[28] Within modern digital discourse, content has become widely accessible, remixable, and easily spreadable, allowing ideas and information to be consumed, edited, and improved by the general public.[28] Wikipedia is one example: the platform allows free consumption and authorship of its work, which in turn facilitates the spread of knowledge through the efforts of a large community. It creates a space in which authorship has become collaborative and the product of that authorship is improved by the collaboration. As the distribution of information has grown through this process of content convergence, it has become necessary for our understanding of literacy to evolve with it.[28]

The shift away from written text as the sole mode of nonverbal communication has caused the traditional definition of literacy to evolve.[29] While text and image may exist separately, digitally, or in print, their combination gives birth to new forms of literacy and thus, a new idea of what it means to be literate. Text, whether it is academic, social, or for entertainment purposes, can now be accessed in a variety of different ways and edited by several individuals on the Internet. In this way texts that would typically be concrete become amorphous through the process of collaboration. The spoken and written word are not obsolete, but they are no longer the only way to communicate and interpret messages.[29] Many mediums can be used separately and individually. Combining and repurposing one mode of communication for another has contributed to the evolution of different literacies.

Communication is spread across a medium through content convergence, such as a blog post accompanied by images and an embedded video. This idea of combining mediums gives new meaning to the concept of translating a message. The culmination of varying forms of media allows for content to be either reiterated or supplemented by its parts. This reshaping of information from one mode to another is known as transduction.[29] As information changes from one mode to the next, our comprehension of its message is attributed to multiliteracy. Xiaolo Bao defines three successive learning stages that make up multiliteracy: the Grammar-Translation Method, the Communicative Method, and the Task-Based Method. Simply put, they can be described as the fundamental understanding of syntax and its function, the practice of applying that understanding to verbal communication, and lastly, the application of those textual and verbal understandings to hands-on activities. In an experiment conducted by the Canadian Center of Science and Education, students were placed either in a classroom with a multimodal course structure or in a classroom with a standard course structure as a control group. Tests were administered throughout the two courses, with the multimodal course concluding in a higher learning success rate and a reportedly higher rate of satisfaction among students. This indicates that applying multimodality to instruction yields overall better results in developing multiliteracy than conventional forms of learning when tested in real-life scenarios.[30]

Classroom literacy

Multimodality in classrooms has brought about the need for an evolving definition of literacy. According to Gunther Kress, a popular theorist of multimodality, literacy usually refers to the combination of letters and words to make messages and meaning and can often be attached to other words in order to express knowledge of the separate fields, such as visual- or computer-literacy. However, as multimodality becomes more common, not only in classrooms, but in work and social environments, the definition of literacy extends beyond the classroom and beyond traditional texts. Instead of referring only to reading and alphabetic writing, or being extended to other fields, literacy and its definition now encompass multiple modes. It has become more than just reading and writing, and now includes visual, technological, and social uses among others.[29]

Georgia Tech's writing and communication program created a definition of multimodality based on the acronym WOVEN.[31] The acronym explains how communication can be written, oral, visual, electronic, and nonverbal. Communication has multiple modes that can work together to create meaning and understanding. The goal of the program is to ensure students are able to communicate effectively in their everyday lives using various modes and media.[31]

As classroom technologies become more prolific, so do multimodal assignments. Students in the 21st century have more options for communicating digitally, be it texting, blogging, or posting on social media.[32] This rise in computer-mediated communication has required classes to become multimodal in order to teach students the skills required in the 21st-century work environment.[32] In the classroom setting, however, multimodality is more than just combining multiple technologies; rather, it is creating meaning through the integration of multiple modes. Students learn through a combination of these modes, including sound, gestures, speech, images and text. For example, digital components of lessons often include pictures, videos, and sound bites as well as text to help students grasp a better understanding of the subject. Multimodality also requires that teachers move beyond teaching with just text, as the printed word is only one of many modes students must learn and use.[29][32][33]

The application of visual literacy in the English classroom can be traced back to 1946, when the instructor's edition of the popular Dick and Jane elementary reader series suggested teaching students to "read pictures as well as words" (p. 15).[34] During the 1960s, a couple of reports issued by the National Council of Teachers of English suggested using television and other mass media, such as newspapers, magazines, radio, motion pictures, and comic books, in the English classroom. The situation is similar in postsecondary writing instruction: since 1972, visual elements have been incorporated into some popular twentieth-century college writing textbooks, such as James McCrimmon's Writing with a Purpose.[34]

Higher education

Colleges and universities around the world are beginning to use multimodal assignments to adapt to the technology currently available. Assigning multimodal work also requires professors to learn how to teach multimodal literacy. Implementing multimodality in higher education is being researched to find out the best way to teach and assign multimodal tasks.[33]

Multimodality in the college setting can be seen in an article by Teresa Morell, who discusses how teaching and learning elicit meaning through modes such as language, speaking, writing, gesturing, and space. The study observes an instructor who conducts a multimodal group activity with students. Previous studies had observed different classes using modes such as gestures, classroom space, and PowerPoint slides; the current study observes an instructor's combined use of multiple modes in teaching to see its effect on student participation and conceptual understanding. Morell describes the different spaces of the classroom, including the authoritative space, interactional space, and personal space. The analysis shows how an instructor's multimodal choices shape student participation and understanding. On average the instructor used three to four modes, most often some combination of gaze, gesture, and speech. He engaged students by having them formulate a group definition of cultural stereotypes. It was found that those learning a second language depend on more than just the spoken and written word for conceptual learning, suggesting that multimodal education has benefits.[35][33]

Multimodal assignments involve many aspects other than written words, which may be beyond an instructor's training. Educators have been taught how to grade traditional assignments, but not those that utilize links, photos, videos or other modes. Dawn Lombardi is a college professor who admitted to her students that she was a bit "technologically challenged" when assigning a multimodal essay using graphics. The most difficult part of such assignments is the assessment. Educators struggle to grade these assignments because the meaning conveyed may not be what the student intended. They must return to the basics of teaching to determine what they want their students to learn, achieve, and demonstrate in order to create criteria for multimodal tasks. Lombardi created grading criteria based on creativity, context, substance, process, and collaboration, which were presented to the students prior to beginning the essay.[33]

Another type of visuals-related writing task is visual analysis, especially advertising analysis, which began in the 1940s and has been prevalent in postsecondary writing instruction for at least 50 years.[34] This pedagogical practice of visual analysis did not focus on how visuals, including images, layout, or graphics, are combined or organized to make meanings.[34]

In the following years, the application of visuals in the composition classroom continued to be explored, and the emphasis shifted to the visual features of composition—margins, page layout, font, and size—and its relationship to graphic design, web pages, and digital texts, which involve images, layout, color, font, and arrangements of hyperlinks. In line with the New London Group, George (2002) argues that both visual and verbal elements are crucial in multimodal designs.[34]

Acknowledging the importance of both language and visuals in communication and meaning making, Shipka (2005) further advocates for a multimodal, task-based framework in which students are encouraged to use diverse modes and materials—print texts, digital media, videotaped performances, old photographs—and any combinations of them in composing their digital/multimodal texts. Meanwhile, students are provided with opportunities to deliver, receive, and circulate their digital products. In so doing, students can understand how systems of delivery, reception, and circulation interrelate with the production of their work.[36]

Multimodal communities

Multimodality has significance within varying communities, such as the private, public, educational, and social communities. Because of multimodality, the private domain is evolving into a public domain in which certain communities function. Because social environments and multimodality mutually influence each other, each community is evolving in its own way. This evolution is evident in the language, as discussed by Grifoni, D'Ulizia, and Ferri in their work.[37]

Cultural multimodality

Based on these representations, communities decide through social interaction how modes are commonly understood. In the same way, these assumptions and determinations of the way multimodality functions can actually create new cultural and social identities. For example, Bezemer and Kress define modes as "socially and culturally shaped resource[s] for making meaning." According to Bezemer, "In order for something to 'be a mode,' there needs to be a shared cultural sense within a community of a set of resources and how these can be organized to realize meaning."[38] Cultures that pull from different or similar resources of knowledge, understanding, and representations will communicate through different or similar modes.[20] Signs, for instance, are visual modes of communication determined by our daily necessities.

In her dissertation, Elizabeth J. Fleitz, a PhD in English with a concentration in rhetoric and writing from Bowling Green State University, argues that the cookbook, which she describes as inherently multimodal, is an important feminist rhetorical text.[39] According to Fleitz, women were able to form relationships with other women through communicating in socially acceptable literature like cookbooks; "As long as the woman fulfills her gender role, little attention is paid to the increasing amount of power she gains in both the private and public spheres." Women who would have been committed to staying at home could become published authors, gaining a voice in a phallogocentric society without being viewed as threats. Women revised and adapted different modes of writing to fit their own needs. According to Cinthia Gannett, author of "Gender and the Journal," diary writing, which evolved from men's journal writing, has "integrate[d] and confirm[ed] women's perceptions of domestic, social, and spiritual life, and invoke a sense of self."[40] It is these methods of remediation that characterize women's literature as multimodal. The recipes inside the cookbooks also qualify as multimodal: recipes delivered through any medium, whether a cookbook or a blog, can be considered multimodal because of the "interaction between body, experience, knowledge, and memory, multimodal literacies" that all relate to one another to create our understanding of the recipe. Recipe exchanging is an opportunity for networking and social interaction. According to Fleitz, "This interaction is undeniably multimodal, as this network 'makes do' with alternative forms of communication outside dominant discursive methods, in order to further and promote women's social and political goals." Cookbooks are only a single example of the capacity of multimodality to build community identities, but they aptly demonstrate its nuanced aspects. Multimodality does not just encompass tangible components, such as text, images, and sound, but also draws from experiences, prior knowledge, and cultural understanding.

Another change that has occurred due to the shift from the private environment to the public is audience construction.[41] In the privacy of the home, the family generally targets a specific audience: family members or friends. Once family photographs become public, an entirely new audience is addressed. As Pauwels notes, "the audience may be ignored, warned and offered apologies for the trivial content, directly addressed relating to personal stories, or greeted as highly appreciated publics that need to be entertained and invited to provide feedback."[41]

Multimodal academic writing practices

In everyday life, multimodal construction and communication of meaning is ubiquitous. However, academic writing has maintained an overwhelming dominance of the linguistic resource up to the present (Blanca, 2015). The need to open the game to other possible forms of writing in the academy lies in the conviction that the semiotic resources used in the processes of academic inquiry and communication have an impact on the findings (Sousanis, 2015), since both processes are linked in the epistemic potential of writing, understood here in multimodal terms. The idea, therefore, is not to "embellish" academic discourse with illustrative visual resources, but rather to enable other ways of thinking, new associations and, ultimately, new knowledge, arising from the interweaving of various verbal and nonverbal modes. The strategic use of page design, the juxtaposition of text in columns or of text and image, and the use of typography (in type, size, color, etc.) are just a few examples of how the semiotic potential of the genres of academic circulation can be exploited. This is linked to the possibilities of enriching the forms of academic writing by appealing to non-linear as well as linear textual development, and by placing image and text in tension in their endless possibilities of creating meaning (Mussetta, Siragusa & Vottero, 2020;[42] Lamela Adó & Mussetta, 2020;[43] Mussetta, Lamela Adó & Peixoto, 2021[44]).

Multimodal fiction

There is now an increasing number of fictional narratives that explore and graphically exploit the text and the materiality of the book in its traditional format for the construction of meaning: these are what some critics call multimodal novels (Hallet 2009, p. 129; Gibbons 2012b, p. 421, among others), but which are also called visual or hybrid novels (Luke 2013, p. 21; Reynolds 1998, p. 169; Sadokierski 2010, p. 7). These narratives include a variety of semiotic resources and modes, ranging from the strategic use of different typographies and blank spaces to the inclusion of drawings, photos, maps and diagrams that do not correspond to the usual notion of illustration but are an indissoluble part of the plot, with specific functions in their contribution of meaning to the work in its multiple combinations (Mussetta 2014;[45] Mussetta, 2017a;[46] Mussetta, 2017b;[47] Mussetta 2017c;[48] Mussetta, 2020[49]).

Communication in business

In the business sector, multimodality creates opportunities for both internal and external improvements in efficiency. Similar to shifts in education to utilize both textual and visual learning elements, multimodality allows businesses to have better communication. According to Vala Afshar, this transition first started to occur in the 1980s as "technology had become an essential part of business." This level of communication has amplified with the integration of digital media and tools during the 21st century.[50]

Internally, businesses use multimodal platforms for analytical and systemic purposes, among others. Through multimodality, a company enhances its productivity and creates transparency for management. Improved employee performance from these practices can correlate with ongoing interactive training and intuitive digital tools.[51]

Multimodality is used externally to increase customer satisfaction by providing multiple platforms during one interaction. With the popularity of text, chat and social media during the 21st century, most businesses attempt to promote cross-channel engagement. Businesses aim to improve the customer experience and solve any potential issue or inquiry quickly. A company's goal with external multimodality centers on better communication in real time to make customer service more efficient.[52]

Social multimodality

One shift caused by multi-literate environments is that private-sphere texts are being made more public. The private sphere is described as an environment in which people have a sense of personal authority and are distanced from institutions, such as the government. The family and home are considered to be a part of the private sphere. Family photographs are an example of multimodality in this sphere. Families take pictures (sometimes captioning them) and compile them in albums that are generally meant to be displayed to other family members or audiences that the family allows. These once private albums are entering the public environment of the Internet more often due to the rapid development and adoption of technology.[41]

According to Luc Pauwels, a professor of communication studies at the University of Antwerp, Belgium, "the multimedia context of the Web provides private image makers and storytellers with an increasingly flexible medium for the construction and dissemination of fact and fiction about their lives."[41] These relatively new website platforms allow families to manipulate photographs and add text, sound, and other design elements.[41] By using these various modes, families can construct a story of their lives that is presented to a potentially universal audience. Pauwels states that "digitized (and possibly digitally 'adjusted') family snapshots...may reveal more about the immaterial side of family culture: the values, beliefs, and aspirations of a group of people."[41] This immaterial side of the family is better demonstrated through the use of multimodality on the Web because certain events and photographs can take precedence over others based on how they are organized on the site,[41] and other visual or audio components can aid in evoking a message.

Similar to the evolution of family photography into the digital family album is the evolution of the diary into the personal weblog. As North Carolina State University professors Carolyn Miller and Dawn Shepherd state, "the weblog phenomenon raises a number of rhetorical issues,… [such as] the peculiar intersection of the public and private that weblogs seem to invite."[22] Bloggers have the opportunity to communicate personal material in a public space, using words, images, sounds, etc. As described in the example above, people can create narratives of their lives in this expanding public community. Miller and Shepherd say that "validation increasingly comes through mediation, that is, from the access and attention and intensification that media provide."[22] Bloggers can create a "real" experience for their audience(s) because of the immediacy of the Internet. A "real" experience refers to "perspectival reality, anchored in the personality of the blogger."[22]

Digital applications

Information is presented through the design of digital media, engaging with multimedia to offer a multimodal principle of composition. Standard words and pictures can be presented as moving images and speech in order to enhance the meaning of words. Joddy Murray wrote in "Composing Multimodality" that both discursive rhetoric and non-discursive rhetoric should be examined in order to see the modes and media used to create such composition. Murray also describes the benefits of multimodality, which lends itself to "acknowledge and build into our writing processes the importance of emotions in textual production, consumption, and distribution; encourage digital literacy as well as nondigital literacy in textual practice."[2] Murray shows a new way of thinking about composition, allowing images to be "sensuous and emotional" symbols of what they represent, rather than focusing on the "conceptual and abstract."

Drawing on Richard Lanham's The Electronic Word: Democracy, Technology, and the Arts, Murray writes that "discursive text is in the center of everything we do," and that students coexist in a world that "includes blogs, podcasts, modular community web spaces, cell phone messaging…", urging that students be taught to compose with rhetorical minds in these new, and not-so-new, texts. Cultural changes, Lanham suggests, refocus writing theory toward the image, demonstrating a change in alphabet-to-icon ratios in electronic writing. A prime example can be seen in Apple's iPhone, on which "emojis" serve as icons in a separate keyboard to convey what words would have once delivered.[53] Another example is Prezi. Often likened to Microsoft PowerPoint, Prezi is a cloud-based presentation application that allows users to create text, embed video, and make visually aesthetic projects. Prezi's presentations zoom the eye in, out, up and down to create a multi-dimensional appeal. Users also utilize different media within this medium that is itself unique.

Introduction of the Internet

In the 1990s, multimodality grew in scope with the spread of the Internet, personal computers, and other digital technologies. The literacy of the emerging generation changed, becoming accustomed to text circulated in pieces, informally, and across multiple mediums of image, color, and sound. The change represented a fundamental shift in how writing was presented: from print-based to screen-based.[54] Literacy evolved so that students arrived in classrooms knowledgeable in video, graphics, and computer skills, but not in alphabetic writing. Educators had to change their teaching practices to include multimodal lessons in order to help students achieve success in writing for the new millennium.[55]

Accessing the audience

In the public sphere, multimedia popularly refers to implementations of graphics in ads, animations and sounds in commercials, and other areas of overlap. One thought process behind this use of multimedia is that, through technology, a larger audience can be reached through the consumption of different technological mediums, or, in some cases, as reported in 2010 by the Kaiser Family Foundation, it can "help drive increased consumption".[citation needed] This is a drastic change from five years earlier: "8–18 year olds devote an average of 7 hours and 38 minutes to using media across a typical day (more than 53 hours a week)."[citation needed] With the possibility of multi-platform social media and digital advertising campaigns also come new regulations from the Federal Trade Commission (FTC) on how advertisers can communicate with their consumers via social networks.[56] Because multimodal tools are often tied to social networks, it is important to engage the consumer within these fair practices. Companies like Burberry Group PLC and Lacoste S.A. (the fashion houses for Burberry and Lacoste, respectively) engage their consumers via the popular blogging site Tumblr; Publix Supermarkets, Inc. and Jeep engage their consumers via Twitter; and celebrities and athletic teams/athletes such as Selena Gomez and the Miami Heat engage their audiences via Facebook fan pages. These examples do not limit the presence of these specific entities to a single medium, but offer a wide variety of what is found for each respective source.

Advertising

Multimedia advertising is the result of animation and graphic design used to sell products or services. It takes various forms, including videos, online advertising, DVDs, CDs, and so on. These outlets give companies the ability to increase their customer base, making multimedia advertising a necessary contribution to the marketing of products and services. Online advertising, for instance, is a newer use of multimedia in advertising that provides many benefits to online companies and traditional corporations. New technologies have brought on an evolution of multimedia in advertising and a shift from traditional techniques, and multimedia advertising has become significantly more important to companies' effectiveness in marketing and selling products and services. Corporate advertising concerns itself with the idea that "Companies are likely to appeal to a broader audience and increase sales through search engine optimization, extensive keyword research, and strategic linking."[57] The concept behind an advertising platform can span multiple mediums yet, at its core, be centered around the same scheme.

Coca-Cola's advertising logo for their 2009 Open Happiness campaign

Coca-Cola ran an overarching "Open Happiness" campaign across multiple media platforms, including print ads,[58] web ads, and television commercials.[59] The purpose of this central function was to communicate a common message over multiple platforms to further encourage an audience to buy into a reiterated message. The strength of such multimedia campaigns is that they implement all available mediums, any of which could prove successful with a different audience member.[59]

Social media

Social media and digital platforms are ubiquitous in today's everyday life.[60] These platforms do not operate solely based on their original makeup; they utilize media from other technologies and tools to add multidimensionality to what will be created on their own platform. These added modal features create a more interactive experience for the user.

Prior to Web 2.0's emergence, most websites listed information with little to no communication with the reader.[61] With Web 2.0, social media and digital platforms have become part of everyday life for businesses, law offices, advertisers, and others. Digital platforms build on other mediums, technologies, and tools to further enhance and improve what can be created on their own platforms.[62]

Hashtags (#topic) and user tags (@username) make use of metadata in order to track "trending" topics and to alert users when their name is used within a post on a social media site. Used by various social media websites (most notably Twitter and Facebook), these features add internal linkage between users and themes.[63][64][65] The characteristics of a multimodal feature can be seen in the status update option on Facebook. Status updates combine the affordances of personal blogs, Twitter, instant messaging, and texting in a single feature. As of 2013, the status update button prompts a user with "What's on your mind?", a change from the 2007 prompt, "What are you doing right now?" This change was made by Facebook to promote greater flexibility for the user.[66] This multimodal feature allows a user to add text, video, image, and links, and to tag other users. Twitter's microblogging platform, with its 140-character limit per message, allows users to link to other users and websites and to attach pictures. This new media platform is affecting the literacy practice of the current generation by condensing the conversational context of the Internet into fewer characters while encapsulating several media.
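
Because hashtags and user tags function as machine-readable metadata, the brief sketch below (using an invented sample post; the patterns are illustrative, not any platform's actual parsing rules) shows how such tokens can be extracted in the way platforms index topics and mentions.

```python
# Illustrative sketch: pull hashtag and mention metadata out of a post.
# The sample text is invented; real platforms apply stricter tokenization.
import re

post = "Watching the forecast with @weatherfan before the storm #weather #multimodal"

hashtags = re.findall(r"#(\w+)", post)   # ['weather', 'multimodal']
mentions = re.findall(r"@(\w+)", post)   # ['weatherfan']

print(hashtags, mentions)
```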

Other examples include the blog, a term coined in 1999 as a contraction of "web log"; the foundation of blogging is often attributed to various people in the mid-to-late 1990s. Within the realm of blogging, videos, images, and other media are often added to otherwise text-only entries in order to generate a more multifaceted read.[67]

Gaming

One current digital application of multimodality in the field of education has been developed by James Gee through his approach to effective learning through video games. Gee contends that there is a great deal of knowledge about learning that schools, workplaces, families, and academic researchers should take from good computer and video games, such as a "whole set of fundamentally sound learning principles" that can be used in many other domains, for instance when it comes to teaching science in schools.[68]

Storytelling

Another application of multimodality is digital film-making, sometimes referred to as 'digital storytelling'. A digital story is a short film that incorporates digital images, video and audio in order to create a personally meaningful narrative. Through this practice, people act as film-makers, using multimodal forms of representation to design, create, and share their life stories or learning stories with a specific audience, commonly through online platforms. Digital storytelling, as a digital literacy practice, is commonly used in educational settings. It is also used in the media mainstream, considering the increasing number of projects that motivate members of the online community to create and share their digital stories.[69]

Multimodal methods in social science research

Multimodality is also a growing methodology being used in the social sciences. Not only do we see the area of multimodal anthropology, but there is also growing interest in this as a methodology in sociology and management.

For example, management researchers have highlighted the "material and visual turn" in organization research.[70] Going above and beyond the multimodal character of ethnographic research,[71] this growing area of research moves past textual data as a single mode, for example to understand visual communication modes and issues such as the legitimacy of new ventures.[72] Multimodality might involve spatial, aural, visual, sensual and other data, perhaps with multiple modes embedded in a material object.[73]

Multimodality can be used particularly for meaning construction; in institutional theory, for example, multimodal compositions can enhance the perceived validity of particular narratives.[74] Multimodal methods may also be used to deinstitutionalize unsustainable parts of an institution in order to sustain the institution.[75] Beyond institutional theory, we may find "multimodal historical cues" embedded in particular historical practices, highlighting the way organizations may use particular relationships to the past,[76] and multimodal discourses that allow organizations to claim legitimate yet distinctive identities, at least with visual and verbal discourses.[77] Work done under the banner of multimodality sometimes spans into experimental research, such as the finding that the judgment of investors can be highly influenced by visual information even though those individuals are relatively unaware of how much visual factors influence their decisions,[78] an area that suggests more research needs to be done on the power of memes and disinformation in visual modes driving social movements in social media.

One interesting point in this growing research area is that some researchers take the stance that multimodal research is not just about going beyond a focus on text as data, but argue that to truly be multimodal, research requires more than one modality; that is, engaging "with several modes of communication (e.g. visual and verbal, or visual and material)".[79] This seems to be a further development from researchers who align themselves with the multimodal label but then focus on a single modality, such as images, showing the interest in modalities beyond just textual data. Another interesting point for future research can be seen in contrasts, for example between multimodal and specifically "cross-modal" patterns.[80]

from Grokipedia
Multimodality in machine learning constitutes the development of computational models capable of processing, fusing, and reasoning across diverse data types or modalities, including text, images, audio, video, and sensory signals, thereby emulating aspects of human multisensory perception. This approach addresses limitations of unimodal systems by leveraging complementary information from multiple sources, enhancing tasks such as representation learning, cross-modal retrieval, and joint prediction. Early foundations emphasized modality alignment and fusion techniques, evolving into transformer-based architectures that enable scalable pretraining on vast datasets. Notable advancements include vision-language models like CLIP for zero-shot image classification and generative systems such as DALL-E for text-to-image synthesis, which have demonstrated superior performance in benchmarks for visual question answering and multimodal reasoning. Recent large multimodal models, including GPT-4o and Gemini, integrate real-time processing of text, vision, and audio, achieving state-of-the-art results in diverse applications from medical diagnostics to autonomous systems, though challenges persist in handling modality imbalances, data scarcity, and computational demands. These developments underscore multimodality's role in advancing toward generalist AI agents, with ongoing research focusing on robust fusion mechanisms and ethical alignment to mitigate amplified biases across modalities.
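
Vision-language models such as CLIP, mentioned above, learn a joint embedding space for images and text. As a minimal, hedged sketch of zero-shot image classification in that style (assuming the Hugging Face transformers library with PyTorch, the public "openai/clip-vit-base-patch32" checkpoint, and a hypothetical local image file), candidate text labels are scored against an image:

```python
# Minimal sketch: zero-shot image classification with a CLIP-style model.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# "weather_map.png" is a hypothetical local image file.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("weather_map.png")
labels = ["a weather map", "a portrait photo", "a page of text"]

# Encode both modalities; the model scores each text label against the image.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)  # shape: (1, num_labels)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The highest-scoring label is taken as the prediction without any task-specific training, which is what "zero-shot" refers to in the paragraph above.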

Core Concepts

Definition and Modes

Multimodality refers to the integration of multiple semiotic modes—linguistic, visual, aural, gestural, and spatial—in the process of meaning construction and communication, where each mode contributes distinct representational potentials rather than interchangeable functions. This approach draws from semiotic principles recognizing that communication exceeds single-channel transmission, instead leveraging the inherent affordances of diverse modes to encode and decode meaning. Affordances denote the specific possibilities and constraints each mode offers for expression, such as sequencing versus simultaneity, with modes interacting causally but retaining non-equivalent roles in overall meaning-making. The linguistic mode encompasses written and spoken words, providing precision through sequential syntax, explicit propositions, and deictic references that facilitate abstract reasoning and logical argumentation. It dominates in conveying denotative content and complex causal relations due to its capacity for disambiguation and universality in cognitive processing of propositional thought. The visual mode involves static or dynamic images, affording relational meanings through composition, color, and perspective that represent simultaneity and metaphorical associations more efficiently than linear description. The aural mode utilizes sound, music, and intonation to convey temporal flow, rhythm, and affective tone, enhancing emotional layering without visual or textual specificity. The gestural mode employs bodily movement, facial expressions, and posture to signal interpersonal dynamics and emphasis, often amplifying immediacy in proximal interactions. Finally, the spatial mode organizes elements via layout, proximity, and alignment to imply hierarchy and navigation, influencing perceptual salience independent of content. In multimodal ensembles, these modes do not merge into equivalence but interact, with empirical research revealing linguistic structures frequently anchoring interpretive stability for abstract domains, as non-linguistic modes excel in contextual or experiential cues but lack inherent tools for universal propositional encoding. This distinction underscores causal realism: while synergies amplify efficacy, substituting modes alters fidelity, with linguistic primacy evident in tasks requiring deductive precision across cultures.

Theoretical Principles

Multimodal theory examines the causal mechanisms through which distinct semiotic modes—such as text, image, sound, and gesture—interact to produce integrated meanings, rather than merely cataloging their multiplicity. Central to this is the principle of orchestration, whereby modes are coordinated in specific ensembles to fulfill communicative designs, leveraging their complementary potentials for efficient meaning transfer. For instance, empirical analyses of situated practices demonstrate that orchestration enhances interpretive coherence by aligning modal contributions to task demands, as seen in micro-sociolinguistic studies of English-medium interactions where multimodal coordination outperforms isolated modes in conveying nuanced intent. Similarly, transduction describes the transformation of meaning across modes, such as converting textual propositions into visual depictions, which preserves core semantics while exploiting modal-specific capacities; this process is empirically grounded in semiotic redesign experiments showing measurable retention of informational fidelity post-transformation. A key causal principle is that of affordances, referring to the inherent potentials and constraints of each mode arising from material and perceptual properties, independent of purely social conventions. Visual modes, for example, afford rapid pattern recognition and spatial mapping due to parallel processing in the human visual system, enabling quick detection of relational structures that text handles less efficiently; cognitive psychology data indicate visual stimuli are processed up to 60,000 times faster than text for basic perceptual tasks. Conversely, textual modes excel in sequential logical deduction and abstract precision, as their linear structure aligns with deliberate reasoning pathways, with studies showing text-based arguments yielding higher accuracy in deductive tasks than equivalent visual representations. These affordances are not arbitrary but causally rooted in neurocognitive mechanisms, as evidenced by neuroimaging revealing distinct brain regions activated by modal types—e.g., ventral streams for visual object recognition versus left-hemisphere networks for linguistic syntax—underscoring biologically constrained integration limits. Rejecting overly constructionist interpretations that attribute modal efficacy solely to cultural convention, multimodal principles emphasize verifiable causal interactions testable through controlled experiments on comprehension outcomes. Meta-analyses of affect detection across 30 studies reveal that multimodal integration improves accuracy by an average of 8.12% over unimodal approaches, attributable to synergy across modes rather than interpretive variability. In complex learning contexts, multimodal instruction yields superior performance metrics—e.g., 15-20% gains in retention—due to reduced cognitive load from distributed modal encoding, as per dual-coding models, rather than subjective social framing. This empirical realism prioritizes causal efficacy over descriptive multiplicity, highlighting how mode orchestration exploits affordances to achieve outcomes unfeasible unimodally, while critiquing constructivist overreach that downplays perceptual universals in favor of unverified claims.
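
The integration gains described above are often realized through simple fusion schemes. As an illustrative sketch only (the class scores and weights below are invented placeholders, not figures from the cited meta-analyses), late fusion combines independent per-modality probabilities into one decision by weighted averaging:

```python
# Toy late-fusion sketch: weighted average of class probabilities produced by
# separate text, audio, and video classifiers. All numbers are hypothetical.
from typing import Dict

def late_fusion(scores: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Return the weighted average of class probabilities across modalities."""
    classes = next(iter(scores.values())).keys()
    total = sum(weights.values())
    return {c: sum(weights[m] * scores[m][c] for m in scores) / total
            for c in classes}

# Hypothetical per-modality outputs for a single utterance.
scores = {
    "text":  {"positive": 0.55, "negative": 0.45},
    "audio": {"positive": 0.70, "negative": 0.30},
    "video": {"positive": 0.40, "negative": 0.60},
}
weights = {"text": 0.5, "audio": 0.3, "video": 0.2}

print(late_fusion(scores, weights))
# {'positive': 0.565, 'negative': 0.435}
```

More elaborate approaches fuse earlier, for example by concatenating learned features before classification, but the principle of pooling complementary modal evidence is the same.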

Historical Development

Pre-Digital Foundations

Early explorations of multimodality emerged in film theory during the 1920s, where Soviet director Sergei Eisenstein developed montage techniques to integrate visual and auditory elements for constructing ideological narratives. In films like Strike (1925), Eisenstein juxtaposed images of animal slaughter with scenes of worker massacres to evoke emotional and political responses, demonstrating how editing could generate meaning beyond individual shots. This approach, part of Soviet montage theory, emphasized collision of disparate elements to produce dialectical effects, though it was later critiqued for its potential to manipulate audiences through constructed associations rather than objective representation. In the 1960s, semiotician Roland Barthes advanced analysis of image-text relations in his essay "Rhetoric of the Image" (1964), identifying three messages in visual artifacts: a linguistic message from accompanying text, a coded iconic message reliant on cultural conventions, and a non-coded iconic message based on direct resemblance. Barthes argued that images possess rhetorical structures akin to language, where text anchors ambiguous visual connotations to guide interpretation, as seen in advertising where verbal labels denote specific meanings to avert polysemy. This framework highlighted multimodal synergy—visuals enhancing textual persuasion—but also underscored risks of interpretive drift without linguistic stabilization, as unanchored images yield viewer-dependent readings. Building on such insights, linguist M.A.K. Halliday's social semiotic theory of language, outlined in Language as Social Semiotic (1978), provided a foundational model for dissecting communication modes by viewing language as a multifunctional resource shaped by social contexts. Halliday posited three metafunctions—ideational (representing experience), interpersonal (enacting relations), and textual (organizing information)—which extend to non-linguistic modes, enabling analysis of how visuals, gestures, or sounds realize meanings interdependently with verbal elements. Pre-digital rhetorical studies, drawing from these principles, evidenced that multimodal texts amplified persuasive impact in contexts like political posters or theater, yet empirical observations noted heightened ambiguity when modes conflicted, with verbal clarity often mitigating visual ambiguity in audience comprehension tests.

Key Theorists and Milestones

Gunther Kress and Theo van Leeuwen's Reading Images: The Grammar of Visual Design (1996) established a foundational framework for multimodality by adapting systemic functional linguistics to visual analysis, positing that images convey meaning through representational (depicting events and states), interactive (viewer-image relations), and compositional (layout and salience) metafunctions. This approach treats visual elements as a structured "grammar" equivalent to linguistic systems, enabling causal analysis of how design choices encode meanings and social relations in advertisements, news images, and artworks. Empirical applications in discourse studies have validated its utility for dissecting power dynamics in visual texts, such as viewer positioning via gaze vectors and modality markers like color saturation. However, the model's reliance on Western conventions—such as left-to-right reading directions and ideal-real information structures—reveals causal limitations in non-Western contexts, where bidirectional scripts or holistic compositions disrupt predicted salience hierarchies.

Michael O'Toole's The Language of Displayed Art (1994) pioneered structural analyses of visual multimodality by applying systemic functional strata to artworks, dissecting ideational content (depicted actions and attributes), interpersonal engagement (viewer distance via scale), and textual cohesion (rhythmic patterns across elements). O'Toole's method causally links artistic strata to interpretive effects, arguing that disruptions in one layer (e.g., ambiguous figures) propagate meaning across others, as seen in his analyses of individual paintings. This work extended Hallidayan metafunctions to static visuals, providing tools for empirical breakdown of how formal choices realize experiential realities over subjective interpretations.

Jay Lemke advanced multimodality in the late 1990s and early 2000s through extensions to hypertext, conceptualizing meaning as emergent from "hypermodality"—the non-linear orchestration of verbal, visual, and gestural modes in digital environments. In works like Multiplying Meaning (1998), Lemke demonstrated how scientific texts integrate diagrams and prose to multiply interpretive pathways, critiquing monomodal analysis for ignoring causal interdependencies where visual vectors amplify verbal claims. His framework emphasized traversals across modes, validated in analyses of web interfaces where link structures enforce semantic hierarchies beyond sequential reading.

The New London Group's "A Pedagogy of Multiliteracies" (1996) marked a turning point by formalizing multimodality within literacy theory, urging educators to address diverse modes (visual, audio, spatial) amid globalization and technological shifts. The manifesto prioritized meaning-making as "design"—culturally negotiated through multimodal ensembles—for shaping equitable social futures, influencing curricula to integrate multimodal design over rote decoding. Yet its causal emphasis on constructed, context-bound literacies underplays evidence from developmental research on universal perceptual priors, such as infants' innate preference for structured patterns, potentially overattributing modal efficacy to social factors alone.

Shift to Digital Era

The proliferation of internet technologies after 2000 facilitated the integration of multiple semiotic modes in digital communication, with advancements in HTML and CSS enabling precise spatial layouts that combined text, images, and hyperlinks for enhanced visual and navigational affordances. This shift allowed web content to transcend static textual forms, incorporating dynamic visual elements that supported richer meaning-making processes. In the mid-2000s, Web 2.0 platforms, characterized by user-generated content and interactive features, further expanded multimodality by incorporating aural and gestural elements through embedded videos and media uploads. Sites like YouTube, launched in 2005, enabled widespread sharing of audiovisual material, blending speech, sound, and visual imagery in participatory culture. These developments democratized multimodal production, shifting from producer-dominated to user-driven content ecosystems.

The 2007 introduction of the iPhone marked a pivotal advancement in gestural multimodality, with its multi-touch capacitive screen supporting intuitive finger-based interactions such as pinching, swiping, and tapping to manipulate digital interfaces. This innovation, popularized through smartphones, integrated bodily gestures into mobile communication, amplifying multimodal engagement on social platforms where users combined text, images, videos, and touch-based navigation. While these digital affordances increased the density of communicative modes, eye-tracking research in the 2010s has demonstrated risks of cognitive overload, with users exhibiting fragmented gaze patterns and prolonged fixation durations in high-multimode environments lacking linguistic anchoring. Studies indicate that without structured textual framing to guide interpretation, the simultaneous processing of visual, auditory, and interactive elements strains working memory, potentially reducing comprehension efficacy. This causal dynamic underscores the need for design principles that balance mode density to mitigate overload in digital multimodal texts.

Applications in Communication

Media and Advertising


In traditional media and advertising, multimodality employs integrated textual, visual, and auditory modes to heighten persuasive impact through synergistic processing, where combined elements reinforce message retention and emotional resonance more effectively than isolated modes. Television commercials exemplify this by synchronizing dynamic visuals with voiceovers, music, and superimposed text, fostering deeper cognitive encoding via dual-channel stimulation of sight and hearing. Empirical investigations confirm that such multimodal configurations in TV ads enhance recall and persuasion compared to unimodal presentations, as the interplay amplifies neural engagement and associative learning.
Print advertisements similarly leverage visual imagery alongside textual slogans to boost brand recognition, with congruent mode pairings yielding superior recall and attitude formation by exploiting the perceptual primacy of images over words. Marketing analyses attribute commercial successes, such as increased sales in Coca-Cola campaigns that blended vibrant visuals, uplifting music, and aspirational text, to this mode synergy, enabling broader audience immersion and behavioral nudges toward purchase.

However, achievements in engagement must be weighed against drawbacks; while multimodality drives efficacy in mass marketing, it risks amplifying manipulative potentials when visuals evoke unchecked emotional appeals. Historical tobacco campaigns illustrate these cons, where alluring visuals overrode factual textual constraints on health risks, prioritizing sensory allure to shape perceptions. The Joe Camel series (1988–1997), featuring a stylized cartoon camel in adventurous scenarios, propelled Camel's youth market share from 0.5% to 32.8% by 1991, correlating with a 73% uptick in daily youth smoking rates amid the campaign's run. This visual dominance fostered brand affinity in impressionable demographics, bypassing rational evaluation of hazards via emotive heuristics. Causally, over-dependence on visuals correlates with elevated vulnerability to manipulation, as rapid image processing (occurring in milliseconds) primes intuitive judgments that textual qualifiers struggle to temper, potentially eroding critical consumer discernment in favor of heuristic-driven behaviors.

Social Media and Gaming

In social media platforms that emerged or evolved in the 2010s, such as Instagram (launched October 6, 2010) and TikTok (global rollout from 2017), multimodality emphasizes visual imagery, short-form videos, and overlaid audio over predominantly textual content, fostering higher user interaction and content dissemination. Video posts generate 49% more engagement than static photo posts, driven by algorithmic prioritization of dynamic formats that combine motion, sound, and captions. On TikTok, accounts with over 10 million followers achieve average engagement rates of 10.5% for video-centric content, where 72% of views occur within the first day, underscoring the rapid virality of multimodal elements like music-synced visuals and effects. These platforms' designs exploit sensory integration to boost shares and retention, with video interactions rising 42% year-over-year as of 2025, compared to slower gains in text-heavy networks.

Video games leverage multimodality through synchronized visual rendering, directional audio cues, haptic feedback, and gesture-based inputs via controllers or motion tracking, creating immersive environments that heighten player participation since the mainstream adoption of 3D graphics in the 1990s and VR hardware such as the Oculus Rift, announced in 2012. Immersive VR applications post-2010 have demonstrably improved spatial reasoning, with studies showing enhanced comprehension of 3D transformations and problem-solving in graphics-related tasks among learners exposed to such systems. Engagement metrics indicate sustained motivation and performance gains across sessions, attributed to the technology's core traits of immersion, interaction, and sensory imagination. However, empirical data link excessive multimodal gaming to addictive patterns that impair deeper cognitive processes, including reduced attention for reading and lower comprehension scores. Research on elementary and adolescent players finds gaming addiction correlating with deficits in attention, memory, and academic skills, potentially exacerbating declines in textual literacy amid prolonged exposure to visually dominant interfaces. While action-oriented games may bolster certain perceptual speeds, the overall shift toward immersive, non-linear experiences has drawn criticism for diminishing sustained attention, with addicted users showing systematically poorer learning outcomes.

Storytelling and Fiction

Multimodality in storytelling and fiction integrates linguistic text with visual, auditory, or gestural elements to construct narratives, as exemplified in comics and graphic novels where sequential images convey action and emotion alongside written dialogue and narration. Scott McCloud's Understanding Comics (1993) delineates how comics juxtapose icons and words to form a vocabulary of visual language, enabling abstraction levels from realistic depiction to symbolic representation that enhance thematic depth without relying solely on prose. This fusion allows creators to depict internal states or temporal shifts more efficiently than text alone, as visuals handle spatial and affective cues while text anchors causal exposition.

Post-2000s digital developments, such as webtoons originating in South Korea around 2003, exemplify transmedia extensions of multimodal fiction through vertically scrolling formats optimized for mobile devices, combining static or animated panels with overlaid text and sound effects. These platforms enable serialized narratives that adapt across media, like webtoons spawning live-action adaptations, thereby expanding audience engagement via iterative mode layering—initially image-text hybrids evolving into interactive or cross-platform experiences. Such integration yields richer immersion, with multimodal features eliciting stronger reading-induced imagery and emotional arousal compared to unimodal text, as foregrounded visuals and stylistic devices activate broader cognitive processing in empirical reader-response studies. Creators leverage this for evocative world-building, where synchronized modes amplify sensory realism and mnemonic retention, fostering prolonged reader investment in fictional universes.

Yet empirical analyses reveal coherence challenges when modes fall out of balance, such as visuals overwhelming textual plot drivers, leading to fragmented comprehension of scene transitions and event causality; studies on multimodal cohesion demonstrate that weak inter-modal links correlate with higher error rates in interpreting sequential events. Over-reliance on non-linguistic modes can dilute focus on the logical chains essential to plot progression, with viewer experiments showing increased segmentation and explanatory deficits in low-cohesion multimodal sequences, underscoring the primacy of linguistic precision for sustaining causal realism in narrative. Balanced multimodality thus demands deliberate orchestration to avoid confounding the audience, prioritizing textual anchoring for sequential fidelity over ornamental divergence.

Educational and Pedagogical Uses

Multiliteracies Framework

The Multiliteracies Framework emerged from the New London Group's 1996 manifesto, which argued for expanding literacy pedagogy to encompass diverse modes of meaning-making amid globalization and technological shifts. The group, comprising educators including Courtney Cazden, Bill Cope, and Mary Kalantzis, posited that traditional alphabetic literacy insufficiently addressed the "multiplicity of communications channels and media" and varied cultural "lifeworlds," necessitating a pedagogy that integrates linguistic, visual, audio, gestural, and spatial modes to equip learners for designing social futures. This causal claim rested on observations of economic restructuring and digital proliferation, though the framework's theoretical emphasis on adaptation lacked contemporaneous empirical validation of improved outcomes over text-centric methods.

Central to the framework are four pedagogical components: situated practice for authentic immersion in multimodal contexts; overt instruction to develop metalanguages for analyzing mode-specific designs; critical framing for examining power dynamics in texts; and transformed practice for learners to redesign and produce hybrid artifacts. These components enable "designing" texts as active processes, where meaning arises from orchestrated modes rather than isolated decoding, aiming to foster agency in diverse communicative ecologies. Proponents credit this with enhancing creative expression, as seen in applications promoting student-generated digital narratives that blend text and visuals to negotiate identities.

However, empirical assessments reveal gaps in the framework's efficacy for core literacy gains. A 2018 review of multiliteracies studies found that while self-reported benefits in engagement and multimodal production were common, rigorous quantitative evidence of sustained reading or writing proficiency was sparse, with many investigations hampered by small samples, lack of controls, and qualitative dominance over randomized designs. This aligns with broader reading research indicating that foundational alphabetic decoding—prioritized in text-based instruction—predicts comprehension more reliably than multimodal exposure alone, as non-linguistic modes often presuppose textual literacy for precision and abstraction. The normalization of modal equivalence overlooks causal hierarchies where weak print skills undermine multimodal interpretation, per reading-efficacy meta-analyses predating and postdating the framework.

Classroom Implementation

In K-12 classrooms, multimodal implementation involves embedding visual, auditory, and interactive elements into core subjects, such as using software like VoiceThread or Adobe Spark—tools popularized after 2010—to combine text, images, and narration for student projects on historical events or scientific processes. These methods align with curriculum reforms, including aspects of the U.S. Common Core State Standards introduced in 2010, which encourage producing multimedia presentations to demonstrate understanding, as seen in state-level adoptions requiring students to integrate digital media in English language arts and other subjects.

Randomized controlled trials indicate short-term boosts in engagement from multimodal approaches, with one experiment showing that multiple content representations—such as text paired with visuals—increased engagement and immediate recall by up to 20% compared to text-only instruction. Similarly, interventions using multimodal narratives for language development have demonstrated causal improvements in narrative coherence among elementary students, as measured pre- and post-intervention in controlled settings. However, these gains often demand substantial resources, including teacher training and technology access, which strain underfunded districts; a 2021 analysis of educational technology programs noted that without adequate support, multimodal adoption exacerbates inequities, with only 60% of U.S. K-12 schools reporting sufficient devices for multimodal tasks by 2020.

Critiques highlight potential distractions from foundational skills, as multimodal emphasis can dilute focus on linguistic proficiency; controlled studies on reading interventions reveal that while visual aids enhance initial comprehension, long-term analytical writing suffers without explicit writing instruction, with effect sizes dropping below 0.3 after six months in groups prioritizing non-text modes. National assessments like the NAEP, tracking K-12 reading proficiency, show stagnant scores since 2010 despite widespread multimodal adoption, suggesting causal links to reduced emphasis on deep textual analysis, with correlational data showing districts heavy in digital tools exhibiting 5-10% declines in advanced writing metrics. Empirical reviews underscore that short-term engagement from such tools does not consistently translate to sustained gains, particularly when causal pathways overlook the primacy of verbal modes for abstract reasoning.

Higher Education Practices

In higher education, multimodal assignments such as digital portfolios and video essays gained prominence in the 2010s, driven by the integration of digital tools into curricula. These practices involve students combining textual analysis with visual, auditory, and interactive elements to articulate complex ideas, often in composition, communication, and interdisciplinary courses. For example, institutions have adopted standardized formats like infographics, podcasts, and research posters to evaluate synthesis of information across modes. Video essays, produced using editing software, enable demonstration of analytical depth through narrated argument, assessing not only content but also production skills. Such assignments aim to build interdisciplinary competencies, including digital literacy and creative expression, preparing students for media-saturated professional environments. Empirical studies attribute achievements to these practices, such as improved soft skills like creativity and critical thinking in STEM fields; a 2022 analysis of video essays and podcasts in engineering education found enhanced higher-order skills via multimodal reflection.

However, verifiable efficacy remains debated, with 2020s research highlighting mixed outcomes for core higher education demands like abstract reasoning. A 2023 comparative study of traditional monomodal writing and digital multimodal composition reported longer texts in multimodal tasks and gains in both formats, but no conclusive evidence of superior critical engagement in multimodal approaches, suggesting traditional essays may better enforce linear argumentation and depth. Comprehensive reviews of multimodal applications in education note persistent assessment challenges, including difficulty quantifying contributions from non-textual modes, which can lead to inconsistent evaluation of analytical rigor.

From a causal perspective, higher education prioritizes textual precision for fostering independent abstract thought, where multimodality aids comprehension—particularly for students who respond to visual or auditory material—but functions as adjunctive rather than foundational. Critics observe that overemphasis on aesthetic elements risks superficiality, with repetitive mode reinforcement failing to advance beyond basic synthesis, potentially undermining the causal primacy of rigorous written argumentation in developing sustained analysis. While no large-scale data directly tie multimodal grading to grade inflation, broader concerns about alternative assessments warn of leniency in rubric application, where production values may inflate scores absent strict content benchmarks. Empirical gaps persist, as academic sources—often from education-related fields—predominantly advocate multimodal integration, warranting skepticism toward unsubstantiated claims of transformative impact over traditional methods.

Social and Cultural Dimensions

Multimodal Communities

Multimodal communities consist of groups, both online and offline, where participants engage through integrated modes of communication such as text, images, videos, and memes, shaping distinct norms for interaction and identity expression. In online fandoms, for instance, members of the Star Wars community utilize memes for redistribution, recontextualization, and remediation, fostering group identity and shared understanding through these visual-text hybrids. Similarly, participatory meme culture in broader fandoms enables active remixing of content, driving cohesion by allowing fans to contribute to collective expansions of source material via viral visuals and videos. Empirical variations in mode preferences emerge across demographics, with younger users aged 18-24 exhibiting a stronger inclination toward image-led platforms over text-dominant ones, reflecting generational shifts in expressive norms. Racial and ethnic differences also influence preferences; for example, Instagram garners higher usage among Latino users than among white users, indicating how multimodal elements like photos and short videos align with cultural interaction styles. These patterns underscore causal dynamics where shared multimodal artifacts—such as fandom videos—strengthen social bonds by signaling in-group affiliation, yet they can prioritize emotional resonance over analytical depth.

While multimodal practices enhance community building by facilitating rapid idea spread and belonging, they also exacerbate echo chambers, particularly in visual-heavy environments where content reinforces preexisting views more potently than text-based discussion. Studies of short-video platforms like TikTok reveal pronounced echo-chamber effects through algorithmic amplification of similar visuals, reducing exposure to diverse perspectives compared to text forums. In high-visual groups, deliberative quality suffers, as evidenced by online deliberation yielding lower knowledge gains and tolerance than text-mediated exchanges, with participants showing diminished critical engagement due to the emotive pull of images over argumentative parsing. This visual bias causally tilts discourse toward affirmation rather than contestation, amplifying insularity in communities reliant on memes and videos.

Business and Professional Contexts

In corporate presentations and reports, multimodality integrates text, visuals, charts, and sometimes audio to distill complex data, a practice accelerated by the adoption of PowerPoint software following its 1987 debut and the proliferation of data-visualization tools in the 1990s. These formats enable professionals to layer quantitative metrics with explanatory graphics, as seen in annual reports where financial tables are paired with trend visualizations to highlight performance drivers. Empirical evidence from Mayer's cognitive theory of multimedia learning supports this approach, showing that combining verbal and visual elements yields superior comprehension over text-only formats, with experimental groups demonstrating measurably higher problem-solving accuracy in professional training scenarios.

Benefits include enhanced persuasion and retention, particularly for data-heavy communications like sales pitches or strategy briefings. Mayer's principles, derived from controlled studies, indicate that multimodal designs reduce extraneous cognitive load by aligning visuals with narration, leading to retention improvements in tasks relevant to executive decision-making. For instance, integrating infographics in reports clarifies causal relationships in market analyses, with business applications reporting faster audience buy-in during boardroom sessions compared to unimodal alternatives. However, adoption hinges on verifiable efficiency gains, such as shortened meeting times or elevated close rates in client interactions, rather than unproven mandates for broader "inclusivity" without corresponding return-on-investment data.

Drawbacks arise from visual dominance, where overreliance on slides and graphics can obscure nuances or foster superficial judgments, as evidenced in critiques of presentation cultures that prioritize polish over analytical depth. Misinterpretation risks are amplified in high-stakes contexts like financial disclosures, where poorly scaled charts have led to erroneous investor assumptions, underscoring the need for rigorous validation. Cognitive overload from excessive modes further hampers efficacy, with studies on virtual meetings revealing that unintegrated visuals disrupt verbal flow and reduce decision accuracy. Businesses mitigate these risks through principles like Mayer's coherence guideline, which advocates stripping extraneous elements to maintain focus on core metrics. Efficiency metrics underscore multimodal value when tied to outcomes: internal communications leveraging visuals correlate with higher employee engagement, though quantification demands linking such gains to concrete business KPIs. In strategy sessions, multimodal tools facilitate quicker consensus on causal factors, but only where empirical testing confirms net gains over simpler modes, reflecting a pragmatic orientation toward evidence-based communication design.

Cultural Variations

High-context cultures, such as those in East Asia (e.g., Japan and China), integrate gestural, visual, and nonverbal modes more prominently alongside implicit verbal elements to convey meaning through relational context, whereas low-context cultures, including Anglo-American societies, prioritize explicit linguistic content with secondary reliance on other modes for clarity. This distinction, rooted in Edward T. Hall's framework, receives empirical support from comparative analyses of communication artifacts. A 2020 study of user instructions found that Chinese manuals emphasize visuals (e.g., diagrams and images) to a greater degree than Western manuals, which favor textual explanations, indicating a cultural preference for multimodal density in high-context settings to leverage shared contextual knowledge. Similarly, cross-national experiments reveal that East Asians detect changes in peripheral visual contexts more readily than Americans, who focus on central objects, demonstrating how cultural norms shape attention to multimodal elements while interacting with universal cognitive mechanisms like Gestalt perception. These patterns align with Hofstede's cultural dimensions, where high collectivism and long-term orientation in Asian societies correlate with holistic processing that amplifies visual and relational modes over individualistic verbal linearity.

Such variations contribute to global communication challenges, including misinterpretations in cross-cultural exchanges where low-context participants undervalue implicit gestural cues, as evidenced by higher error rates in decoding indirect messages among Westerners exposed to East Asian nonverbal styles. Successful adaptations include multinational campaigns that hybridize explicit text for Western markets with contextual visuals for Asian audiences, reducing comprehension gaps by 20-30% in empirical ad-recall tests. However, dominant multimodal theories, originating from Western scholars like Kress and van Leeuwen, face criticism for ethnocentric bias—assuming linguistic modes as foundational while marginalizing non-Western visual-gestural systems prevalent in high-context traditions—potentially skewing analyses due to academia's systemic underrepresentation of Eastern empirical data.

Computational and AI Perspectives

Early Digital Implementations

The World Wide Web, proposed by Tim Berners-Lee in 1990 at CERN, marked an early digital implementation of hypermedia by linking hypertext with elements such as images and later audio, enabling non-linear navigation across distributed content. In the mid-1990s, CD-ROM technology proliferated hypermedia applications, particularly in education, where software integrated text, static graphics, sound clips, and basic animations into interactive environments like digital encyclopedias and exploratory simulations. These systems represented a departure from purely textual interfaces, allowing users to engage multiple sensory modes simultaneously for enhanced engagement and learning.

Adobe Flash, launched as FutureSplash Animator in December 1996 and acquired by Macromedia (itself acquired by Adobe in 2005), facilitated dynamic multimodal content on the web through vector-based animations, scripting for interactivity, and support for video and audio streams. By the early 2000s, Flash powered interactive learning tools, such as browser-based simulations and games that combined visual, auditory, and kinetic elements to simulate real-world scenarios, achieving widespread adoption for its compact file sizes relative to raster alternatives. Achievements included improved user engagement in pedagogical contexts, with empirical evaluations showing that well-structured multimodal interfaces could boost retention rates compared to text-only formats, provided synchronization between modes was maintained.

Despite these advances, early implementations faced significant constraints from limited bandwidth, with dial-up connections averaging 28-56 kbps in the late 1990s, resulting in download times exceeding several minutes for even modest video or animated files. Usability studies from the era revealed mode-overload issues, where excessive simultaneous presentation of text, visuals, and audio led to diminished comprehension and increased cognitive load, as users struggled to process competing informational channels without integrated design principles. This period's transition from static pages—prevalent in the early 1990s—to dynamic, script-driven content via tools like Flash and early server-side technologies in the late 1990s and early 2000s established foundational patterns for multimodal integration, emphasizing the need for selective modality use to mitigate technical and perceptual limitations ahead of broader broadband accessibility.

Modern Multimodal AI

Modern multimodal AI systems, emerging prominently after 2020, integrate processing of diverse data types such as text, images, audio, and video within unified frameworks, enabling more holistic understanding and generation compared to unimodal predecessors. OpenAI's GPT-4, released in March 2023, introduced vision capabilities to the GPT-4 architecture, allowing analysis of images alongside text for tasks like visual question answering. xAI's Grok-1.5V, previewed in April 2024, extended the Grok series with multimodal vision processing for documents, diagrams, and photos, achieving competitive performance in real-world spatial understanding benchmarks. These developments marked a shift toward models handling multiple inputs natively, with subsequent releases like OpenAI's GPT-4o in May 2024 incorporating real-time audio, vision, and text in a single architecture.

By 2025, advances emphasized unified architectures capable of seamless cross-modal reasoning, such as Google's Gemini series and extensions in models like Qwen2.5-VL, which process text, images, and video through shared transformer-based encoders to reduce modality silos. These systems leverage techniques like cross-attention mechanisms to align representations across data types, facilitating applications in dynamic environments. Market growth reflected this momentum, with the global multimodal AI sector valued at approximately USD 1.0 billion in 2023 and projected to reach USD 4.5 billion by 2028 at a compound annual growth rate (CAGR) of 35%, driven by demand in sectors requiring integrated perception.

In real-world applications, multimodal AI has demonstrated efficacy in medical diagnostics by fusing imaging data with textual clinical records; for instance, models evaluated on NEJM Image Challenges achieved accuracies surpassing individual modalities alone, aiding in diagnosing conditions from chest X-rays and electronic health records. Empirical benchmarks from 2024-2025 highlight superior performance in image captioning, where models like GPT-4o and NVLM-D-72B outperform prior systems on datasets emphasizing detailed descriptions, with correlation to human evaluations exceeding 90% in automated metrics. However, causal limitations persist, particularly hallucinations—outputs inconsistent with input visuals or facts—arising from training data discrepancies and alignment failures, affecting up to 82% of responses in some evaluations and undermining reliability in high-stakes domains. Ongoing research focuses on mitigation through holistic aggregation and open-set detection protocols to enhance factual grounding across modalities.
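The cross-attention alignment mentioned above can be illustrated with a minimal sketch. The example below uses PyTorch's `torch.nn.MultiheadAttention` to let hypothetical text-token embeddings attend over image-patch embeddings; the tensor shapes, dimensions, and placeholder encoders are illustrative assumptions, not the internals of any particular model named in this section.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from any specific model).
text_len, image_patches, d_model, n_heads = 16, 49, 512, 8

# Hypothetical pre-computed embeddings from separate text and vision encoders.
text_tokens = torch.randn(1, text_len, d_model)        # (batch, tokens, dim)
image_feats = torch.randn(1, image_patches, d_model)   # (batch, patches, dim)

# Cross-attention: text queries attend over image keys/values, producing
# text representations conditioned on visual content.
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                   batch_first=True)
fused, attn_weights = cross_attn(query=text_tokens,
                                 key=image_feats,
                                 value=image_feats)

print(fused.shape)         # torch.Size([1, 16, 512]): one fused vector per text token
print(attn_weights.shape)  # torch.Size([1, 16, 49]): how each token weights each patch
```

In a full model this block would sit inside a transformer layer and be trained end to end; the sketch only shows how the query/key/value roles map text onto image features.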

Technical Challenges and Advances

One primary technical challenge in multimodal AI systems is the alignment of disparate modalities, such as vision and language, where representations must be mapped into a shared embedding space to enable cross-modal understanding and transfer. This often involves addressing gaps in semantic correspondence, as visual features like spatial hierarchies differ fundamentally from textual sequential structures, leading to inefficiencies in joint reasoning tasks. Fusion techniques—categorizable as early (input-level concatenation), late (decision-level aggregation), or intermediate (feature-level integration)—further complicate this, requiring mechanisms to weigh modality contributions dynamically without losing inter-modal correlations. Data scarcity exacerbates these issues, particularly for paired multimodal datasets that capture rare real-world combinations, resulting in models prone to overfitting or poor generalization when modalities are missing or noisy. Real-world data often exhibit heterogeneity, with incomplete entries (e.g., text without images) demanding imputation or robust handling, which current parametric approaches struggle with in low-data regimes due to reliance on large-scale pretraining.

Advances in contrastive learning, such as OpenAI's CLIP model released in January 2021, mitigate alignment challenges by pretraining on 400 million image-text pairs and performing zero-shot prediction, enabling scalable vision-language transfer without task-specific fine-tuning. Recent fusion innovations from 2024 onward emphasize transformer architectures for intermediate fusion, incorporating dynamic gating to adaptively prioritize modalities based on input context, as seen in approaches like Dynamic Multi-Modal Fusion for materials science tasks. These build on transformer scalability, leveraging self-attention for parallel processing of multimodal tokens, though quadratic complexity in sequence length imposes compute limits, capping practical dense models at around 1-10 trillion parameters without hardware breakthroughs. Benchmarks like MMMU, introduced in November 2023, evaluate these advances through 11,500 multi-discipline questions requiring college-level reasoning across six disciplines, revealing persistent gaps where even leading models score below 60% accuracy compared to human experts at 72-88%.

Despite progress, bias amplification remains a drawback, as multimodal fusion can exacerbate imbalances from individual modalities—e.g., visual stereotypes reinforcing textual prejudices—necessitating causal debiasing techniques beyond mere filtering. Ethical practices are constrained by sourcing limitations, with over-reliance on web-scraped corpora ignoring consent and provenance, underscoring the need for verifiable, diverse data curation to ensure causal fidelity over correlative hype in scaling narratives. Overall, while transformers facilitate modality-agnostic architectures, fundamental compute and data bottlenecks highlight that unchecked optimism overlooks realities like power constraints projected to halt exponential scaling by 2030 without paradigm shifts.
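As a rough illustration of the contrastive alignment idea behind CLIP-style pretraining, the sketch below computes a symmetric cross-entropy loss over cosine-similarity logits for a toy batch of paired image and text embeddings. The random embeddings stand in for encoder outputs and the temperature value is an assumption; this is a minimal sketch of the technique, not the published implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched image-text pairs on the diagonal
    are pulled together, mismatched pairs pushed apart."""
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by temperature: (batch, batch).
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th caption.
    targets = torch.arange(logits.size(0))

    loss_img_to_txt = F.cross_entropy(logits, targets)      # rows index images
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)  # rows index texts
    return (loss_img_to_txt + loss_txt_to_img) / 2

# Toy batch of 8 paired embeddings standing in for encoder outputs.
images = torch.randn(8, 512)
texts = torch.randn(8, 512)
print(clip_style_contrastive_loss(images, texts))
```

Minimizing this loss over many paired examples is what pushes both modalities into the shared embedding space discussed above, which in turn enables zero-shot classification by comparing an image embedding against embeddings of candidate text labels.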

Research Methodologies

Analytical Approaches

Systemic functional multimodal discourse analysis (SF-MDA) extends systemic functional linguistics to examine how multiple semiotic modes—such as language, visuals, and layout—construct meaning in artifacts like advertisements or websites, building on the foundational work of Gunther Kress and Theo van Leeuwen in their 2001 book Multimodal Discourse. This approach treats modes as social semiotic resources with distinct grammars, avoiding assumptions of equipollence where modes contribute equally to overall meaning, and instead emphasizes their hierarchical or complementary roles in ideational, interpersonal, and textual metafunctions. SF-MDA enables causal hypothesis testing by dissecting how modal interactions generate specific interpretive effects, testable through structured coding of representational structures, such as vectorial relations in images or process types in text-image hybrids.

Analytical steps typically begin with mode identification, cataloging elements like linguistic syntax, color palettes, or gestural cues in a multimodal text, followed by interaction mapping to trace how these modes co-construct significance—for instance, how verbal emphasis reinforces visual salience. Tools such as ELAN, developed by the Max Planck Institute for Psycholinguistics since 2001, facilitate this by enabling time-aligned annotations of video or audio data across tiers for gestures, speech, and visuals, supporting precise temporal linkage of modal contributions. Empirical validity is assessed via inter-coder reliability metrics, such as Cohen's kappa, applied in studies of multimodal annotations; for example, analyses of metaphorical mappings in visuals yield reliability scores above 0.70 when coders are trained on shared metafunctional criteria, confirming replicable hypothesis testing on meaning emergence. This rigor distinguishes SF-MDA from less formalized methods by grounding causal claims in observable semiotic patterns rather than subjective intuition.
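Inter-coder agreement of the kind reported above can be computed with standard statistical tooling. The sketch below uses scikit-learn's `cohen_kappa_score` on two hypothetical coders' labels for ten image segments; the label scheme and data are invented for illustration and do not come from any cited study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by two trained coders to the same ten image
# segments (e.g., N = narrative representation, C = conceptual representation).
coder_a = ["N", "N", "C", "N", "C", "C", "N", "N", "C", "N"]
coder_b = ["N", "N", "C", "C", "C", "C", "N", "N", "C", "N"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected agreement; ~0.70+ is usually read as acceptable
```

Kappa corrects raw percent agreement for the agreement expected by chance given each coder's label distribution, which is why it is preferred over simple overlap when validating multimodal annotation schemes.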

Empirical Studies in Social Sciences

Empirical investigations in the social sciences have quantified the influence of multimodal communication—integrating text, images, and videos—on collective behaviors, such as protest mobilization, using datasets from social media platforms. A 2024 analysis of Twitter activity during various social movements revealed that posts incorporating images or videos achieved significantly higher levels of audience engagement, including likes, retweets, and replies, compared to purely textual equivalents, thereby accelerating the spread of activist frames across networks. This quantitative edge stems from visuals' capacity to evoke rapid emotional responses and simplify complex narratives, as evidenced by regression models controlling for post timing and user influence.

Experimental designs offer causal insights into multimodality's mobilizing effects. In a study involving 143 German university students, exposure to emotional images embedded in news articles—tracked via eye-fixation duration—increased participants' self-reported willingness to engage in political action, with positive emotions like fascination showing stronger effects than negative ones; for high-interest individuals, each additional second of image viewing boosted intent by 0.037 units on a standardized scale. Such pre-post manipulations isolate visual stimuli's direct impact, contrasting with correlational field data where multimodality correlates with turnout but confounds like network effects obscure causality.

In information diffusion, multimodal formats amplify the persistence of misinformation alongside legitimate content. A mixed-methods review of 96 misinformation instances from early 2020 identified visuals as key amplifiers: 39% illustrated claims for heightened recall, 52% masqueraded as evidence through mislabeling (35%) or manipulation (10%), and 9% impersonated sources to confer false authority, exploiting indexical trust in images to evade textual scrutiny. Complementing this, a 2023 analysis of COVID-related image tweets found that misinformation-bearing visuals sustained longer diffusion timelines and burst durations than neutral counterparts, though interaction volumes remained comparable, attributing endurance to multimodal resonance with partisan audiences—e.g., stronger uptake of such visuals among pro-Republican users.

These findings underscore multimodality's dual role in enhancing message efficacy while elevating mis- and disinformation risks, yet debates persist over methodological rigor. Observational studies dominate due to data availability, but they struggle with endogeneity and selection bias, favoring experimental or quasi-experimental approaches for robust causal claims; overreliance on qualitative multimodal discourse analysis risks interpretive subjectivity absent quantitative benchmarks like engagement metrics or randomized exposures. Peer-reviewed outlets prioritize such hybrid validations, though institutional biases in the social sciences toward narrative-driven analyses may underemphasize null or adverse multimodal outcomes.
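A minimal sketch of the kind of engagement regression described above is shown below, using statsmodels OLS on synthetic post-level data with a has_visual indicator and controls for posting hour and follower count. The variable names, coefficients, and data are illustrative assumptions, not a replication of any cited study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Synthetic post-level data standing in for a scraped platform dataset.
df = pd.DataFrame({
    "has_visual": rng.integers(0, 2, n),        # 1 if the post contains an image/video
    "hour": rng.integers(0, 24, n),             # posting-time control
    "log_followers": rng.normal(7, 1.5, n),     # author-influence control
})
# Simulated outcome: visuals add a positive bump to log engagement.
df["log_engagement"] = (0.6 * df["has_visual"]
                        + 0.3 * df["log_followers"]
                        + 0.01 * df["hour"]
                        + rng.normal(0, 1, n))

X = sm.add_constant(df[["has_visual", "hour", "log_followers"]])
model = sm.OLS(df["log_engagement"], X).fit()
print(model.summary().tables[1])  # coefficient on has_visual estimates the multimodal boost
```

The point of the controls is to separate the visual-format effect from confounds such as when a post appears and how influential its author is; with observational data this still falls short of the experimental designs discussed above.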

Criticisms and Debates

Theoretical Limitations

Multimodal theory posits that diverse semiotic modes—such as linguistic, visual, and gestural—contribute equivalently to meaning construction, yet this assumption overlooks the primacy of language in structuring complex communication. Formal linguistic perspectives emphasize language's unique capacity for recursion, generativity, and propositional precision, which non-linguistic modes cannot fully replicate without hierarchical subordination to verbal syntax. In multimodal interactions, linguistic elements often provide the causal framework that organizes and disambiguates other modes, rather than modes operating as interchangeable equals; treating them as such risks diluting analytical rigor by ignoring how thought and reference fundamentally rely on linguistic universals.

A core theoretical limitation lies in the framework's descriptive rather than predictive nature, compounded by subjectivity in interpreting non-linguistic modes. Unlike linguistic analysis, which benefits from standardized grammars and cross-cultural universals, visual and gestural elements admit highly variable readings influenced by context, culture, and analyst bias, rendering multimodal claims difficult to falsify objectively. This lack of falsifiable hypotheses stems from the absence of universal metrics for mode integration, leading to post-hoc rationalizations rather than testable propositions about causal interactions between modes. Empirical validation is further hampered, as studies in psycholinguistics reveal that while multimodal cues can facilitate basic processing, they do not yield net cognitive advantages in abstract or complex tasks where unimodal linguistic input suffices for hierarchical reasoning.

Critics argue that multimodal theory's normalization of mode interchangeability neglects causal realities, such as interference arising from non-hierarchical fusion, where non-linguistic elements may introduce noise without enhancing propositional depth. The equipotence assumption, for instance, fails to account for scenarios where linguistic primacy determines interpretive outcomes, as non-verbal modes derive meaning primarily through verbal anchoring. Theoretical sparsity exacerbates this, with empirical multimodal research outpacing foundational models that rigorously delineate mode dependencies, often resulting in unfalsifiable generalizations about "integrated wholes" without specifying integration mechanisms. Such limitations underscore the need for first-principles reevaluation prioritizing empirical rigor over expansive semiotic inclusivity.

Empirical Critiques

Empirical studies in cognitive psychology and human factors have identified conditions under which multimodal processing yields inferior outcomes to unimodal approaches, particularly when multiple sensory inputs exceed available cognitive resources. According to multiple resource theory, combining modalities such as visual and auditory can overload parallel processing channels, resulting in degraded performance on complex tasks compared to single-modality presentations. For instance, experiments measuring task accuracy and response times have shown that bimodal stimuli under high-load conditions amplify interference effects, leading to higher error rates than unimodal equivalents.

In communication contexts, eye-tracking data reveal patterns of overload from multimodal inputs, with participants exhibiting increased fixation durations and saccade regressions indicative of processing strain. One study using mobile eye-tracking during stressful multimodal interactions found elevated load indicators, including dilated pupils and fragmented gaze patterns, which correlated with reduced task efficiency and heightened stress. These findings underscore opportunity costs, as divided attention across modalities diverts resources from deep comprehension, favoring concise unimodal text for sustained retention in information-dense scenarios.

Educational applications face similar scrutiny, with longitudinal trends questioning multimodal efficacy amid declines in core skills. The OECD's 2022 PISA assessment reported historic drops in reading literacy—averaging 15 points across participating countries from pre-pandemic baselines—coinciding with expanded digital multimodal curricula since 2015, potentially at the expense of foundational decoding and fluency drills. While randomized trials on specific interventions are limited, aggregated data from high-tech adoption periods show negligible gains in basic proficiency, attributing stagnation to extraneous load from unoptimized multimedia design that fragments focus on essentials. This suggests that hype around multimodality overlooks trade-offs, where integration without rigorous design principles incurs net losses in skill mastery over 2015–2025.

Controversies in Application

The proliferation of multimodal deepfakes, which integrate manipulated audio, video, and text since their emergence around 2017, has fueled controversies over misinformation and the erosion of public trust. In the context of U.S. elections leading up to 2020, concerns mounted that such technologies could fabricate political scandals or endorsements, prompting California's Assembly Bill 730 in 2019 to prohibit deepfakes intended to influence campaigns, though the law lapsed in 2021. Empirical experiments demonstrate that exposure to deepfakes depicting public figures in fabricated compromising situations leads to measurable declines in trust toward government institutions and media credibility, with participants showing reduced confidence even when aware of potential fabrication.

Political advertising during recent U.S. presidential cycles exemplified exploitation of multimodal visuals, where campaigns combined imagery, music, and symbolic elements to amplify persuasive impact beyond textual arguments. Analyses of election posters and videos reveal strategies emphasizing visual rhetoric—such as emotive imagery paired with selective text—to sway voter perceptions, often prioritizing affective appeal over factual substantiation. These tactics, while not always involving outright fabrication, contributed to polarized discourse by leveraging the higher memorability and emotional potency of visual modes, as evidenced in rhetorical breakdowns of campaign materials.

Applications of multiliteracies pedagogy, which advocate multimodal instruction to promote equity in literacy outcomes, have drawn criticism for overlooking merit-based skill hierarchies and failing to empirically close gaps across diverse socioeconomic groups. Despite theoretical claims of inclusivity through diverse modes like visuals and digital texts, studies indicate persistent disparities in proficiency and attainment, with lower-income or minority students showing limited gains in standardized outcomes relative to peers, suggesting causal factors like foundational skill deficits remain unaddressed. On visual-dominant platforms such as Instagram and TikTok, data underscore the concentration of attention, where a small fraction of high-follower accounts—often aligned with established influencers—command the majority of views and interactions under power-law distributions, undermining assertions that multimodal tools inherently democratize communication and instead highlighting entrenched inequalities in visibility and influence.
