Multimodal interaction
Multimodal interaction provides the user with multiple modes of interacting with a system. A multimodal interface provides several distinct tools for input and output of data. Multimodal human-computer interaction involves natural communication with virtual and physical environments. It facilitates free and natural communication between users and automated systems, allowing flexible input (speech, handwriting, gestures) and output (speech synthesis, graphics). Multimodal fusion combines inputs from different modalities, addressing ambiguities. Two major groups of multimodal interfaces focus on alternate input methods and combined input/output. Multiple input modalities enhance usability, benefiting users with impairments. Mobile devices often employ XHTML+Voice for input. Multimodal biometric systems use multiple biometrics to overcome limitations. Multimodal sentiment analysis involves analyzing text, audio, and visual data for sentiment classification. GPT-4, a multimodal language model, integrates various modalities for improved language understanding. Multimodal output systems present information mainly through visual and auditory cues, and some also use touch and olfaction. Multimodal fusion integrates information from different modalities, employing recognition-based, decision-based, and hybrid multi-level fusion. Ambiguities in multimodal input are addressed through prevention, a-posterior resolution, and approximation resolution methods.
Introduction
Multimodal human-computer interaction refers to the "interaction with the virtual and physical environment through natural modes of communication".[1] This implies that multimodal interaction enables freer and more natural communication, interfacing users with automated systems in both input and output.[2] Specifically, multimodal systems can offer a flexible, efficient and usable environment that allows users to interact through input modalities, such as speech, handwriting, hand gesture and gaze, and to receive information from the system through output modalities, such as speech synthesis, smart graphics and other modalities, suitably combined. A multimodal system therefore has to recognize the inputs from the different modalities and combine them according to temporal and contextual constraints[3] in order to allow their interpretation. This process is known as multimodal fusion, and it has been the subject of several research works from the 1990s onward.[4][5][6][7][8][9][10][11] The fused inputs are interpreted by the system. Naturalness and flexibility can produce more than one interpretation for each modality (channel) and for their simultaneous use, and consequently they can produce multimodal ambiguity,[12] generally due to imprecision, noise or other similar factors. Several methods have been proposed for resolving ambiguities.[13][14][15][16][17][18] Finally, the system returns outputs to the user through the various modal channels (disaggregated), arranged according to a consistent feedback (fission).[19] The pervasive use of mobile devices, sensors and web technologies can offer adequate computational resources to manage the complexity implied by multimodal interaction. "Using cloud for involving shared computational resources in managing the complexity of multimodal interaction represents an opportunity.
In fact, cloud computing allows delivering shared scalable, configurable computing resources that can be dynamically and automatically provisioned and released".[20]
Multimodal input
Two major groups of multimodal interfaces have emerged, one concerned with alternate input methods and the other with combined input/output. The first group of interfaces combines various user input modes beyond the traditional keyboard and mouse input/output, such as speech, pen, touch, manual gestures,[21] gaze and head and body movements.[22] The most common such interface combines a visual modality (e.g. a display, keyboard, and mouse) with a voice modality (speech recognition for input, speech synthesis and recorded audio for output). However, other modalities, such as pen-based input or haptic input/output, may be used. Multimodal user interfaces are a research area in human-computer interaction (HCI). The advantage of multiple input modalities is increased usability: the weaknesses of one modality are offset by the strengths of another. On a mobile device with a small visual interface and keypad, a word may be quite difficult to type but very easy to say (e.g. Poughkeepsie). The same applies to accessing and searching digital media catalogs from such devices or from set-top boxes. In one real-world example, members of a surgical team access patient information verbally in the operating room to maintain an antiseptic environment, and the information is presented aurally and visually in near real time to maximize comprehension. Multimodal input user interfaces have implications for accessibility.[23] A well-designed multimodal application can be used by people with a wide variety of impairments. Visually impaired users rely on the voice modality with some keypad input. Hearing-impaired users rely on the visual modality with some speech input. Other users will be "situationally impaired" (e.g.
wearing gloves in a very noisy environment, driving, or needing to enter a credit card number in a public place) and will simply use the appropriate modalities as desired. Conversely, a multimodal application that requires users to be able to operate all modalities is very poorly designed. The most common form of input multimodality in the market makes use of the XHTML+Voice (aka X+V) Web markup language, an open specification developed by IBM, Motorola, and Opera Software. X+V is currently under consideration by the W3C and combines several W3C Recommendations, including XHTML for visual markup, VoiceXML for voice markup, and XML Events, a standard for integrating XML languages. Multimodal browsers supporting X+V include IBM WebSphere Everyplace Multimodal Environment, Opera for Embedded Linux and Windows, and ACCESS Systems NetFront for Windows Mobile. To develop multimodal applications, software developers may use a software development kit, such as IBM WebSphere Multimodal Toolkit, based on the open source Eclipse framework, which includes an X+V debugger, editor, and simulator.[citation needed]
Multimodal biometrics
Multimodal biometric systems use multiple sensors or biometrics to overcome the limitations of unimodal biometric systems.[24] For instance, iris recognition systems can be compromised by aging irises,[25] and electronic fingerprint recognition can be worsened by worn-out or cut fingerprints. While unimodal biometric systems are limited by the integrity of their identifier, it is unlikely that several unimodal systems will suffer from identical limitations.
Multimodal biometric systems can obtain sets of information from the same marker (i.e., multiple images of an iris, or scans of the same finger) or information from different biometrics (requiring fingerprint scans and, using voice recognition, a spoken passcode).[26][27] Multimodal biometric systems can fuse these unimodal systems sequentially, simultaneously, in a combination thereof, or in series, which correspond to sequential, parallel, hierarchical and serial integration modes, respectively. Fusion of the biometric information can occur at different stages of a recognition system. In feature-level fusion, the data itself or the features extracted from multiple biometrics are fused. Matching-score-level fusion consolidates the scores generated by multiple classifiers pertaining to different modalities. Finally, in decision-level fusion the final results of multiple classifiers are combined via techniques such as majority voting. Feature-level fusion is believed to be more effective than the other levels of fusion because the feature set contains richer information about the input biometric data than the matching score or the output decision of a classifier. Therefore, fusion at the feature level is expected to provide better recognition results.[24] Furthermore, evolving biometric market trends underscore the importance of technological integration, showing a shift towards combining multiple biometric modalities for enhanced security and identity verification, in line with the advancements in multimodal biometric systems.[28] Spoof attacks consist of submitting fake biometric traits to biometric systems, and are a major threat that can curtail their security. Multimodal biometric systems are commonly believed to be intrinsically more robust to spoof attacks, but recent studies[29] have shown that they can be evaded by spoofing even a single biometric trait.
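The matching-score and decision levels of fusion described above can be illustrated with a minimal Python sketch. It is not taken from any cited system: the function names, score ranges, weights and threshold are all hypothetical, and min-max normalization followed by a weighted sum is only one of several possible score-combination rules.

```python
from collections import Counter


def min_max_normalize(score, lo, hi):
    """Map a raw matcher score onto [0, 1] given that matcher's score range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))


def score_level_fusion(scores, ranges, weights, threshold=0.5):
    """Matching-score-level fusion: weighted mean of normalized scores.

    Returns the fused score and whether it reaches the (assumed) threshold.
    """
    total_weight = sum(weights[m] for m in scores)
    fused = sum(
        weights[m] * min_max_normalize(scores[m], *ranges[m]) for m in scores
    ) / total_weight
    return fused, fused >= threshold


def decision_level_fusion(decisions):
    """Decision-level fusion: majority vote over accept/reject decisions.

    A tie is treated as a rejection (an assumption made for this sketch).
    """
    votes = Counter(decisions.values())
    return votes[True] > votes[False]


# Hypothetical raw scores from three unimodal matchers with different scales.
scores = {"fingerprint": 72.0, "iris": 0.31, "voice": 5.5}
ranges = {"fingerprint": (0.0, 100.0), "iris": (0.0, 1.0), "voice": (0.0, 10.0)}
weights = {"fingerprint": 0.5, "iris": 0.3, "voice": 0.2}

fused, accepted = score_level_fusion(scores, ranges, weights)
print(f"fused score = {fused:.3f}, accepted = {accepted}")

decisions = {"fingerprint": True, "iris": False, "voice": True}
print("majority vote accepts:", decision_level_fusion(decisions))
```

In this sketch the weak iris score is offset by the stronger fingerprint and voice scores, which mirrors the argument that several unimodal systems are unlikely to fail in the same way. Feature-level fusion is omitted because it depends on the specific feature extractors used.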
One proposed multimodal biometric system is the cryptosystem involving the face, fingerprint, and palm vein by Prasanalakshmi.[30] The cryptosystem combines biometrics with cryptography: the palm vein acts as a cryptographic key, offering a high level of security since palm veins are unique and difficult to forge. Fingerprint processing involves minutiae extraction (terminations and bifurcations) and matching techniques; its steps include image enhancement, binarization, region of interest (ROI) extraction, and thinning. The face component uses class-based scatter matrices to calculate features for recognition, and the palm vein serves as the cryptographic key, intended to ensure that only the correct user can access the system. The concept of cancelable biometrics allows biometric traits to be altered slightly to protect privacy and prevent theft; if a template is compromised, new variations of the biometric data can be issued. For encryption, the fingerprint template is encrypted with the palm vein key via XOR operations, and the encrypted fingerprint is hidden within the face image using steganographic techniques. During enrollment, the biometric data (fingerprint, palm vein, face) are captured, encrypted, and embedded into a face image. During verification, the system extracts the biometric data and compares it with the stored values. The system was tested with fingerprint databases, achieving 75% verification accuracy at an equal error rate of 25%, with processing times of approximately 50 seconds for enrollment and 22 seconds for verification. The reported advantages are high security due to palm vein encryption, effectiveness against biometric spoofing, and reliability if one biometric fails, owing to the multimodal approach.
The system also has potential for integration with smart cards or on-card systems, enhancing security in personal identification systems.
Multimodal sentiment analysis
Multimodal sentiment analysis extends traditional text-based sentiment analysis to include modalities such as audio and visual data.[31] It can be bimodal, which includes different combinations of two modalities, or trimodal, which incorporates three modalities.[32] With the extensive amount of social media data available online in different forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis,[33][34] which can be applied in the development of virtual assistants,[35] analysis of YouTube movie reviews,[36] analysis of news videos,[37] and emotion recognition (sometimes known as emotion detection) such as depression monitoring,[38] among others. As in traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is sentiment classification, which classifies different sentiments into categories such as positive, negative, or neutral.[39] The complexity of analyzing text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion.[33] The performance of these fusion techniques and of the classification algorithms applied is influenced by the type of textual, audio, and visual features employed in the analysis.[40]
Multimodal language models
Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model trained and created by OpenAI and the fourth in its series of GPT foundation models.[41] It was launched on March 14, 2023,[41] and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot.[42] As a transformer-based model, GPT-4 uses a paradigm where pre-training using both public data
and "data licensed from third-party providers" is used to predict the next token. After this step, the model was fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.[43] Observers reported that the iteration of ChatGPT using GPT-4 was an improvement on the previous iteration based on GPT-3.5, with the caveat that GPT-4 retains some of the problems of earlier revisions.[44] GPT-4, equipped with vision capabilities (GPT-4V),[45] is capable of taking images as input on ChatGPT.[46] OpenAI has not revealed technical details and statistics about GPT-4, such as the precise size of the model.[47]
Multimodal output
The second group of multimodal systems presents users with multimedia displays and multimodal output, primarily in the form of visual and auditory cues. Interface designers have also started to make use of other modalities, such as touch and olfaction. Proposed benefits of multimodal output systems include synergy and redundancy. The information that is presented via several modalities is merged and refers to various aspects of the same process. The use of several modalities for processing exactly the same information provides an increased bandwidth of information transfer.[48][49][50] Currently, multimodal output is used mainly for improving the mapping between communication medium and content and for supporting attention management in data-rich environments where operators face considerable visual attention demands.[51] An important step in multimodal interface design is the creation of natural mappings between modalities and the information and tasks. The auditory channel differs from vision in several aspects. It is omnidirectional, transient and always reserved.[51] Speech output, one form of auditory information, has received considerable attention. Several guidelines have been developed for the use of speech.
Michaelis and Wiggins (1982) suggested that speech output should be used for simple short messages that will not be referred to later. It was also recommended that speech should be generated in time and require an immediate response. The sense of touch was first utilized as a medium for communication in the late 1950s.[52] It is not only a promising but also a unique communication channel. In contrast to vision and hearing, the two traditional senses employed in HCI, the sense of touch is proximal: it senses objects that are in contact with the body, and it is bidirectional in that it supports both perception and acting on the environment. Examples of auditory feedback include auditory icons in computer operating systems indicating users' actions (e.g. deleting a file, opening a folder, an error), speech output for presenting navigational guidance in vehicles, and speech output for warning pilots in modern airplane cockpits. Examples of tactile signals include vibrations of the turn-signal lever to warn drivers of a car in their blind spot, the vibration of the car seat as a warning to drivers, and the stick shaker on modern aircraft alerting pilots to an impending stall.[51] Invisible interface spaces became available using sensor technology. Infrared, ultrasound and cameras are all now commonly used.[53] Transparency of interfacing with content is enhanced provided that an immediate and direct link via meaningful mapping is in place; the user then has direct and immediate feedback to input, and content response becomes interface affordance (Gibson 1979).
Multimodal fusion
The process of integrating information from various input modalities and combining it into a complete command is referred to as multimodal fusion.[5] In the literature, three main approaches to the fusion process have been proposed, according to the main architectural levels (recognition and decision) at which the fusion of the input signals can be performed: recognition-based,[9][10][54] decision-based,[7][8][11][55][56][57][58] and hybrid multi-level fusion.[4][6][59][60][61][62][63][64] Recognition-based fusion (also known as early fusion) consists of merging the outcomes of each modal recognizer by using integration mechanisms such as statistical integration techniques, agent theory, hidden Markov models and artificial neural networks. Examples of recognition-based fusion strategies are action frames,[54] input vectors[9] and slots.[10] Decision-based fusion (also known as late fusion) merges the semantic information that is extracted by using specific dialogue-driven fusion procedures to yield the complete interpretation. Examples of decision-based fusion strategies are typed feature structures,[55][60] melting pots,[57][58] semantic frames,[7][11] and time-stamped lattices.[8] The potential applications of multimodal fusion include learning environments, consumer relations, security/surveillance and computer animation. Individually, modes are easily defined, but difficulty arises in having technology treat them as a combined fusion.[65] It is difficult for algorithms to factor in dimensionality; there exist variables outside of current computational abilities. One example is semantic meaning: two sentences could have the same lexical meaning but different emotional information.[65] In hybrid multi-level fusion, the integration of input modalities is distributed among the recognition and decision levels.
The hybrid multi-level fusion includes the following three methodologies: finite-state transducers,[60] multimodal grammars[6][59][61][62][63][64][66] and dialogue moves.[67]
Ambiguity
Users' actions or commands produce multimodal inputs (multimodal messages[3]), which have to be interpreted by the system. The multimodal message is the medium that enables communication between users and multimodal systems. It is obtained by merging the information conveyed via several modalities, considering the different types of cooperation between the modalities,[68] the time relationships[69] among the modalities involved and the relationships between the chunks of information connected with these modalities.[70] The natural mapping between the multimodal input, which is provided by several interaction modalities (the visual and auditory channels and the sense of touch), and information and tasks implies managing the typical problems of human-human communication, such as ambiguity. An ambiguity arises when more than one interpretation of the input is possible. A multimodal ambiguity[12] arises if an element provided by one modality has more than one interpretation (i.e. ambiguities are propagated at the multimodal level), and/or if the elements connected with each modality are univocally interpreted but the information referred to different modalities is incoherent at the syntactic or semantic level (i.e. a multimodal sentence having different meanings or different syntactic structures). In "The Management of Ambiguities",[14] the methods for solving ambiguities and for providing the correct interpretation of the user's input are organized in three main classes: prevention, a-posterior resolution and approximation resolution methods.[13][15] Prevention methods require users to follow predefined interaction behaviour according to a set of transitions between the different allowed states of the interaction process.
Examples of prevention methods are the procedural method,[71] reduction of the expressive power of the language grammar,[72] and improvement of the expressive power of the language grammar.[73] The a-posterior resolution of ambiguities uses a mediation approach.[16] Examples of mediation techniques are repetition (e.g. repetition by modality,[16] granularity of repair[74] and undo[17]) and choice.[18] The approximation resolution methods do not require any user involvement in the disambiguation process. They can all require the use of some theories, such as fuzzy logic, Markov random fields, Bayesian networks and hidden Markov models.[13][15]
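As an illustration of an approximation resolution method, the following Python sketch picks the most probable interpretation of an ambiguous speech-plus-gesture input with Bayes' rule, without asking the user. It is a simplified stand-in for the Bayesian techniques mentioned above rather than a method from the cited works: the function name, the candidate interpretations and all probabilities are hypothetical, and the two modalities are assumed to be conditionally independent given the intended interpretation.

```python
def resolve_ambiguity(candidates, speech_likelihood, gesture_likelihood, prior=None):
    """Return (best_interpretation, posterior) for an ambiguous multimodal input.

    Assumes speech and gesture evidence are conditionally independent given
    the intended interpretation (a simplifying assumption of this sketch).
    """
    if prior is None:
        # With no other knowledge, treat every interpretation as equally likely.
        prior = {c: 1.0 / len(candidates) for c in candidates}
    unnormalized = {
        c: prior[c] * speech_likelihood.get(c, 0.0) * gesture_likelihood.get(c, 0.0)
        for c in candidates
    }
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("no candidate is compatible with the observed input")
    posterior = {c: p / evidence for c, p in unnormalized.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# The user says "delete this" while pointing between two icons. The speech
# recognizer cannot tell which file is meant; the pointing gesture is closer
# to file_b. All numbers below are made up for illustration.
candidates = ["delete file_a", "delete file_b", "select file_a"]
speech_likelihood = {"delete file_a": 0.45, "delete file_b": 0.45, "select file_a": 0.10}
gesture_likelihood = {"delete file_a": 0.20, "delete file_b": 0.70, "select file_a": 0.20}

best, posterior = resolve_ambiguity(candidates, speech_likelihood, gesture_likelihood)
print(best)
for interpretation, p in posterior.items():
    print(f"{interpretation}: {p:.3f}")
```

Here neither modality is decisive on its own, but their combination favours one interpretation. A deployed system could fall back on a mediation technique such as choice when the highest posterior is below some confidence level.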