NAVER Unveils HyperCLOVA X–based Image and Speech Processing Technology, Advancing to “Multimodal Generative AI” | NAVER Corp.

2024.08.22

- From inferring situations in photos to analyzing tables and graphs and solving math problems involving shapes, the update expands the scope of CLOVA X as a productivity enhancement tool

- HyperCLOVA X–based voice multimodal technology, featuring natural conversation powered by a large language model, has also been introduced on the Tech Blog

NAVER’s conversational AI agent, CLOVA X, will add visual information processing capabilities through a service update on August 27. In addition, NAVER unveiled generative AI–based speech synthesis technology on the Tech Blog of CLOVA’s official website on August 20. NAVER is strengthening its competitiveness in generative AI by upgrading its HyperCLOVA X model to a “multimodal” AI that can process text, images, and voice together.

From inferring situations in photos to analyzing tables and graphs, recognizing products, and explaining their contents: expanding the scope of CLOVA X as a productivity enhancement tool

With the updated image understanding feature, users can converse with the AI based on information extracted from the images they upload to the CLOVA X conversation window and the queries they enter. CLOVA X can perform various tasks, such as describing what is shown in a photo or inferring the situation it depicts. It can also understand and analyze tables and graphs provided as images. CLOVA X is expected to be used for logical writing, code writing, translation, and other tasks, and its image understanding ability should broaden its use as a productivity enhancement tool.

In particular, NAVER’s know-how in AI-based document processing and character recognition, combined with HyperCLOVA X, a large language model (LLM) knowledgeable in various fields, will provide more accurate and reliable services. When given 1,480 questions from the Republic of Korea GED exam in image form, CLOVA X answered about 84% correctly, higher than the 78% achieved by OpenAI’s GPT-4o.

HyperCLOVA X–based voice multimodal technology introduced on the Tech Blog: natural conversation powered by a large language model

NAVER also unveiled its HyperCLOVA X–based voice AI technology on the Tech Blog of CLOVA’s official website on August 20. More advanced than existing speech recognition and speech synthesis technology, the model uses an LLM’s strong contextual understanding and instruction interpretation capabilities to improve language structure, pronunciation accuracy, and emotional expression.

NAVER, which has proven its technological competitiveness with various voice AI services such as “CLOVA Note” for voice recording, “CLOVA CareCall” for AI call support for the elderly, and “CLOVA Dubbing” for AI voice synthesis, is looking to provide more convenient services through its voice multimodal LLM technology. On its Tech Blog, NAVER presented the possibility of combining various services with a voice multimodal LLM, such as real-time voice translation, language learning, and counseling.

“HyperCLOVA X, which started as an LLM, is evolving into a large vision language model with image understanding capabilities and, finally, a voice multimodal LLM,” said Sung Nako, Head of Hyperscale AI at NAVER CLOUD. “We will introduce HyperCLOVA X’s advanced capabilities to various NAVER services, including CLOVA X, a conversational AI agent, to create new user value and offer it as an enterprise AI solution, further expanding the HyperCLOVA X ecosystem.”

Meanwhile, NAVER will actively practice “AI safety” as it upgrades HyperCLOVA X to a multimodal LLM and applies it to its services. Building on its AI Safety Framework (ASF), which was unveiled in June and evaluates the potential risks of AI systems, NAVER plans to keep reviewing voice AI technology in particular so that it can provide safer services.
