The dreams of a Korean ChatGPT: The present and the future

In May 2024, OpenAI once again stunned the world by announcing its latest innovation, GPT-4o. This LLM (Large Language Model) distinguishes itself from its predecessor by being “multimodal”: it is no longer limited to text but seamlessly mixes audio and images into conversation. But while innovations in chatbots like ChatGPT and Google Gemini receive most of our attention, a pattern emerges: the LLMs we use today are dominated by English. Although these English-based models do support Korean, they have long had a reputation for performing comparatively worse in non-English languages. This is perhaps one reason a Korean-based chatbot service would be more useful to Korean-speaking users. So what are some of the Korean-based LLMs in development today? And how well do they fare in the ever-changing world of AI?

NAVER’s HyperClova X

Perhaps the most prominent Korean LLM in development is NAVER’s HyperClova X, which is featured in the company’s chatbot service CLOVA X and its AI-assisted search engine Cue.

The CLOVA X service, released in August 2023, is currently in open beta, so anyone with a NAVER account can try it out. NAVER claims that the model was trained on a dataset of Korean content 6,500 times larger than the one used for ChatGPT, making the LLM much more fluent in Korean.

Since its release, the model has received positive feedback for its ability to fetch real-time data and cite sources for it. CLOVA X also lets users select which “skills” may be used in the AI’s replies, a way of integrating the AI with different services, such as NAVER’s Shopping and Travel sections. Users have also pointed out that CLOVA X performs better than ChatGPT at providing information specific to Korea and at answering questions asked in informal Korean.

However, it has drawn some criticism, particularly regarding the low accuracy of its replies. In September 2023, SBS News reported that the model failed to provide accurate data when asked about the number of seats in the Korean National Assembly and gave vague answers, especially when asked for its opinion. As of May 2024, CLOVA X still seems to struggle with basic arithmetic; when asked to “calculate 1 – 2 + 3 – 4,” it gave the wrong answer 0 in multiple conversations (the correct result is –2), though it did correct itself when asked to try again. When questioned about recent events, such as “the ICJ’s ruling on Israel’s Rafah offensive,” it gave a correct summary citing a news source, but erroneously gave the year of the decision as 2023.

Models in Development

Other Korean companies have also proposed their own LLMs. Kakao is reported to be working on KoGPT 2.0, a model with a maximum of 66 billion parameters. Its reveal was planned for October 2023 but was ultimately postponed. NCSoft is reported to be working on VARCO, a model with a maximum of 13 billion parameters.

Foreign Competition

One problem these emerging Korean-focused LLMs face is that foreign models, such as GPT, are also steadily getting better at understanding Korean. An official report by OpenAI showed that GPT-4 reaches an accuracy of 77% on the MMLU benchmark in Korean, compared to 85.5% in English. While the Korean figure is comparatively low, it is still well above the English accuracy of the predecessor GPT-3.5, which was 70.1%. This suggests that the capabilities of GPT are improving regardless of language. GPT-4o was also announced with a new tokenizer (an algorithm that splits text into smaller parts the AI can process) that uses fewer tokens for certain languages, especially those not based on the Latin alphabet. This makes replies in non-English languages, including Korean, cheaper and faster to produce.
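
To make the tokenizer point more concrete, here is a minimal Python sketch. It is not OpenAI's tokenizer: it is a toy greedy longest-match tokenizer, and both vocabularies (`small_vocab`, `large_vocab`) are hypothetical and invented purely for illustration. Any character missing from the vocabulary falls back to one token per UTF-8 byte, which is roughly why text in non-Latin scripts can become expensive when a vocabulary covers it poorly.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with a UTF-8 byte fallback."""
    tokens = []
    max_len = max((len(entry) for entry in vocab), default=0)
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No entry matched: emit one token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies, made up for this illustration only.
small_vocab = {"hello", " ", "world"}
large_vocab = small_vocab | {"안녕", "하세요"}

english_tokens = tokenize("hello world", small_vocab)
korean_tokens_small = tokenize("안녕하세요", small_vocab)
korean_tokens_large = tokenize("안녕하세요", large_vocab)

print(len(english_tokens), len(korean_tokens_small), len(korean_tokens_large))
# prints: 3 15 2
```

In this toy setup, the same five-syllable Korean greeting costs 15 tokens when the vocabulary has no Korean entries and only 2 once it does. Since chatbot services typically bill and generate output per token, a vocabulary that covers a language better tends to translate into cheaper and faster replies in that language.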

So, if foreign LLMs can understand the Korean language well—possibly even better than the Korean models—is developing those Korean models worth it?

The Korean LLMs’ Edge

A paper by Son et al., published in February 2024, documents the testing of 26 different LLMs using a newly proposed, Korea-specific benchmark: KMMLU. Each model was asked 35,030 questions across 45 subjects. Unlike the more popular benchmark MMLU, which the authors described as containing U.S.-centric questions, KMMLU includes questions that require “an understanding of Korean cultural practices, societal norms, and legal frameworks.”

Among the models tested, GPT-4 performed the best overall, achieving 59.95% accuracy, followed by HyperClova X at 53.40% and Gemini Pro at 50.18%. However, HyperClova X performed the best on questions requiring Korea-specific knowledge, with an accuracy of 55.21%, followed by GPT-4 at 54.89%. The authors noted that models trained specifically on Korean text, like HyperClova X, consistently outperformed their counterparts on Korea-specific questions.

These results show that even though HyperClova X performed worse than GPT-4 overall, it was still able to answer questions about Korea more accurately than the foreign models. Being a Korean-based LLM seems to have given it an edge in its knowledge of Korean culture and society.

It is true that Korean-focused LLMs are currently not as popular as their English-based counterparts. In the future, as “big models” like GPT-4 improve their support for different languages, building a model that is more fluent in a specific language than the popular models may not be an easy task. Even so, one advantage such an LLM may bring, as the benchmark testing suggests, is that it is trained on local data, which better reflects the social context of the region.

In an interview with The Korea Economic Daily in 2023, NAVER CEO Choi Soo-Yeon stated, “What I have learned from operating a search engine business is that even when one searches the same word, the results in Korea, the US, and Japan may all be different […] Therefore, the key to developing a successful model would be to consider not only the language of each country but also its culture.” This aligns with the observations made by the authors of the paper: NAVER’s HyperClova X was able to generate more accurate answers to questions about Korean subjects such as law, history, and culture. For those living in Korea, this is a significant advantage of Korean LLMs over the foreign ones that are popular today. This deeper understanding of Korea, perhaps more important than fluency in the Korean language, could be what gives Korean LLMs a lasting edge against their globally dominant counterparts.

Choe Sung-min