OpenAI is wrong: they do NOT support over 90 languages with their whisper module

Photo by saeed karimi on Unsplash

The claim made by OpenAI regarding their whisper module supporting over 90 languages for AI-based open-source speech-to-text is, unfortunately, NOT accurate. This article delves into the accuracy issues of the whisper module across these languages and presents the findings of a survey covering 98 languages utilizing the whisper module.

A conversation between Dr. Fatih Sen and ChatGPT

Some Technical Details of Whisper

Whisper comes in several sizes: tiny (39 million parameters), base (74 million parameters), small (244 million parameters), medium (769 million parameters), and large (1.5 billion parameters). This survey study was conducted on the base size of the Whisper module.

For those curious about setting up Whisper on their local environments, simply execute the following command (assuming Python is already installed on your system).

pip install openai-whisper

Note that this survey was conducted on ToText version 1.2 running on Python 3.10.2. ToText has the openai-whisper Python module (version 20230308) installed and uses Whisper’s “base” size. Here is a code sample of how to use the model:

import whisper

# _audio_file_path and _lang are set elsewhere (the audio file and its language)
model = whisper.load_model("base")
result = model.transcribe(_audio_file_path, language=_lang)
print(result["segments"])
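
The `segments` entry returned by `transcribe` is a list of dictionaries that carry, among other fields, `start` and `end` (in seconds) and `text`. As a minimal sketch, and not ToText's actual code, the hypothetical helpers below turn such a list into timestamped lines. The sample segments are made up for illustration and are not real Whisper output.

```python
def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: list) -> list:
    """Turn Whisper-style segments into '[start -> end] text' lines."""
    lines = []
    for segment in segments:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


# Made-up segments shaped like result["segments"] (hypothetical data).
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 65.04, "text": " This is a test."},
]

for line in format_segments(sample_segments):
    print(line)
```

With a real transcription, `format_segments(result["segments"])` would be called instead of using the sample list.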

For more technical details, please check out this GitHub page.

Navigating Challenges of ToText’s Language Accuracy

In 2023, Fatih Sen, a computer science professor, developed ToText, a free, online, AI-based transcription service. He used OpenAI’s open-source Whisper module for the transcription and included 99 languages in the ToText platform.

After the ToText platform was launched, he began receiving unfavorable and unpleasant feedback from users. Certain languages were functioning poorly, while others weren’t functioning at all. This got him thinking. It wasn’t ideal to inform users that his system supported up to 99 languages when many of them weren’t functioning properly. He reached a point where each language needed thorough testing.

ToText (version 1.1) with 99 languages integrated

As a solo entrepreneur juggling a full-time job, he knew that conducting such an experiment required more than one person’s effort; it needed teamwork. That’s when he had a lightbulb moment: why not propose this experiment to the capstone class students? This wasn’t his first time suggesting and supervising a capstone project at the computer science department where he works; he had done it before. He proposed the idea, and fortunately ToText was chosen by a team in the capstone class. It was time to guide the capstone students who took on this project and begin testing all those languages.

A capstone project is a senior-level project course that allows students to solve a real-world problem towards the end of their major. The course instructors assign projects involving collaborations with organizations or companies seeking solutions to real problems. Instead of just reading textbooks or taking exams, students team up with real people or companies facing real challenges. This hands-on experience gives them priceless industry insights and prepares them for the real world before they graduate.

Let’s Get Started: Conducting a Comprehensive Survey Study

Using the ToText platform, the capstone team conducted a survey study with native speakers of 98 languages in person as well as through Zoom or FaceTime.

They interviewed 2 to 3 people for each language. Scores ranged from 0 (makes no sense at all) to 5 (perfect transcription). The team transcribed various famous phrases, songs, and Bible verses in each native language and then asked the native speakers to score the transcriptions out of 5. The final score of each language is the average of its scores. Note that the team could not find any native speakers for 12 of the languages (see Table 1) in the first round. Also, Hindi transcription does not work at all: instead of transcribing, the Whisper module translates the speech in the video into English.
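
The scoring scheme described above can be sketched as follows. This is an illustration under assumptions, not the capstone team's actual tooling: the function name `final_score` is hypothetical, and a round without a native speaker is represented as `None` and skipped, which matches how Table 1 reports such languages.

```python
from statistics import mean
from typing import Optional


def final_score(round_scores: list) -> Optional[float]:
    """Average the rounds that produced a score; return None if none did."""
    available = [score for score in round_scores if score is not None]
    if not available:
        return None
    return mean(available)


# Round scores taken from Table 1.
print("English:", final_score([3.5, 4]))
print("Javanese:", final_score([None, 4]))  # no native speaker in round 1
print("Czech:", final_score([0, 1]))
```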

Survey Results

The capstone team obtained the following results after conducting the survey for the 98 languages:

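| Language | Round 1 score (out of 5) | Round 2 score (out of 5) | Average |
| --- | --- | --- | --- |
| Polish | 5 | 5 | 5 |
| Russian | 5 | 5 | 5 |
| Galician | 4 | 4 | 4 |
| Lithuanian | 4 | 4 | 4 |
| Norwegian | 4 | 4 | 4 |
| Nynorsk | 4 | 4 | 4 |
| Portuguese | 4 | 4 | 4 |
| Romanian | 4 | 4 | 4 |
| Slovak | 4 | 4 | 4 |
| Spanish | 4 | 4 | 4 |
| Javanese | could not find any native speaker | 4 | 4 |
| Sindhi | could not find any native speaker | 4 | 4 |
| English | 3.5 | 4 | 3.75 |
| Swedish | 3.5 | 4 | 3.75 |
| Bulgarian | 3 | 4 | 3.5 |
| French | 3 | 4 | 3.5 |
| Hebrew | 3 | 4 | 3.5 |
| Italian | 3 | 4 | 3.5 |
| Azerbaijani | 3 | 3.5 | 3.25 |
| Indonesian | 3 | 3.5 | 3.25 |
| Japanese | 3 | 3.5 | 3.25 |
| Bengali | 3 | 3 | 3 |
| Korean | 3 | 3 | 3 |
| Macedonian | 3 | 3 | 3 |
| Malay | 3 | 3 | 3 |
| Serbian | 3 | 3 | 3 |
| Ukrainian | 3 | 3 | 3 |
| German | 2.5 | 3 | 2.75 |
| Arabic | 2 | 3.5 | 2.75 |
| Slovenian | 2 | 3 | 2.5 |
| Thai | 2 | 3 | 2.5 |
| Turkish | 2 | 3 | 2.5 |
| Vietnamese | 2 | 3 | 2.5 |
| Dutch | 2 | 3 | 2.5 |
| Welsh | 2 | 2.5 | 2.25 |
| Finnish | 2 | 2.5 | 2.25 |
| Hungarian | 2 | 2.5 | 2.25 |
| Greek | 2 | 2 | 2 |
| Kazakh | 2 | 2 | 2 |
| Lao | 2 | 2 | 2 |
| Latin | 2 | 2 | 2 |
| Maori | 2 | 2 | 2 |
| Persian | 2 | 2 | 2 |
| Shona | 2 | 2 | 2 |
| Telugu | 2 | 2 | 2 |
| Turkmen | 2 | 2 | 2 |
| Urdu | 2 | 2 | 2 |
| Bosnian | 2 | 2 | 2 |
| Icelandic | 2 | 2 | 2 |
| Bashkir | could not find any native speaker | 2 | 2 |
| Tagalog | could not find any native speaker | 2 | 2 |
| Belarusian | 2 | 1 | 1.5 |
| Chinese | 2 | 1 | 1.5 |
| Albanian | 1 | 2 | 1.5 |
| Armenian | 1 | 2 | 1.5 |
| Catalan | 1 | 2 | 1.5 |
| Croatian | 1 | 2 | 1.5 |
| Danish | 1 | 2 | 1.5 |
| Luxembourgish | 1 | 2 | 1.5 |
| Punjabi | 1 | 2 | 1.5 |
| Tamil | 1 | 2 | 1.5 |
| Amharic | 1 | 1 | 1 |
| Basque | 1 | 1 | 1 |
| Breton | 1 | 1 | 1 |
| Faroese | 1 | 1 | 1 |
| Gujarati | 1 | 1 | 1 |
| Haitian Creole | 1 | 1 | 1 |
| Hausa | 1 | 1 | 1 |
| Latvian | 1 | 1 | 1 |
| Lingala | 1 | 1 | 1 |
| Maltese | 1 | 1 | 1 |
| Mongolian | 1 | 1 | 1 |
| Myanmar | 1 | 1 | 1 |
| Pashto | 1 | 1 | 1 |
| Sanskrit | 1 | 1 | 1 |
| Sinhala | 1 | 1 | 1 |
| Somali | 1 | 1 | 1 |
| Sundanese | 1 | 1 | 1 |
| Tajik | 1 | 1 | 1 |
| Yiddish | 1 | 1 | 1 |
| Hawaiian | could not find any native speaker | 1 | 1 |
| Kannada | could not find any native speaker | 1 | 1 |
| Khmer | could not find any native speaker | 1 | 1 |
| Tibetan | could not find any native speaker | 1 | 1 |
| Czech | 0 | 1 | 0.5 |
| Nepali | 0 | 1 | 0.5 |
| Afrikaans | 0 | 0 | 0 |
| Assamese | 0 | 0 | 0 |
| Estonian | 0 | 0 | 0 |
| Georgian | 0 | 0 | 0 |
| Swahili | 0 | 0 | 0 |
| Uzbek | 0 | 0 | 0 |
| Yoruba | 0 | 0 | 0 |
| Malayalam | could not find any native speaker | 0 | 0 |
| Marathi | could not find any native speaker | 0 | 0 |
| Occitan | could not find any native speaker | 0 | 0 |
| Tatar | could not find any native speaker | 0 | 0 |
| Hindi | translates to English | translates to English | N/A |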
Table 1: Survey results of 98 languages transcribed with the whisper module

Below, you can see a graph of survey results of 98 languages using the whisper module. The average scores from Table 1 were used for this graph.

The team had two rounds. In the first round, they interviewed only a single native speaker for three different videos. In the second round, they interviewed multiple native speakers per language. Here is the summary of survey results.

  • 2 languages had an average score of 5, which is excellent (perfect transcription).
  • 10 languages had an average score of 4, which is very good (very correct transcription).
  • 15 languages had an average score of at least 3 but below 4, which is good (correct transcription).
  • 24 languages had an average score of at least 2 but below 3, which is average (medium transcription).
  • 33 languages had an average score of at least 1 but below 2, meaning the transcriptions were minimally correct (poor transcription).
  • 13 languages had an average score below 1, meaning the transcriptions made no sense at all (terrible transcription).
  • 1 language (Hindi) was not transcribed at all but translated into English instead.
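
The grouping in this summary can be reproduced with a short sketch. The tally below counts how many languages in Table 1 share each average score (Hindi is left out because it has no numeric score). The band boundaries are read off the summary as half-open ranges, which is an interpretation, and the names `band` and `count_by_band` are hypothetical.

```python
# Number of languages at each average score, tallied from Table 1.
LANGUAGES_PER_AVERAGE = {
    5: 2, 4: 10, 3.75: 2, 3.5: 4, 3.25: 3, 3: 6,
    2.75: 2, 2.5: 5, 2.25: 3, 2: 14, 1.5: 10, 1: 23,
    0.5: 2, 0: 11,
}


def band(average: float) -> str:
    """Map an average score to the quality band used in the summary."""
    if average >= 5:
        return "excellent"
    if average >= 4:
        return "very good"
    if average >= 3:
        return "good"
    if average >= 2:
        return "average"
    if average >= 1:
        return "poor"
    return "terrible"


def count_by_band(tally: dict) -> dict:
    """Sum the language counts that fall into each band."""
    counts = {}
    for average, n_languages in tally.items():
        name = band(average)
        counts[name] = counts.get(name, 0) + n_languages
    return counts


band_counts = count_by_band(LANGUAGES_PER_AVERAGE)
for name, n_languages in band_counts.items():
    print(f"{name}: {n_languages}")
```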

Language-Specific Insights

Polish and Russian: It was surprising to see these two languages perform best of all. Their relatively high scores could indicate a stronger presence in the training data or more effective training on their specific linguistic patterns. For Polish, the comparatively small number of speakers may mean that the language has few dialects, so it is easier to capture the language as a whole. For Russian, it is likely that one main way of speaking reaches the internet, so the model does not have to account for more unorthodox dialects or accents. It’s worth exploring why these languages yield better results so that similar strategies can potentially be applied to the underperforming languages.

English and Chinese: These languages are among the most widely spoken globally, so the dataset likely contains a diverse range of accents and dialects. “Chinese” is not even a single language, since Mandarin and Cantonese are both commonly labeled “Chinese”. Given the variety of the data in the dataset, it makes sense that these transcriptions are not perfect, and the poor Chinese transcription is even less surprising considering that the label is such an inaccurate generalization.

Galician, Lithuanian, Nynorsk, Javanese, Sindhi: It’s remarkable how these less-known and uncommon languages scored highly. They even surpassed widely used languages like English, German, and French. However, it’s worth noting that native speakers for Javanese and Sindhi were not available in the initial round. Therefore, it’s prudent to approach these two languages with caution, as their scores might have been lower with native speaker input in the first round.

Czech, Nepali, Afrikaans, Assamese, Estonian, Uzbek,… Hindi: The exceptionally low scores (below 1) for these languages suggest that Whisper’s transcription algorithms may not capture the nuances of these linguistic structures, or that the audio quality in the test cases was insufficient for accurate transcription. It’s critical to investigate whether these languages suffer from systemic issues within the transcription service, such as poor recognition of accents or dialects. Hindi does not work with the model at all: when the module is asked to transcribe Hindi speech, it translates it into English instead.

Refining ToText: Enhancing Transcription Accuracy

Following the survey results, 48 languages, including Malagasy which wasn’t tested by the capstone team, were removed from ToText. As of May 2024, only 51 languages scoring 2 or higher were retained for user access. Additionally, the claim “…with over 95% accuracy.” was removed from the main page. This change was made due to the difficulty in defining a specific accuracy metric given the platform’s support for multiple languages.
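
The retention rule described above (keep a language only if its average score is 2 or higher) can be sketched as a simple filter. This is an illustration, not ToText's actual code: the function name `split_by_threshold` is hypothetical, the sample contains only a handful of rows from Table 1, and `None` stands for Hindi, which has no numeric score.

```python
# A few rows from Table 1; None marks Hindi (translated, not transcribed).
SAMPLE_AVERAGES = {
    "Polish": 5,
    "English": 3.75,
    "Greek": 2,
    "Chinese": 1.5,
    "Czech": 0.5,
    "Hindi": None,
}


def split_by_threshold(averages: dict, threshold: float = 2.0) -> tuple:
    """Return (kept, removed) language lists, each sorted alphabetically."""
    kept = sorted(
        language
        for language, score in averages.items()
        if score is not None and score >= threshold
    )
    removed = sorted(language for language in averages if language not in kept)
    return kept, removed


kept, removed = split_by_threshold(SAMPLE_AVERAGES)
print("kept:", kept)
print("removed:", removed)
```

Applied to the full table, the same rule yields the 51 retained languages mentioned above.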

ToText could potentially improve its accuracy by moving from the “base” size to the “large” size. The experiment reported here used only the base size, so a separate survey study would be needed to measure how much larger sizes improve accuracy across different languages. However, the “large” size requires a minimum of 10 GB of memory and could slow down transcription. At this stage, it’s advisable for ToText to continue with the base model to avoid the costs of hardware upgrades, increased memory usage, and potential performance issues. Nonetheless, such a transition could be a positive step for ToText in the future as it works to provide better transcription with a larger, more capable model.
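
The trade-off between model size and memory can be made concrete with a small sketch that picks the largest Whisper size fitting a given memory budget. The parameter counts are the ones quoted earlier in this article. The memory figures are approximate values as listed in the Whisper README and should be treated as rough assumptions, apart from the 10 GB for “large” mentioned above. The function name is hypothetical.

```python
from typing import Optional

# (size name, parameters in millions, approximate memory needed in GB).
# Memory figures are approximate assumptions, not measurements.
MODEL_SIZES = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1500, 10),
]


def largest_model_that_fits(memory_gb: float) -> Optional[str]:
    """Return the largest size whose memory need fits the budget, if any."""
    fitting = [name for name, _params, needed_gb in MODEL_SIZES
               if needed_gb <= memory_gb]
    # MODEL_SIZES is ordered from smallest to largest, so the last fit wins.
    return fitting[-1] if fitting else None


for budget_gb in (1, 4, 10):
    print(f"{budget_gb} GB -> {largest_model_that_fits(budget_gb)}")
```

The returned name could then be passed to `whisper.load_model`, keeping in mind that accuracy gains from a larger size would still need to be verified per language.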

A different approach to improving transcription accuracy is to explore speech-to-text AI services offered by companies other than OpenAI. Companies like Amazon, Google, IBM, and others provide such services, which could be worth considering for future use by ToText.

At present, ToText is freely accessible to all users. If the ToText team intends to monetize the platform, one strategy to explore is integrating human involvement. While AI technology continues to advance, human input remains crucial for accurate transcription. Therefore, as part of future developments, the ToText team could explore incorporating human assistance alongside AI capabilities. This approach can enhance transcription accuracy and potentially generate substantial revenue for the platform.

ToText (version 1.2) with 51 languages integrated

Conclusions

The variance in scores across different languages points to a significant inconsistency in Whisper’s performance. This inconsistency could stem from varied dataset sizes, the complexity of phonetic and grammatical structures, or differing levels of commercial and research focus within the AI development community for these languages.

Whisper (base size) is a good tool for homogeneous languages, especially the Romance languages, also known as the Latin or Neo-Latin languages. For languages that are not derived from Latin or do not use a similar alphabet, the model often returns only a phonetic transcription, which is much less useful. It is possible that some tweaking is needed so that the model has a better definition of what a transcription actually is. Whisper is fine for personal use for most people who reside in a Western country, but for larger-scale projects it would need a lot of work, as it is not perfect even for the Romance languages.

These results could help OpenAI improve the Whisper module and offer better transcription, especially for the low-performing languages.

The demo day of the capstone class, held in April 2024 at the University of Memphis.
From left to right: Dr. Amy Cook (capstone course instructor), Elijah Lewis (capstone course student), Kamaal Orgard (capstone course student), Jaylon Taylor (capstone course student), Dr. Fatih Sen (owner of ToText, comp sci instructor and capstone project supervisor), Dr. Brandon Booth (capstone course instructor).
