Saturday, February 08, 2025

BusinessDay

How Azeez, a Unilag student, built a Nigerian-accent AI text-to-speech model

Saheed Azeez, a University of Lagos student who made a name for himself and the country by creating a 230-million-token GPT-2 dataset, has built an artificial intelligence (AI) text-to-speech model with a Nigerian accent.

According to a Techpoint report, Azeez had earlier in 2024 created Naijaweb, a dataset of 230 million GPT-2 tokens based on Nairaland. Now, with his new passion project, YarnGPT, he has pushed his skills further: a text-to-speech AI model that can read text aloud in a Nigerian accent.

In a world where AI can generate lifelike voices in seconds, a text-to-speech model with a Nigerian accent might not seem revolutionary at first.

However, considering that Azeez is a university student with limited resources, and that building a model that accurately captures the nuances of a Nigerian accent is technically challenging, it is a remarkable feat.

Azeez, speaking about the project after the success of Naijaweb, said: “The amount of conversations and interest people had in Naijaweb was a great motivation. Imagine getting featured on Techpoint Africa; it motivated me to do this.”

He was also motivated by failure: before starting YarnGPT, he had applied for a job at a Nigerian AI company but did not perform as well in the interview as he had expected.

YarnGPT became the project that would help him improve his skills and increase his chances of securing such roles in the future.

Building an AI model that sounds Nigerian required gathering a vast amount of Nigerian voices.

“I used some movies that were available online. I extracted their audio and subtitles. The problem with building in Nigeria is data. Replicating what has been built overseas isn’t that hard, but data always gets in the way,” he explained.

For instance, Nollywood produces over 2,500 movies a year, and with many filmmakers uploading their work to YouTube, one would expect him to have plenty of data to work with. The opposite turned out to be the case.

While there were thousands of movies to choose from, the audio wasn’t up to the standard he wanted, and the subtitles were inaccurate. To compensate, Azeez turned to Hugging Face, an open-source platform for machine learning and data science.

He combined the audio from Nigerian movies with high-quality datasets from Hugging Face to train his model.

The next step was training the AI model, but without access to his own GPU, he had to rely on cloud computing services like Google Colab. This cost him $50 (₦80,000), a significant amount for a university student. Unfortunately, it was a waste.

“The model I built wasn’t working well, and the $50 cloud credit was burnt just like that. It was painful for me.”

Determined to find another way, he discovered Oute AI, a platform that had developed a text-to-speech model in an autoregressive manner.

“The way the model works is, you give it a piece of text, and it predicts one word at a time. It takes that word, adds it back to the text, then predicts the next one — kind of like how ChatGPT completes sentences. That’s what makes it autoregressive.”
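The loop Azeez describes can be sketched in a few lines. This is an illustrative toy, not YarnGPT’s actual code: `predict_next` is a made-up stand-in for the trained model, which in reality predicts tokens from learned probabilities.

```python
# A minimal sketch of autoregressive generation as described above.
# `predict_next` is a hypothetical stand-in for a real trained model.

def predict_next(tokens):
    """Toy 'model': always continues with a fixed sentence."""
    sentence = ["how", "far", "my", "people", "<end>"]
    return sentence[min(len(tokens) - 1, len(sentence) - 1)]

def generate(prompt, max_steps=10):
    tokens = list(prompt)
    for _ in range(max_steps):
        nxt = predict_next(tokens)   # predict one token from everything so far
        if nxt == "<end>":
            break
        tokens.append(nxt)           # feed it back in, then predict again
    return tokens

print(generate(["yarn:"]))  # ['yarn:', 'how', 'far', 'my', 'people']
```

Each new token is appended to the running sequence and fed back to the model, which is exactly what makes the process autoregressive.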

While the autoregressive framework can be difficult to grasp, Azeez pointed out that it simply gave him better results.

Oute AI provided a structure, but Azeez still had to build his own model. He took a language model called SmolLM2-360M from Hugging Face and added speech functionality to it, a process that involved major algorithmic changes.

After this, the final-year Mechanical Engineering student at the University of Lagos had to spend another $50 to train the model. The training took three days.

Interestingly, as he pointed out when he created Naijaweb, AI models need data to be tokenised. Large language models (LLMs) understand numbers, not words, so tokenisation converts words into numerical representations.

“If we were to tokenise the word CALCULATED, for example, we could split it into four tokens: CAL-CU-LA-TED. A number is assigned to each token.”
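The CALCULATED example can be shown directly in code. The splits and ID numbers below are invented for illustration; a real tokeniser (such as a byte-pair-encoding tokeniser) learns both from data.

```python
# Illustrating the CALCULATED example: split a word into sub-word
# tokens, then assign each token a number. Vocabulary is made up.

vocab = {"CAL": 0, "CU": 1, "LA": 2, "TED": 3}

def tokenise(word, pieces):
    assert "".join(pieces) == word      # the pieces must rebuild the word
    return [vocab[p] for p in pieces]   # words become numbers

ids = tokenise("CALCULATED", ["CAL", "CU", "LA", "TED"])
print(ids)  # [0, 1, 2, 3]
```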

Tokenising audio, meanwhile, means breaking down continuous sound waves into smaller, manageable pieces that a model can understand and process.

Unlike text, which has clear breaks between words, audio is continuous; there are no natural pauses in a raw waveform.

“So, the model needs to convert the sound into a sequence of discrete values, kind of like turning a long speech into tiny puzzle pieces. These smaller audio tokens can then be used to train the AI, and later, the model can reassemble them to generate speech that sounds natural.”
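The idea of turning continuous sound into discrete tokens can be illustrated with simple quantisation: map each sample to one of a handful of integer bins, then reassemble an approximate waveform from the bin centres. Real audio tokenisers (neural codecs like the one Azeez used) are far more sophisticated, so treat this purely as a toy analogy.

```python
# Toy illustration: discretise a waveform into integer tokens
# by quantising sample values in [-1.0, 1.0] into a few bins,
# then reconstruct an approximate waveform from those tokens.

def quantise(waveform, n_bins=4):
    """Map each sample in [-1.0, 1.0] to an integer token 0..n_bins-1."""
    tokens = []
    for s in waveform:
        bin_idx = int((s + 1.0) / 2.0 * n_bins)   # scale into [0, n_bins]
        tokens.append(min(bin_idx, n_bins - 1))   # clamp the s == 1.0 edge
    return tokens

def dequantise(tokens, n_bins=4):
    """Reassemble an approximate waveform from tokens (bin centres)."""
    return [(t + 0.5) / n_bins * 2.0 - 1.0 for t in tokens]

wave = [-1.0, -0.3, 0.0, 0.4, 1.0]
toks = quantise(wave)
print(toks)  # [0, 1, 2, 2, 3]
```

The reconstruction is lossy, which is why real systems learn their token codebooks rather than using fixed bins.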

This entire process was made possible by a wave tokenizer. Using resources from Hugging Face, Oute AI, and other Nigerian repositories, Azeez was able to create YarnGPT.

Charles Ogwo, Head of the Education Desk at BusinessDay Media, is a seasoned, proactive journalist with over a decade of reporting experience.
