Translating Text in Python

Posted on Fri 29 May 2020 in Data Science • 9 min read

Massive thank you to Chema, who translated this article himself into Spanish! Check it out at: https://www.ibidemgroup.com/edu/traduccion-automatica-texto-python/

Working with data in a connected digital world, means you will possibly encounter data in a language outside your own. In this post we'll go into ways to translate this data in Python.

First off we need some sample text, and what is better to read about then pizza!

In [1]:
# Source: http://saberitaliano.com.ar/reading/pizza.html
sample_text_it = """La pizza è un prodotto gastronomico che ha per base un impasto di acqua, farina di frumento, e lievito, lavorato fino a ottenere una forma piatta, cotto al forno e variamente condito.

Benché si tratti ormai di un prodotto diffuso in quasi tutto il mondo, la pizza è generalmente considerata un piatto originario della cucina italiana ed in particolar modo napoletana. Nel sentire comune, infatti, ci si riferisce con questo termine alla pizza tonda condita con pomodoro e mozzarella, ossia la variante più conosciuta della cosiddetta pizza napoletana.

La vera e propria origine della pizza è tuttavia argomento controverso: oltre a Napoli, altre città ne rivendicano la paternità. Esiste, del resto, anche un significato più ampio del termine "pizza". Infatti, trattandosi in ultima analisi di una particolare specie di pane o focaccia, la pizza si presenta in innumerevoli derivazioni e varianti, cambiando nome e caratteristiche a seconda delle diverse tradizioni locali. In particolare, in alcune aree dell'Italia centrale, viene chiamata "pizza" qualsiasi tipo di torta cotta al forno, salata o dolce e alta o bassa che sia."""

Note that we use triple quotes to contain this string, because this is a multiline string. Care must be taken when using triple quotes, as these strings will also contain special characters such as newline, tabs and more.

Google Translate API - Free!

If you've ever needed to translate anything from a language that you didn't know, chances are you've used Google Translate. It's almost magic how well this software performs, and the features it boasts are nothing short of amazing. On top of all of this, they offer a free API that we can access directly from our code! While there is some limitations, this is still an amazing start and if the program isn't required to translate mass amounts of text, this would be a straightforward solution.

To make this even simpler in Python, there is a package aptly named googletrans, which interfaces with the Google Translate API for us https://pypi.org/project/googletrans/. Let's begin by importing the package, and seeing what languages are supported.

In [2]:
import googletrans
print(googletrans.LANGUAGES)
{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}

Next we need to create an instance of the translator class within googletrans. Another feature of Google Translate, is that it can automatically detect the language that the text is in, and we can use this feature in our code too! This is straightforward by passing our string to the method translator.detect(). This will return an instance of a class named Detected, in which we can see the detected language and how confident the package is in it's prediction.

In [3]:
translator = googletrans.Translator()

print(translator.detect(sample_text_it))
Detected(lang=it, confidence=1.0)

Now time to get translating! Once again, a straightforward method of translator.translate() is used by passing the string. We can also define the source & destination language to translate to. This returns an instance of the Translated class containing attributes of:

  • src - The source language to translate from
  • dest - The destination language to translate to
  • text - The translated text
  • origin - The original text
  • pronunciation - How to pronounce the translated text (although this only returns None in my experience)
In [4]:
translated = translator.translate(sample_text_it, src='it', dest='en')

print(translated)
Translated(src=it, dest=en, text=The pizza is a gourmet product that has as its base a mixture of water, wheat flour, and yeast, worked up to obtain a flat shape, baked and variously seasoned.

Although it now is a widespread product in almost all over the world, pizza is generally considered an original dish of Italian cuisine and especially Neapolitan way. In common feeling, in fact, we are referring to with this term round pizza topped with tomato sauce and mozzarella, which is the most well-known variant of the so-called Neapolitan pizza.

The real origin of pizza, however, is controversial: in addition to Naples, other cities claim its paternity. There is, moreover, also a broader meaning of the term "pizza". In fact, since in the final analysis of a particular kind of bread or focaccia, pizza presents itself in countless derivations and variations, changing name and characteristics depending on the different local traditions. In particular, in some areas of central Italy, it is called "pizza" any type of baked cake baked, salty or sweet, high or low it is., pronunciation=None, extra_data="{'translat...")

What if we had many strings that needed translating? Let's start by breaking our sample text into sections with the method str.splitlines(), this will create a list of each string that is separate by a newline (separate paragraphs).

We use the filter() function to rid the list of empty strings for neatness sake

In [5]:
sentence_list = sample_text_it.splitlines()
# Remove empty strings
sentence_list = list(filter(None, sentence_list))
print(sentence_list)
['La pizza è un prodotto gastronomico che ha per base un impasto di acqua, farina di frumento, e lievito, lavorato fino a ottenere una forma piatta, cotto al forno e variamente condito.', 'Benché si tratti ormai di un prodotto diffuso in quasi tutto il mondo, la pizza è generalmente considerata un piatto originario della cucina italiana ed in particolar modo napoletana. Nel sentire comune, infatti, ci si riferisce con questo termine alla pizza tonda condita con pomodoro e mozzarella, ossia la variante più conosciuta della cosiddetta pizza napoletana.', 'La vera e propria origine della pizza è tuttavia argomento controverso: oltre a Napoli, altre città ne rivendicano la paternità. Esiste, del resto, anche un significato più ampio del termine "pizza". Infatti, trattandosi in ultima analisi di una particolare specie di pane o focaccia, la pizza si presenta in innumerevoli derivazioni e varianti, cambiando nome e caratteristiche a seconda delle diverse tradizioni locali. In particolare, in alcune aree dell\'Italia centrale, viene chiamata "pizza" qualsiasi tipo di torta cotta al forno, salata o dolce e alta o bassa che sia.']

Let's create a function to handle the translation step. We instantiate a new translator each time the function is called, this also helps to reinitalise the Google Translate API on each time we translate. If the src_lang isn't defined the function call, let's use the magical method translator.detect() to make a prediction for us.

Now we iterate over the list, calling our new function repeatedly and finally appending the translated data into a new list translated_list.

In [6]:
def translate_text(text:str,src_lang:str=None,dest_lang:str="en"):
    translator = googletrans.Translator()
    if src_lang is None:
        src_lang = translator.detect(text).lang
    
    return translator.translate(text,src=src_lang,dest=dest_lang).text
        
translated_list = []

for sentence in sentence_list:
    translated_list.append(translate_text(sentence))

print(translated_list)
['The pizza is a gourmet product that has as its base a mixture of water, wheat flour, and yeast, worked up to obtain a flat shape, baked and variously seasoned.', 'Although it now is a widespread product in almost all over the world, pizza is generally considered an original dish of Italian cuisine and especially Neapolitan way. In common feeling, in fact, we are referring to with this term round pizza topped with tomato sauce and mozzarella, which is the most well-known variant of the so-called Neapolitan pizza.', 'The real origin of pizza, however, is controversial: in addition to Naples, other cities claim its paternity. There is, moreover, also a broader meaning of the term "pizza". In fact, since in the final analysis of a particular kind of bread or focaccia, pizza presents itself in countless derivations and variations, changing name and characteristics depending on the different local traditions. In particular, in some areas of central Italy, it is called "pizza" any type of baked cake baked, salty or sweet, high or low it is.']

DeepL Translator

Sometimes we need to translate much larger datasets. I recently faced this problem while volunteering for a COVID-19 project in which we wanted to run analysis on the tweets of Italy during the pandemic. We were supplied with millions of tweets over weeks in May 2020, all in Italian. As most of the team only spoke English, and as far as we knew sentiment analysis was developed for English extensively; we would need to translate all of this data. While Google Translate can be paid for, we were kindly donated an API key from the DeepL team for our cause. Following this, will be how the team & I set up the mass translator with pandas, requests and the DeepL Translator API.

As to not share the donated API key, responses from the DeepL translator will not be shown in this post.

Firstly, we will initialise our sample text (split by paragraph) from before, in a pandas DataFrame to represent our mass datasets.

In [7]:
import pandas as pd
import numpy as np
import requests
import json
In [8]:
source_text_df = pd.DataFrame(sentence_list,columns=["Source Text"])

source_text_df
Out[8]:
Source Text
0 La pizza è un prodotto gastronomico che ha per...
1 Benché si tratti ormai di un prodotto diffuso ...
2 La vera e propria origine della pizza è tuttav...

Now that we have our data arranged in a pandas DataFrame, we create a function for interfacing with the DeepL Translator API.

The documentation for the DeepL Translator API is fantastic, and can be reached at https://www.deepl.com/docs-api/introduction/.

Using build our request as a dictionary to be used with the requests package. Read more details about how payloads are formatted in the requests documentation.

The response back from DeepL is then in the JSON format, which we can then convert back into a dictionary with the response.json() method.

Finally we loop through the returned response, and reconstruct a list of the translated data to return from the function.

In [9]:
def translate_text_deepl(data,api_key,src_lang:str="IT",target_lang:str="EN"):
        # Create empty list
    translated_list = []

    try:
        # Translate all tweets and add to list
        # Request should not exceed 30KB
        parameters = {
            "text": data,
            "source_lang": src_lang,
            "target_lang": target_lang,
            "auth_key": api_key,
        }
        response = requests.get("https://api.deepl.com/v2/translate", params=parameters)
        deepl_response_data = response.json()
        for item in deepl_response_data.values():
            for key in item:
                translated_list.append(key["text"])
    except json.decoder.JSONDecodeError:
        # Insert error for each line in data
        for _ in data:
            translated_list.append("Error")
        print(f"Error translating.. `Error` placed in output dataset")

    return translated_list

Now that we have our function that interfaces with the DeepL Translator API, we can use this to translate all of our data easily by calling the function, that's it, done!

The output shows Error due to the API Key not being valid

In [10]:
source_text_df['Translated Text'] = translate_text_deepl(source_text_df['Source Text'],"APIKEY","IT","EN")
Error translating.. `Error` placed in output dataset

If the DataFrame is much larger in size, we may need to process this data in chunks. This can be done by using our new translate_text_deepl function from a different perspective.

Similarly we create a new list to store all of our translated data, and add in new data iteratively in chunks in our DataFrame. Chunks can be iterated through a DataFrame using the groupby method, along with with arange function in numpy.

The output shows Error due to the API Key not being valid

In [11]:
all_translated_data = []

chunk_size = 2

for _, chunk in source_text_df.groupby(np.arange(len(source_text_df)) // chunk_size):

    # Add new data to list
    # We use extend as the function returns a list
    all_translated_data.extend(translate_text_deepl(chunk['Source Text'],"APIKEY","IT","EN"))

source_text_df['Translated Text'] = all_translated_data
Error translating.. `Error` placed in output dataset
Error translating.. `Error` placed in output dataset