Translating Text in Python
Posted on Fri 29 May 2020 in Data Science • 9 min read
Massive thank you to Chema, who translated this article himself into Spanish! Check it out at: https://www.ibidemgroup.com/edu/traduccion-automatica-texto-python/
Working with data in a connected digital world means you will likely encounter data in a language other than your own. In this post we'll cover ways to translate this data in Python.
First off, we need some sample text, and what better to read about than pizza!
# Source: http://saberitaliano.com.ar/reading/pizza.html
sample_text_it = """La pizza è un prodotto gastronomico che ha per base un impasto di acqua, farina di frumento, e lievito, lavorato fino a ottenere una forma piatta, cotto al forno e variamente condito.
Benché si tratti ormai di un prodotto diffuso in quasi tutto il mondo, la pizza è generalmente considerata un piatto originario della cucina italiana ed in particolar modo napoletana. Nel sentire comune, infatti, ci si riferisce con questo termine alla pizza tonda condita con pomodoro e mozzarella, ossia la variante più conosciuta della cosiddetta pizza napoletana.
La vera e propria origine della pizza è tuttavia argomento controverso: oltre a Napoli, altre città ne rivendicano la paternità. Esiste, del resto, anche un significato più ampio del termine "pizza". Infatti, trattandosi in ultima analisi di una particolare specie di pane o focaccia, la pizza si presenta in innumerevoli derivazioni e varianti, cambiando nome e caratteristiche a seconda delle diverse tradizioni locali. In particolare, in alcune aree dell'Italia centrale, viene chiamata "pizza" qualsiasi tipo di torta cotta al forno, salata o dolce e alta o bassa che sia."""
Note that we use triple quotes to contain this string because it is a multiline string. Care must be taken when using triple quotes, as these strings will also contain special characters such as newlines, tabs and more.
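As a quick illustration (with a made-up two-line string), the line break inside a triple-quoted string becomes a real newline character in the resulting value:

```python
# A small multiline string: the line break is stored as a "\n" character
multiline = """first line
second line"""

print(repr(multiline))           # the repr shows the embedded \n
print(len(multiline.splitlines()))  # → 2
```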
Google Translate API - Free!
If you've ever needed to translate anything from a language you didn't know, chances are you've used Google Translate. It's almost magic how well this software performs, and the features it boasts are nothing short of amazing. On top of all this, Google offers a free API that we can access directly from our code! While there are some limitations, this is still an amazing start, and if the program isn't required to translate massive amounts of text, this is a straightforward solution.
To make this even simpler in Python, there is a package aptly named googletrans, which interfaces with the Google Translate API for us: https://pypi.org/project/googletrans/. Let's begin by importing the package and seeing which languages are supported.
import googletrans
print(googletrans.LANGUAGES)
Next we need to create an instance of the translator class within googletrans. Another feature of Google Translate is that it can automatically detect the language the text is in, and we can use this feature in our code too! This is as straightforward as passing our string to the method translator.detect(). This will return an instance of a class named Detected, in which we can see the detected language and how confident the package is in its prediction.
translator = googletrans.Translator()
print(translator.detect(sample_text_it))
Now time to get translating! Once again, a straightforward method, translator.translate(), is used by passing the string. We can also define the source and destination languages to translate between. This returns an instance of the Translated class containing the attributes:
- src - The source language to translate from
- dest - The destination language to translate to
- text - The translated text
- origin - The original text
- pronunciation - How to pronounce the translated text (although this only returns None in my experience)
translated = translator.translate(sample_text_it, src='it', dest='en')
print(translated)
What if we had many strings that needed translating? Let's start by breaking our sample text into sections with the method str.splitlines(); this will create a list of the strings that are separated by newlines (separate paragraphs).

We use the filter() function to rid the list of empty strings, for neatness' sake.
sentence_list = sample_text_it.splitlines()
# Remove empty strings
sentence_list = list(filter(None, sentence_list))
print(sentence_list)
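As a side note, passing None as the function argument to filter() keeps only the truthy elements, which for a list of strings means dropping the empty ones; a minimal example with a made-up list:

```python
# filter(None, ...) drops falsy items such as "" from an iterable
words = ["first paragraph", "", "second paragraph", ""]
print(list(filter(None, words)))  # → ['first paragraph', 'second paragraph']
```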
Let's create a function to handle the translation step. We instantiate a new translator each time the function is called; this also helps to reinitialise the Google Translate API on each translation. If src_lang isn't defined in the function call, we use the magical translator.detect() method to make a prediction for us.

Now we iterate over the list, calling our new function repeatedly and appending the translated data into a new list, translated_list.
def translate_text(text: str, src_lang: str = None, dest_lang: str = "en"):
    translator = googletrans.Translator()
    if src_lang is None:
        src_lang = translator.detect(text).lang
    return translator.translate(text, src=src_lang, dest=dest_lang).text
translated_list = []
for sentence in sentence_list:
    translated_list.append(translate_text(sentence))
print(translated_list)
DeepL Translator
Sometimes we need to translate much larger datasets. I recently faced this problem while volunteering for a COVID-19 project in which we wanted to run analysis on the tweets of Italy during the pandemic. We were supplied with millions of tweets spanning weeks in May 2020, all in Italian. As most of the team only spoke English, and as far as we knew sentiment analysis tooling was most extensively developed for English, we would need to translate all of this data. While Google Translate can be paid for, we were kindly donated an API key from the DeepL team for our cause. What follows is how the team and I set up the mass translator with pandas, requests and the DeepL Translator API.
So as not to share the donated API key, responses from the DeepL translator will not be shown in this post.
Firstly, we will initialise our sample text (split by paragraph) from before in a pandas DataFrame to represent our mass datasets.
import pandas as pd
import numpy as np
import requests
import json
source_text_df = pd.DataFrame(sentence_list,columns=["Source Text"])
source_text_df
Now that we have our data arranged in a pandas DataFrame, we create a function for interfacing with the DeepL Translator API.
The documentation for the DeepL Translator API is fantastic, and can be reached at https://www.deepl.com/docs-api/introduction/.
We build our request as a dictionary to be used with the requests package. Read more details about how payloads are formatted in the requests documentation.
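For a feel of what requests does with such a dictionary under the hood, here is the same encoding done with the standard library's urllib.parse.urlencode; the parameter values below are placeholders for illustration, not a real key:

```python
from urllib.parse import urlencode

# A hypothetical payload mirroring the parameters we will send; "APIKEY" is a placeholder
parameters = {
    "text": "La pizza è un prodotto gastronomico",
    "source_lang": "IT",
    "target_lang": "EN",
    "auth_key": "APIKEY",
}

# Spaces become "+" and non-ASCII characters are percent-encoded (è → %C3%A8)
print(urlencode(parameters))
```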
The response back from DeepL is in JSON format, which we can then convert back into a dictionary with the response.json() method.
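To show the shape we are about to loop over, here is a hand-written sample body in the layout DeepL's documentation describes; the translated text below is invented for illustration, not real API output:

```python
import json

# A mocked-up response body in the documented DeepL shape (not a real response)
sample_response = '''{
    "translations": [
        {"detected_source_language": "IT", "text": "Pizza is a gastronomic product..."}
    ]
}'''

deepl_response_data = json.loads(sample_response)
for translation in deepl_response_data["translations"]:
    print(translation["text"])  # → Pizza is a gastronomic product...
```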
Finally we loop through the returned response, and reconstruct a list of the translated data to return from the function.
def translate_text_deepl(data, api_key, src_lang: str = "IT", target_lang: str = "EN"):
    # Create empty list
    translated_list = []
    try:
        # Translate all lines and add to list
        # Request should not exceed 30KB
        parameters = {
            "text": data,
            "source_lang": src_lang,
            "target_lang": target_lang,
            "auth_key": api_key,
        }
        response = requests.get("https://api.deepl.com/v2/translate", params=parameters)
        deepl_response_data = response.json()
        for translations in deepl_response_data.values():
            for translation in translations:
                translated_list.append(translation["text"])
    except json.decoder.JSONDecodeError:
        # Insert an error marker for each line in data
        for _ in data:
            translated_list.append("Error")
        print("Error translating.. `Error` placed in output dataset")
    return translated_list
Now that we have our function that interfaces with the DeepL Translator API, we can translate all of our data simply by calling it. That's it, done!
The output shows Error because the API key is not valid.
source_text_df['Translated Text'] = translate_text_deepl(source_text_df['Source Text'],"APIKEY","IT","EN")
If the DataFrame is much larger, we may need to process the data in chunks. This can be done by using our new translate_text_deepl function from a different perspective.

Similarly, we create a new list to store all of our translated data, and add new data iteratively, one chunk of the DataFrame at a time. A DataFrame can be iterated through in chunks using the groupby method, along with the arange function in numpy.
The output shows Error because the API key is not valid.
all_translated_data = []
chunk_size = 2
for _, chunk in source_text_df.groupby(np.arange(len(source_text_df)) // chunk_size):
    # Add new data to list
    # We use extend as the function returns a list
    all_translated_data.extend(translate_text_deepl(chunk['Source Text'], "APIKEY", "IT", "EN"))
source_text_df['Translated Text'] = all_translated_data
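The grouping key np.arange(len(source_text_df)) // chunk_size works because integer division assigns the same label to every chunk_size consecutive rows; the same idea in plain Python, with a toy five-row frame:

```python
# Each row index divided (floor division) by the chunk size gives its chunk label
chunk_size = 2
row_labels = [i // chunk_size for i in range(5)]
print(row_labels)  # → [0, 0, 1, 1, 2]
```

Rows sharing a label end up in the same group, so each group handed to translate_text_deepl contains at most chunk_size rows.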