Translating the Contents of Excel Files with Python

The title does not lie; in this post, we will translate the content of an excel file, several rows and columns of text in Spanish, into English using pandas and the Google translate Ajax API through the googletrans module.


For this particular example, we are translating into English the Spanish versions of the Common Procurement Vocabulary codes, CPV codes, adopted by regulation in the European Union to harmonize the classification of public tendering purchases. It is also helpful in the private sector as it offers a maintained procurable commodity classification chart in this link, similar to what S1000D breakdown does for complex defense, aerospace, and naval systems.


Our target excel file is this one; there is nothing special, CPV codes and CNAE codes; CNAE codes are the Spanish national activity classification for enterprise activities. In a follow-up post, we will let the machine find what CPV is close to what CNAE code, one being the commodity to purchase and the other being the activity of a given company:

correspondecia_CPV_CNAE
.xlsx
Download XLSX • 623KB

To perform the translation, we will need this very specific googletrans version:

!pip install pip install googletrans==3.1.0a0

This version is needed because of this issue:

https://github.com/ssut/py-googletrans/issues/260#issuecomment-1189236174


We bring in pandas, the translation module, and some timing utilities to check the performance of translation speed. There will be two bottlenecks, the speed iterating through pandas row entries and the API network calls to obtain the translation that we will measure:

import pandas as pd
from googletrans import Translator as T
import time
import tqdm.notebook as tq

Loading the excel file as a dataframe shows a dirty NaN column that we can remove by dropping the last column, index -1:

CNAE = pd.read_excel('/content/correspondecia_CPV_CNAE.xlsx', 'CNAE', dtype=str)
CPV = pd.read_excel('/content/correspondecia_CPV_CNAE.xlsx', 'CPV2003', dtype=str)
CNAE.head()

CPV.drop(columns=[CPV.columns[-1]], inplace=True)
CPV.head()


Next, we create a translator object, T, and select the columns to translate; in the case of the CPV dataframe, now the last column is the one we will translate:

translator = T()
col = CPV.columns[-1]

We will iterate over the rows in the CPV dataframe using itertuples while measuring the time the translation takes; there are 9000 terse phrases to translate; spoiler alert, it could take more than 15 minutes in Colab:

for t in tq.tqdm(CPV.itertuples()):
  index = t[0] 
  start = time.time()
  translation = translator.translate(t[-1],
                                     lang_src="auto",
                                     lang_tgt="en")
  end = time.time()
  print(f'\nTranslation time: {end-start:.3f} seconds.')
  print(f'{t[-1]}\n{translation.text}', end="")
  CPV.iloc[index][col] = translation.text


As it seems too long a time for such a simple translation task, let's define a translation function to use an apply method to our original CPV dataframe:

def translate(text):
  t = translator.translate(text, lang_src="auto", lang_tgt="en")
  return t.text

The function will take the text content and return the translated text. We can apply this to the specific column of the CPV dataframe; we reload it for the sake of illustration:

CPV = pd.read_excel('/content/correspondecia_CPV_CNAE.xlsx', 'CPV2003', dtype=str)
CPV.drop(columns=[CPV.columns[-1]], inplace=True)

col = CPV.columns[-1]
start = time.time()
CPV[col] = CPV[col].apply(lambda x: translate(x))
end = time.time()
print(f"Apply took {end-start:.3f} seconds")

The time is now around 275 seconds, much faster than the row iteration method; the network call and API response for a very short text string may be faster than a row iteration in pandas. We apply this same more rapid method to our CNAE dataframe:

col = CNAE.columns[-1]
start = time.time()
CNAE[col] = CNAE[col].apply(lambda x: translate(x))
end = time.time()
print(f"Apply took {end-start:.3f} seconds")

And check the translated results, which seem good:

CNAE and CPV translated into english.

The remaining step is to return our translations to excel format using a pandas ExcelWriter object:

writer = pd.ExcelWriter("CPV_CNAE_EN.xlsx")
CPV.to_excel(writer, sheet_name = 'CPV_EN', index=False)
CNAE.to_excel(writer, sheet_name = 'CNAE_EN', index=False)
writer.save() 

Your translated excel files are ready to download now. In our subsequent publication, we will find the most similar CPV to a CNAE and vice versa. We will try to link a proposed procurement object to the most probable declared company activity.


Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, automation, or intelligence gathering from satellite, drone, or fixed-point imagery. Also, check our AI-Powered Spanish public tender search application using sentence similarity analysis to provide better tender matches to selling companies.


www.contratacionfacil.es


The notebook for this demonstration is in this link.

278 views0 comments

Recent Posts

See All