Human Readable Data for Machines with Python

H-Barrio
Sep 21, 2021
5 min read

Machine-readable data, data that is easy to be accessed by a computer, must be structured. This is the formal definition for data that is ready for computer injection and, unless dealing with whole entropy systems, all data, all, is structured in a way or another. Human readable data is, in fact, highly structured. It should also be machine-readable. The problem a machine faces when reading data intended for humans is grammatic in nature; the organization of the data follows a more complex logic with multiple rule exceptions.

In some cases, as for the sample data we will use today, human-readable data appears in a more convenient media, although not format, than machine-readable data. The case for today is public sector tender and contract information as published by the Spanish Government in their webpage at https://contrataciondelestado.es/wps/myportal/plataforma. You can find an ad-hoc tool on this page to transform the open data set into an Excel file easily. It also offers a human-readable subscription service for new entries and updates that is very convenient. We are making this newsletter machine-readable today. The sample file will be the following text file:

This file contains all daily changes to public contracts. The shape is this:

We have the properties of the contracts in human-readable form, with a title, contract value, contracting offices, and other relevant data. However, the data is not consistent. Some entries may be missing, some contracts may have a header, and some fields may be empty. We will import the required modules as the first step to data treatment, no re module today, just pandas and NumPy:

import numpy as np
import pandas as pd

Then we ingest the text file into a line-by-line list:

txt_file = '20210920_trimmed.txt'
with open(txt_file,'r', encoding='UTF-8') as f:
    lines = f.readlines()

The text contains accented characters (as in camión), so encoding UTF-8 has to be specified not to ingest unreadable characters. As we can see in the first image, there are instances of two consecutive newline characters; a human would not notice or care. These double spaces mean nothing when reading it as a machine and may throw off our reading algorithms. We will delete one new line entry if followed by another new line entry:

# Remove extra blank lines:
idx = 0
for l in lines:
      if l == '\n' and lines[idx+1]=='\n':
        lines.pop(idx)        
      idx = idx + 1

Our best approach to detecting individual contracts could be locating the lines in which a new contract is described. The lines containing "Título del Contrato:" indicate a new entry. All the space between two lines with this string form a dataset for a single contract. We can locate all the lines where a new contract starts with:

title_idxs = [idx for idx, l in enumerate(lines) if  'Título del Contrato' in l ]
print(len(title_idxs))

We have 322 contracts in this file. With these contracts delimited by index, we need to define a function to extract the data entries of relevance to us for each of them. For example, we can define a list of data titles to scan the contract string for and add the data into a contract dictionary as a joined string:

entries_to_record = ['CPV:', 'Importe:', 'Órgano de Contratación:',
                    'Administración:',
                    'Fecha final de presentación de ofertas:',
                    'Número de Expediente:',
                    'Ver detalle:']
                    
def extract_contract_entries(contract_text):
    contract_data = {}
    contract_data['Título del Contrato:'] = contract_text[0]
    for l in contract_text:
        for e in entries_to_record:
            if e in l: 
                contract_data[e] = "".join(l.split(e)[1:])
    return contract_data

If a given line of the contract as the text contains any data titles, we will split the line by the ":" separator and get the entry behind it. The contracts must fully text again, so we make each contract a text string, removing the "\n" newline entries for readability by both us and the machine:

lines_as_str = ''.join(lines)
contract_titles = lines_as_str.split('Título del Contrato:')
contract_titles[-1].split('\n')

We joint all contract lines in a single string again, split them through the new contract identifier "Título del Contrato:". The last entry is interesting as it contains a farewell message destined to a human, to be duly ignored by the machine as it has no use for courtesy:

To extract data for all contracts, we loop over the range of contract indexes. The type of announcement is on top of the title line of each contract, so we add it as a data element for the contract. The first lines of contract titles will contain a new line, so we skip it:

contract_titles = contract_titles[1:]
n_contracts = len(contract_titles)
contracts_data = []
for i in range(n_contracts):
    contract_data =extract_contract_entries(
                                            contract_titles[i].split('\n'))
    contract_data['Estado:'] = lines[title_idxs[i]-1].split('\n')[0]
    contracts_data.append(contract_data)

The last entry in dictionary form looks like this now:

Possibly, the most convenient format for this data is tabular, so we transform the dictionary into a pandas dataframe using the dictionary as records:

df = pd.DataFrame.from_records(contracts_data)

The headers, the dates, and the price values are not suitable for a machine; these still show human-readable values. Therefore, we will strip the ":" from the headers and fixed dates and contract prices for the machine:

from datetime import datetime

df.columns = [c.strip(':') for c in df.columns]
df.rename(columns={'Fecha final de presentación de ofertas':'Fecha Limite'}, inplace=True)

dates = df['Fecha Limite']
source_format = '%d/%m/%Y %H:%M'
fill_date = datetime.today() + timedelta(days=1)
fill_date = fill_date.strftime(source_format)
dates = [d if d != '' else  fill_date for d in dates]
dates = [d[1:] if d[0]==' ' else d for d in dates]
df['Fecha Limite'] = dates
df['Fecha Limite'] = df['Fecha Limite'].apply(lambda x: pd.to_datetime(x, format=source_format))

We need DateTime to fill in missing values with a dummy date. It is worth moving the date into the future to check whether the contract with the missing limit date is still accepting tenders or not. We can filter out these "dummy" dates with other contract parameters. Dates are transformed into DateTime objects using the appropriate format; this article explains the formats very well.

To correctly format the euro price from the Spanish number notation to computer notation, we will use these lambda functions:

df['Importe'] = df['Importe'].apply(lambda x: 
                    float(x.split(" ")[1].split(",")[0].replace(".","")) +
                    float(x.split(" ")[1].split(",")[1])/100)
df['Importe'] = df['Importe'].apply(lambda x: '%.2f' % x)

In a single pass, the first "lambda monster" will split decimal Euro values from the total quantity and remove the thousands separator, then we add the decimal Euro values again. The second little one just reformats the numbers to avoid pandas scientific notation.

With this, our data is clean in terms of formatting and ready to be analyzed and used:

We will exploit this data and see how to pull it from daily alert emails automatically.

If you require data model development, deployment, verification, or validation, do not hesitate and contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.

The notebook for this post is here.

OSTIRION

Human Readable Data for Machines with Python

Recent Posts

Comments