Obtaining and Recording ATOM Data Chains

Continuing our two previous posts on getting information from ATOM published sources, we will now obtain the complete publication data for a single day from the Spanish public contracts open data publications. The header of each ATOM file contains a link element marked rel="next" that references the following file in the data chain.

We will grab all the documents in the chain that belong to the same publication date to obtain the complete daily "bulletin" that includes all changes made to all entries in the publication on that given day. With this information in hand, it is possible to reconstruct the current state of the complete data set.
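As a minimal sketch of how the chain reference can be read, the snippet below looks up the link marked rel="next" with ElementTree on a small in-memory feed. The sample feed, its example.org URLs, and the helper name find_next_link are assumptions made for illustration; real files are much larger.

```python
from xml.etree import ElementTree as et

# Hypothetical, heavily shortened feed; real files carry many more elements.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <updated>2021-03-15T18:30:00.000+01:00</updated>
  <link href="https://example.org/feed_current.atom" rel="self"/>
  <link href="https://example.org/feed_previous.atom" rel="next"/>
  <entry><title>First entry</title></entry>
</feed>"""

# Prefix-to-URI mapping used by ElementTree for namespace-aware lookups
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def find_next_link(xml_text):
    """Return the href of the feed-level link marked rel="next", or None."""
    root = et.fromstring(xml_text.encode("utf-8"))
    for link in root.findall("atom:link", ATOM_NS):
        if link.get("rel") == "next":
            return link.get("href")
    return None

next_href = find_next_link(SAMPLE_FEED)
print(next_href)
```

This variant parses the XML instead of scanning header lines, so it does not depend on the header having a fixed number of lines.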

We will start by installing (if needed) and importing the modules we will use for this task:

from urllib.request import urlopen
from xml.etree import ElementTree as et
import pandas as pd
import re
from datetime import datetime

These modules cover the steps ahead: requesting and opening content at a network URL, parsing an XML tree, loading our data into a pandas dataframe, and handling regex and datetime tasks.

To avoid clogging our system with every data entry, we will define a set of "keys of interest" to gather. These keys were determined in our previous posts and contain the most relevant data regarding public sector publications. We also have to define and initialize a "level 0" counter that, sadly, acts as a global variable to track the level during our recursive key search. This is not a very desirable solution; it just illustrates how to quickly access nested XML data to reconstruct a database with multiple tables:

lvl = 0
tags_to_capture = ['2:title', 

Our source is the publishing endpoint for the Spanish Government public contracts open data site:

source = 'https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_643/licitacionesPerfilesContratanteCompleto3.atom'

The following battery of functions lets us, in turn, get and read a given ATOM file, extract its header (the first 14 lines, for this publication type only), find the publication date, get the next element in the chain, and keep collecting files until the date changes. With these functions, we can capture a full single publication day:

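def get_file(url):
    # Download the ATOM file and close the connection when done
    with urlopen(url) as f:
        myfile = f.read()
    # Blank out the Atom namespace URI so Atom element tags parse without a namespace prefix
    header = 'http://www.w3.org/2005/Atom'
    string = myfile.decode('UTF-8').replace(header,'')
    return string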

def extract_header(file, cut_off=14):
    header = file.split('\n')[:cut_off]
    return header    

def get_date(file):
    header = extract_header(file)
    date_pattern = r"<updated>(.*?)</updated>"
    date = None
    for l in header:
        if '<updated>' in l:
            date= re.search(date_pattern, l).group(1)
    return date

def get_next_url(file):
    header = extract_header(file)
    next_pattern = r'<link href="(.*?)" rel="next"/>'
    next_file = None
    for l in header:
        if 'rel="next"' in l:
            match = re.search(next_pattern, l)
            if match:
                next_file = match.group(1)
    # None signals that the header holds no link to a further file
    return next_file

def get_same_day(file):
    needed_urls = [file]
    file = get_file(file)
    date = get_date(file)[0:10]
    print('Start fetching chain...')
    while True:
        next_url = get_next_url(file)
        if next_url is None:
            # End of the chain: no further file to fetch
            break
        new_file = get_file(next_url)
        next_date = get_date(new_file)[0:10]
        if next_date != date:
            # The next file belongs to another day: stop collecting
            break
        needed_urls.append(next_url)
        file = new_file
    return needed_urls, date
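The chain-walking logic can be checked without touching the network. The sketch below is a simplified stand-in, not the functions above: the file names, the FAKE_FILES contents, and the helper names day_of, next_of and walk_same_day are all hypothetical, and the fetch step is passed in as a function so that a dictionary lookup can replace the download.

```python
import re

# Hypothetical in-memory "site": URL -> file content, newest file first.
FAKE_FILES = {
    "a.atom": '<updated>2021-03-15T18:00:00</updated>\n<link href="b.atom" rel="next"/>',
    "b.atom": '<updated>2021-03-15T09:00:00</updated>\n<link href="c.atom" rel="next"/>',
    "c.atom": '<updated>2021-03-14T20:00:00</updated>\n<link href="d.atom" rel="next"/>',
}

def day_of(text):
    """Return the YYYY-MM-DD part of the first <updated> element."""
    return re.search(r"<updated>(.*?)</updated>", text).group(1)[:10]

def next_of(text):
    """Return the href of the rel="next" link, or None when the chain ends."""
    match = re.search(r'<link href="(.*?)" rel="next"/>', text)
    return match.group(1) if match else None

def walk_same_day(start_url, fetch):
    """Collect URLs from start_url onward while the publication day is unchanged."""
    text = fetch(start_url)
    day = day_of(text)
    urls = [start_url]
    while True:
        next_url = next_of(text)
        if next_url is None:
            break
        text = fetch(next_url)
        if day_of(text) != day:
            break
        urls.append(next_url)
    return urls, day

urls, day = walk_same_day("a.atom", FAKE_FILES.__getitem__)
print(urls, day)
```

Here the walk stops at c.atom because its date differs from the starting file, so only a.atom and b.atom are kept.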

The get_same_day function returns the URLs of the files needed for a single date, together with that date. Publishing happens several times a day on an irregular schedule, since it depends on the amount of data to be published. It is good practice to run the code every few hours to keep records fully updated, with a final run close to midnight in the time zone of interest. With more time and effort, the code can be adapted to fetch the last two or three days and recover any missed publication time frames. For such extended capture periods, we would have to capture the publishing time in addition to the day, so that the database can be fully reconstructed. For consultation purposes, what matters is the last state of the database; the entire transaction record is of secondary business value:

needed_files, date = get_same_day(source)
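When several days are captured, keeping the full publishing timestamp makes it possible to order changes and retain only the last state of each entry. The following is a rough sketch with made-up records; the '2:id' and '2:updated' keys and the status values shown are assumptions for illustration and may not match the keys actually captured.

```python
import pandas as pd

# Hypothetical records: the same entry id "A" published three times over two days.
records = pd.DataFrame({
    "2:id": ["A", "B", "A", "A"],
    "2:updated": [
        "2021-03-14T10:00:00+01:00",
        "2021-03-15T09:30:00+01:00",
        "2021-03-15T08:00:00+01:00",
        "2021-03-15T17:45:00+01:00",
    ],
    "3:ContractFolderStatusCode": ["PUB", "PUB", "EV", "ADJ"],
})

# Parse the full timestamp, not just the day, so that same-day changes are ordered.
records["updated_at"] = pd.to_datetime(records["2:updated"], utc=True)

# Sort chronologically and keep only the most recent row for each entry id.
latest = (
    records.sort_values("updated_at")
           .drop_duplicates(subset="2:id", keep="last")
           .reset_index(drop=True)
)
print(latest[["2:id", "3:ContractFolderStatusCode"]])
```

The complete transaction record stays available in records, while latest holds the view needed for consultation.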

With the files in hand, we can iterate over them using the key-value reading functions that were presented in our previous posts:

def get_entry(node, entry):
    global lvl
    global tags_to_capture

    lvl = lvl + 1

    # Drop the '{namespace}' prefix when the tag carries one
    if '}' in node.tag:
        tag = node.tag.split('}')[1]
    else:
        tag = node.tag

    tag_id = str(lvl)+':'+tag
    if tag_id in tags_to_capture:
    # if True: # Gets every node
        entry[tag_id] = node.text

    # Element.getchildren() was removed in Python 3.9; list(node) returns the children
    children = list(node)
    total_children = len(children)
    if total_children == 0: lvl = lvl -1 
    for i in range(total_children):
        get_entry(children[i], entry)

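def get_df(root):
    global lvl
    all_entries = []
    for node in root:
        if 'entry' in node.tag:
            lvl = 0
            entry = {}
            get_entry(node, entry)
            # Keep the captured keys of this entry as one row
            all_entries.append(entry)
    df = pd.DataFrame(all_entries)
    df.dropna(axis=0, how='all', inplace=True)
    return df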
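The global level counter can be avoided by passing the depth as an argument of the recursion. The sketch below is one possible alternative, not the method used in this post; the function name collect and the sample entry are hypothetical. It numbers keys by true nesting depth, which may not coincide with the numbering produced by get_entry for deeply nested elements, so the keys of interest would need to be re-checked before switching.

```python
from xml.etree import ElementTree as et

def collect(node, wanted, depth=1, entry=None):
    """Recursively store node.text under 'depth:tag' for the keys in wanted."""
    if entry is None:
        entry = {}
    tag = node.tag.split('}')[-1]          # drop a '{namespace}' prefix if present
    key = f"{depth}:{tag}"
    if key in wanted:
        entry[key] = node.text
    for child in node:                      # iterating an Element yields its children
        collect(child, wanted, depth + 1, entry)
    return entry

# Hypothetical, heavily shortened entry
sample = et.fromstring(
    "<entry><title>Software licences</title>"
    "<ContractFolderStatus><ContractFolderStatusCode>PUB"
    "</ContractFolderStatusCode></ContractFolderStatus></entry>"
)
result = collect(sample, {"2:title", "3:ContractFolderStatusCode"})
print(result)
```

Because the depth travels with each call, no state has to be reset between entries.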

With these tools in hand, we can now feed the complete tree from the needed daily files into a pandas dataframe. We capture only the subset of tags defined earlier to avoid an unnecessarily large dataframe full of irrelevant information:

frames = []
for file in needed_files:
    root = et.fromstring(get_file(file))
    frames.append(get_df(root))

# DataFrame.append was removed in pandas 2.0; concatenate all files at once instead
full_df = pd.concat(frames, ignore_index=True).reindex(columns=tags_to_capture)

If any entry is entirely empty, we can drop it from the dataframe:

df = full_df.dropna(axis=0, how='all')

Filtering the day's published data with a simple text query gives a view of the gathered dataframe:

by_text = df[df['2:title'].str.contains('software', na=False)]
by_text[by_text['3:ContractFolderStatusCode'] == 'PUB']
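The same filter can be tried on a small hand-made dataframe to see how missing titles and letter case are handled. The rows below are invented for illustration, and the case=False option is an addition that the query above does not use.

```python
import pandas as pd

# Invented rows: one missing title and one upper-case match with another status
toy = pd.DataFrame({
    "2:title": ["Software maintenance", "Road resurfacing", None, "SOFTWARE licences"],
    "3:ContractFolderStatusCode": ["PUB", "PUB", "PUB", "ADJ"],
})

# na=False treats missing titles as non-matches; case=False ignores letter case
mask = (
    toy["2:title"].str.contains("software", case=False, na=False)
    & (toy["3:ContractFolderStatusCode"] == "PUB")
)
matches = toy[mask]
print(matches)
```

Combining both conditions in one boolean mask avoids building an intermediate dataframe.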

For posterity, we save the current daily dataframe state:

name = date[0:10].replace('-','') + '_record'
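The line above only builds the file name; the write itself is not shown. A minimal sketch, assuming a CSV file is an acceptable format and using a temporary directory as a hypothetical location, could look like this (the date value and the one-row dataframe are stand-ins so that the snippet runs on its own):

```python
import os
import tempfile
import pandas as pd

date = "2021-03-15"   # stand-in for the value returned by get_same_day
df = pd.DataFrame({"2:title": ["Software maintenance"],
                   "3:ContractFolderStatusCode": ["PUB"]})

name = date[0:10].replace('-', '') + '_record'
path = os.path.join(tempfile.gettempdir(), name + '.csv')   # hypothetical location

# Write without the index column, then read back to confirm the round trip
df.to_csv(path, index=False, encoding='utf-8')
restored = pd.read_csv(path)
print(path, restored.shape)
```

Other formats supported by pandas would work just as well; the date-based name keeps one file per publication day.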

Further use of the data is possible; in our own case, we use it to filter some precise keywords from daily publications. With more effort and computation, this open data set can also feed machine learning and AI algorithms that support companies seeking public contracts, mainly in construction, healthcare, transportation, and education.

If you require data model development, deployment, verification, or validation, do not hesitate to contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.

The notebook for this post is here.
