In our previous post, we used the human-readable daily update email to feed information to a machine inside a pandas dataframe. Today, as a contrast, we will use the machine-readable version of the same data to inform our machine learning algorithms. The main issue with underperforming machine learning models is the low quality or quantity of data used to train them. It is a good practice to get as much quality data as possible, even if outdated, to create a baseline model that can constantly learn from new data using older data as a basis. The machine will, eventually, learn new patterns from new data using good older data as a baseline.
That same data we read as an email intended for humans is published as an Atom Syndication Format XML file. This is not human readable, as a daunting screenshot of the displayed data shows. Moreover, the text is so long that it may clog a web browser, visit the link with care:
The machine will have no problem ingesting this monster publication. However, it will require some recursion as the main problem when accessing a database view is the deep nesting at which data tables are rendered as XML tags. We will iterate over the different entries of the database to read the XML and obtain a single table for that information of interest to us.
We will need these three Python modules:
from urllib.request import urlopen
from xml.etree import ElementTree as et
import pandas as pd
URL request will enable us to connect to the publishing site. XML will allow us to represent the XML tags as a tree (to parse it correctly). We will translate the relational database into a single table with that partial information we need using pandas. Trying to represent a complete database in a single pandas dataframe is complex; it is doable, although the representation of referenced tables makes it difficult to read and repetitive, just as the XML structure.
To access the publishing site, we can use:
source = 'https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_643/licitacionesPerfilesContratanteCompleto3.atom'
f = urlopen(source)
myfile = f.read()
header = 'http://www.w3.org/2005/Atom'
string = myfile.decode('UTF-8').replace(header,'')
The publishing site contains a header that we can remove no to confuse our XML parser. With this code, we create a large text string containing the publication itself in its entirety. Let us inspect some random 3000 characters from the middle of the text string we have created:
The structure is convoluted but readable. Tags are apparently well indented so that our XML parser should find no trouble in finding the nodes that contain data. We can parse it into a tree structure using:
root = et.fromstring(string)
We will not be successful if we try to navigate this tree from root to tree, looking for a consistent database. In general, databases exported to XML will generate multiple similarly indexed entries for different elements in the table, as an example, a person may have an ID entry as "ID" with a unique number; also, his country of residence may be coded with an "ID" code so that for a single user you will find a personal ID number and a country code, both entries stated as "ID: XXXXXX". Thus, the only representation that the XML tree leaves us to work with is the level of indentation where the specific ID value is located. Now, the combination of ID and level of indentation uniquely identifies both the personal ID and the country ID in this example.
We can inspect this on our own parsed text: marked in red are two different ID attributes with the same tag ('"ID") at different levels.
We will define a function to extract all these entries iteratively while keeping the level at which these were found, so we will have, just as examples ID-3 and ID-8, corresponding to the ID entries for different tables in the original database:
lvl = 0
def recursive_view(node):
global lvl
lvl = lvl + 1
if '}' in node.tag:
tag = node.tag.split('}')[1]
else:
tag = node.tag
if str(node.text).strip():
print(lvl,':',tag,':',node.text)
children = node.getchildren()
total_children = len(children)
if total_children == 0: lvl = lvl -1
for i in range(total_children):
recursive_view(children[i])
return
for node in root[-2:]:
print('-----------------------------------------------------\n')
print(node.tag, node.attrib)
if 'entry' in node.tag:
lvl = 0
recursive_view(node)
Starting at level 0, we recursively obtain all the children up to the leaves of the tree. At each recursion, we log the level at which the recursion is happening while we record the level together with the tag and the entry text. Checking the view for the last two entries, we obtain an indexed view of these similarly-named entries:
Using this same method, we should capture all those specific entries, by number and tag, that are of interest to us. So we can capture, for example, those entries that are generated in the human-readable email from our previous post:
lvl = 0
tags_to_capture = ['2:title',
'3:ContractFolderID',
'3:ContractFolderStatusCode',
'14:TotalAmount',
'15:ItemClassificationCode',
'26:EndDate',
'26:EndTime'
]
def get_entry(node):
global lvl
global tags_to_capture
lvl = lvl + 1
if '}' in node.tag:
tag = node.tag.split('}')[1]
else:
tag = node.tag
tag_id = str(lvl)+':'+tag
if tag_id in tags_to_capture:
entry[tag_id] = node.text
if tag_id in tags_to_capture:
entry[tag_id] = node.text
children = node.getchildren()
total_children = len(children)
if total_children == 0: lvl = lvl -1
for i in range(total_children):
get_entry(children[i])
return
all_entries = []
for node in root:
if 'entry' in node.tag:
lvl = 0
entry = {}
get_entry(node)
all_entries.append(entry)
The title of the contract is "2-title", the value of the contract is "14-TotalAmount", and so for every unique database entry. There are sufficiently few data elements in the database (117) to be manually inspectable. Also, most of these entries are not used in this specific application and are always blank, making the reconstruction of the database easier.
Our all_entries list contains individual dictionary entries that from a set of records. Thus, we can very easily create a pandas dataframe using the from_records function:
df=pd.DataFrame.from_records(all_entries)
There are many entries without any of our data elements of interest. These entries are database deletions or additions, for the most part. Dropping all the rows that contain all NaNs from our dataframe, we obtain the information we are looking for in a convenient format:
df.dropna(axis=0, how='all')
With this, we have added a single ATOM page. The header of the page contains a link to the previous publication. The publications form a chain that allows for the reconstruction of the complete database history by calling the functions in this post multiple times. We will try to reconstruct as much of the database as possible only for the subset of data elements that may interest us in a future post.
If you require data model development, deployment, verification, or validation, do not hesitate and contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.
The notebook for this post is here.
コメント