Simple Text Similarity for Automated Text Matching

H-Barrio
Sep 13, 2022
4 min read

As a follow-up to our last post before our summer holidays, and in line with our public contracts search automation efforts in www.contratationfacil.es, we will automatically try to obtain the similarities between two distinct commodity classification schemes. We will use our automatically translated purchased element code, CPV, and the Spanish activity classification codes for companies CNAE.

We will use our translated files to perform the most probable match between a CPV and a CNAE to filter published public sector contracts by using both the commodity and the activity as features.

These are the modules we will need to use to generate the text matching:

!pip install unidecode
import pandas as pd
import re
import unidecode
import json
from google.colab import files

Pandas is used to handle the input data and create outputs, re and unidecode are used to clean text substrings, json is used to generate the output file.

We read the input file and separate its two worksheets into CPV and CNAE dataframes:

CNAE = pd.read_excel('/content/CPV_CNAE_EN.xlsx', 'CNAE_EN', dtype=str)
CPV = pd.read_excel('/content/CPV_CNAE_EN.xlsx', 'CPV_EN', dtype=str)

CPV.head()
CNAE.head()

A simple and effective method to determine the similarity of two sets of words is to calculate the Jaccard similarity of the sets. This external publication introduces the Jaccard similarity calculation or Jaccard index. We can adapt the code in this publication to suit our needs:

def jaccard_similarity(A, B):
    #Find intersection of two sets
    nominator = A.intersection(B)
    #Find union of two sets
    denominator = A.union(B)
    #Take the ratio of sizes
    similarity = len(nominator)/len(denominator)
    
    return similarity

We are calculating the ratio of the elements that intersect both sets to the union of those sets. Unfortunately, our CNAE and CPV codes are not sets yet, so we have to clear all non-alphanumeric characters from the string and keep just the unique elements of each string:

a = CNAE.iloc[0]['TITULO_CNAE2009']
print(a)
a = unidecode.unidecode(a)
print(a)
a = re.sub(r'[^A-Za-z0-9 ]+', '', a)
print(a)

Each of the print statements above produces the following output:

Cultivation of cereals (except rice), legumes and oilseeds 
Cultivation of cereals (except rice), legumes and oilseeds 
Cultivation of cereals except rice legumes and oilseeds

For example, the set of words for string a is then:

set(a.split())

{'Cultivation',  'and',  'cereals',  'except',  'legumes',  'of',  'oilseeds',  'rice'}

For convenience, we can create a function that will take a string and return the set of words in that string using our previous code:

def string_to_set(a_string):
  a_string = unidecode.unidecode(a_string)
  a_string = re.sub(r'[^A-Za-z0-9 ]+', '', a_string)
  return set(a_string.split())

We can test our function with a string b:

b = CPV.iloc[0]['Descripción 2003']
print(b)
print(string_to_set(b))

Products of agriculture, horticulture, hunting and related products. {'and', 'of', 'products', 'agriculture', 'horticulture', 'hunting', 'Products', 'related'}

if we calculate the Jaccard index for these two concepts, we obtain a value of 0.142.

jaccard_similarity(string_to_set(a), string_to_set(b))

The index we calculate has no intelligence beyond the appearance of the same word in both sets. If these words are semantically similar, our model will not be able to recognize this similarity. This approach is suitable for a first automated matching pass and will ultimately produce very simplistic results; it is as if we were reading the entries literally and ignoring similar meanings for different words. The initial matching can be fixed later, and the similarity analysis improved from this starting, simplified baseline., of course.

For this first pass, we will compare the Jaccard index for every CNAE to every CPV and keep the most similar pairs, that is, those who exhibit a greater similarity index:

by_jaccard = {}
for target_cnae in CNAE['TITULO_CNAE2009']:
  target_set = string_to_set(target_cnae)
  best_similarity = 0
  most_similar = None

  for cpv in CPV['Descripción 2003']:  
    cpv_set = string_to_set(cpv)
    similarity = jaccard_similarity(target_set, cpv_set)
    if similarity > best_similarity:
      best_similarity = similarity
      most_similar = cpv

  cnae_code = CNAE['COD_CNAE2009'][CNAE['TITULO_CNAE2009'] == target_cnae].values[0]
  match = CPV['Código CPV 2003'][CPV['Descripción 2003'] == most_similar].values

  
  print('----------------------------')
  if len(match) != 0:
    print(f'CANE {cnae_code}: {target_cnae}')
    print(f'CPV {match[0]}: {most_similar}')
    by_jaccard[cnae_code] = match[0]
  else:
    print(f'CANE {cnae_code}: {target_cnae}')
    print('No Match.')
    by_jaccard[cnae_code] = None
  print(f'Jaccard similarity: {best_similarity}')

with open('by_jaccard.json', 'w') as fp:
    json.dump(by_jaccard, fp)

files.download('by_jaccard.json')

The process takes a few seconds. These are some of the results where the similarity is easy to find when the exact words are used; the model cannot detect similar words; that is, it cannot generate a similarity vector of semantically equivalent words. As examples of good matching using Jacquard similarity:

And bad matching examples that exhibit the maximum possible similarity in exact wording but do not convey the same meaning:

This last "pieces of" example magically turn everything into... beef, generating strange results as the "of" word is present in both phrases and creating a meaningless similarity. This meaningless word is part of a set of non-semantic meaning-carrying words; they are dramatically necessary and semantically empty; these are the language-specific stop words. In our following publication, we will try to improve our activity matching by removing these stop words, as not everything is "of beef."

Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, automation, or intelligence gathering from satellite, drone, or fixed-point imagery. Also, check our AI-Powered Spanish public tender search application using sentence similarity analysis to provide better tender matches to selling companies.

Our www.contratacionfacil.es uses this CNAE-CPV matching to present better public contract tender search results.

The notebook for this demonstration is in this link.

OSTIRION

Simple Text Similarity for Automated Text Matching

Recent Posts

Comentarios