Removing Stop Words for Simple Phrase Similarity Analysis

In our previous post, we saw how stop words can create spurious similarities between two short phrases. This is the code we used to compute the Jaccard similarity for our phrases, with a timing element added to check that the performance of the similarity calculation is maintained once stop words are removed:

!pip install unidecode
import pandas as pd
import re
import unidecode
import json
from google.colab import files
import time

CNAE = pd.read_excel('/content/CPV_CNAE_EN.xlsx', 'CNAE_EN', dtype=str)
CPV = pd.read_excel('/content/CPV_CNAE_EN.xlsx', 'CPV_EN', dtype=str)

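def jaccard_similarity(A, B):
    # Find the intersection of the two sets
    numerator = A.intersection(B)
    # Find the union of the two sets
    denominator = A.union(B)
    # Two empty sets have no words to compare; avoid dividing by zero
    if not denominator:
        return 0
    # Take the ratio of sizes
    similarity = len(numerator)/len(denominator)

    return similarity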

def string_to_set(a_string):
  a_string = unidecode.unidecode(a_string)
  a_string = re.sub(r'[^A-Za-z0-9 ]+', '', a_string)
  return set(a_string.split())

by_jaccard = {}
for target_cnae in CNAE['TITULO_CNAE2009']:
  start = time.time()
  target_set = string_to_set(target_cnae)
  best_similarity = 0
  most_similar = None

  for cpv in CPV['Descripción 2003']:  
    cpv_set = string_to_set(cpv)
    similarity = jaccard_similarity(target_set, cpv_set)
    if similarity > best_similarity:
      best_similarity = similarity
      most_similar = cpv

  cnae_code = CNAE['COD_CNAE2009'][CNAE['TITULO_CNAE2009'] == target_cnae].values[0]
  match = CPV['Código CPV 2003'][CPV['Descripción 2003'] == most_similar].values

  
  print('----------------------------')
  if len(match) != 0:
    print(f'CNAE {cnae_code}: {target_cnae}')
    print(f'CPV {match[0]}: {most_similar}')
    by_jaccard[cnae_code] = match[0]
  else:
    print(f'CNAE {cnae_code}: {target_cnae}')
    print('No Match.')
    by_jaccard[cnae_code] = None
  print(f'Jaccard similarity: {best_similarity}')
  print('Elapsed Time: ', time.time()-start)

with open('by_jaccard.json', 'w') as fp:
    json.dump(by_jaccard, fp)

files.download('by_jaccard.json') 

We are using the same file as in our previous post, containing our English translation of the Spanish Common Procurement Vocabulary (CPV) and Economic Activity Classification (CNAE):

CPV_CNAE_EN.xlsx (XLSX, 201 KB)

Several tools are available to remove stop words from our similarity analysis. The Natural Language Toolkit for Python (NLTK for short) is one of the most straightforward solutions. Gensim and SpaCy are also valid alternatives, although they are more complex and powerful than a first pass at similarity analysis requires. To use the NLTK stop word removal tool, we first need to download the stop word corpus:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

We can print the stop word list to inspect it; these are words that carry little substantive meaning:

print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',...

By adding stop word removal to our string processing, we keep only the words in the text that carry substantive meaning:

a = CNAE.iloc[0]['TITULO_CNAE2009']
a = re.sub(r'[^A-Za-z0-9 ]+', '', a)  
stop_words = set(stopwords.words('english'))   
filtered = [w for w in a.split() if not w.lower() in stop_words]
print(filtered)
['Cultivation', 'cereals', 'except', 'rice', 'legumes', 'oilseeds']
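
To see why this filtering matters for the Jaccard score, the following minimal sketch compares two short phrases with and without stop words. The phrases and the small `STOP_WORDS` set are hypothetical stand-ins chosen for illustration (they are not the NLTK corpus or rows from the spreadsheet), so the block runs without any downloads. Unlike the article's code, it lowercases every token.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop word list.
STOP_WORDS = {"of", "and", "the", "for", "in"}

def tokens(text, remove_stop_words=False):
    """Strip punctuation, lowercase, and optionally drop stop words."""
    words = re.sub(r"[^A-Za-z0-9 ]+", "", text).lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return set(words)

def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Illustrative phrases that share only stop words.
phrase_a = "Manufacture of paper and stationery"
phrase_b = "Processing of beef and veal"

with_stop = jaccard(tokens(phrase_a), tokens(phrase_b))
without_stop = jaccard(tokens(phrase_a, True), tokens(phrase_b, True))

print(f"With stop words:    {with_stop:.2f}")     # 2 shared words out of 8
print(f"Without stop words: {without_stop:.2f}")  # no shared words remain
```

With stop words kept, the two unrelated phrases score 0.25 purely because both contain "of" and "and"; once those are removed, the score drops to 0.00.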

Using this approach, we modify our string-to-set function to keep meaningful words only:

def string_to_set(a_string,  stop_words=set(stopwords.words('english'))):
  a_string = unidecode.unidecode(a_string)
  a_string = re.sub(r'[^A-Za-z0-9 ]+', '', a_string)    
  filtered = (w for w in a_string.split() if not w.lower() in stop_words)
  return set(filtered)

In the function above, the stop word set is defined as a default argument value. Python evaluates default values once, when the function is defined, so the relatively expensive conversion of the stop word list into a set is not repeated on every call. The following version returns the same result but runs slower when called inside a for loop, because it rebuilds the set each time:

def str_to_set_slow(a_string):
  # The stop word set is rebuilt on every call, which makes this version slower
  stop_words = set(stopwords.words('english'))
  a_string = unidecode.unidecode(a_string)
  a_string = re.sub(r'[^A-Za-z0-9 ]+', '', a_string)    
  filtered = (w for w in a_string.split() if not w.lower() in stop_words)
  return set(filtered)
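
The cost of rebuilding the set on every call can be measured with `timeit`. The sketch below rests on an assumption: `load_stop_words` is a hypothetical helper that mimics `stopwords.words('english')` by returning a plain list, so the block runs without NLTK. Absolute timings will differ from those obtained with the real corpus.

```python
import timeit

def load_stop_words():
    """Hypothetical stand-in for stopwords.words('english'): returns a list."""
    return ["i", "me", "my", "we", "our", "you", "the", "of", "and", "in"] * 20

def to_set_fast(text, stop_words=set(load_stop_words())):
    # The default value is evaluated once, when the function is defined.
    return {w for w in text.split() if w.lower() not in stop_words}

def to_set_slow(text):
    # The set is rebuilt on every call.
    stop_words = set(load_stop_words())
    return {w for w in text.split() if w.lower() not in stop_words}

phrase = "Cultivation of cereals and legumes"
fast_result = to_set_fast(phrase)
slow_result = to_set_slow(phrase)

# Both versions return the same tokens; only the cost per call differs.
print(fast_result == slow_result)
print("fast:", timeit.timeit(lambda: to_set_fast(phrase), number=10_000))
print("slow:", timeit.timeit(lambda: to_set_slow(phrase), number=10_000))
```

One trade-off of the default-argument approach is that the set is fixed when the function is defined; a caller who needs a different stop word list has to pass it explicitly through `stop_words`.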

Altering our initial similarity calculation to use the new stop-word-removing function yields the following code; this time we pass the dataframes and the columns to match as function arguments:

def obtain_similarity(df_a, text_a, code_a, df_b, text_b, code_b):
  by_jaccard = {}
  for target_cnae in df_a[text_a]:
    start = time.time()
    target_set = string_to_set(target_cnae)
    best_similarity = 0
    most_similar = None

    for cpv in df_b[text_b]:  
      cpv_set = string_to_set(cpv)
      similarity = jaccard_similarity(target_set, cpv_set)
      if similarity > best_similarity:
        best_similarity = similarity
        most_similar = cpv

    cnae_code = df_a[code_a][df_a[text_a] == target_cnae].values[0]
    match = df_b[code_b][df_b[text_b] == most_similar].values

    
    print('----------------------------')
    if len(match) != 0:
      print(f'Text A {cnae_code}: {target_cnae}')
      print(f'Text B {match[0]}: {most_similar}')
      by_jaccard[cnae_code] = match[0]
    else:
      print(f'Text A {cnae_code}: {target_cnae}')
      print('No Match.')
      by_jaccard[cnae_code] = None
    print(f'Jaccard similarity: {best_similarity}')
    print('Elapsed Time: ', time.time()-start)

  with open('by_jaccard.json', 'w') as fp:
      json.dump(by_jaccard, fp)

  files.download('by_jaccard.json') 
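
For readers who want to try the matching logic without pandas, Colab, or NLTK, here is a dependency-free sketch of the same best-match search. It is not the code used in this post: the codes, descriptions, `STOP_WORDS` set, and the `best_matches` helper are all made up for illustration. It also assumes two small deviations, lowercasing the tokens and tokenising the candidate descriptions once rather than once per target.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list.
STOP_WORDS = {"of", "and", "the", "for", "in"}

def to_set(text):
    """Strip punctuation, drop stop words, and lowercase the remaining words."""
    words = re.sub(r"[^A-Za-z0-9 ]+", "", text).split()
    return {w.lower() for w in words if w.lower() not in STOP_WORDS}

def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(items_a, items_b):
    """Map each code in items_a to the code of its most similar item in items_b."""
    # Tokenise the candidates once instead of once per target.
    sets_b = [(code, to_set(text)) for code, text in items_b]
    result = {}
    for code_a, text_a in items_a:
        set_a = to_set(text_a)
        best_code, best_score = None, 0.0
        for code_b, set_b in sets_b:
            score = jaccard(set_a, set_b)
            if score > best_score:
                best_code, best_score = code_b, score
        result[code_a] = best_code  # None when nothing overlaps
    return result

# Made-up codes and descriptions, for illustration only.
activities = [("A1", "Retail sale of stationery"), ("A2", "Growing of rice")]
products = [("P1", "Stationery and office supplies"),
            ("P2", "Preparation of beef products"),
            ("P3", "Rice")]

matches = best_matches(activities, products)
print(matches)
```

The `obtain_similarity` function above performs the same search over the complete CNAE and CPV tables.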

Running the similarity calculation generates the result we were looking for: stationery is matched with stationery, not with beef products as in our previous post.

This is a minor improvement in the similarity calculation for these two sets of product and activity descriptions. Vectorizing the words may not help much, as word vectors are based on paradigmatic relationships and not necessarily on strictly synonymous meaning. Improving this model will involve exploiting semantic similarities.


Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, automation, or intelligence gathering from satellite, drone, or fixed-point imagery. Also, check our AI-Powered Spanish public tender search application using sentence similarity analysis to provide better tender matches to selling companies.


Our www.contratacionfacil.es uses this CNAE-CPV matching to present better public contract tender search results.


The notebook for this demonstration is in this link.
