
Reading Google Patents From Google Colab

When working with patent data, the volume of information can become complex and challenging to interpret. Google Patents offers a good patent search page, but it is limited to manual, human searches: if we try to consult multiple pages automatically, we quickly receive a "non-human user" IP block. Fair enough, it is in their terms of service. But what if Google itself is accessing its servers with robots? Let us check whether we can launch a more frequent set of requests using their own Colab tools. The Colab Notebook for this post is linked at the end.

We will start with a trial search on Google Patents to check the contents of the web page display. For this, we import the following tools:

import requests
import re
import datetime

As a trial search, we look for all patents with a priority date between two given dates; we construct the URL matching that request and fetch the returned page as text:

start = '20211126'
end = '20211128'
# Google Patents search page with a priority-date range filter
url = f'https://patents.google.com/?before=priority:{end}&after=priority:{start}'
data = requests.get(url).text

This request does not return the information we need; it only contains the HTML rendering scaffolding. To find where the patent contents come from in the rendered page, we need to inspect the requests the page launches. For example, opening the developer tools by pressing F12 in Chrome shows us where the XHR query is. You may have to press F12 on the requested page and then reload it with F5 for this information to appear:

The XHR request address shows us the structure of the data request, which we can adapt to our needs with a similar, flexible layout:

# XHR endpoint observed in the browser's developer tools
main_url = 'https://patents.google.com/xhr/query?url='
pre = 'before%3Dpublication%3A'
post = '%26after%3Dpublication%3A'
pub_type = '%26type%3DPATENT'
closure = '&exp='
xhr = f'{main_url}{pre}{end}{post}{start}{pub_type}{closure}'
data = requests.get(xhr).text

This structure can be packed into a request generator to create requests in a simplified manner. For example, this function will generate a search request given a range for dates of publication:

def make_request(start_date, end_date):
    """Build an XHR search URL for a publication-date range."""
    start = start_date.strftime("%Y%m%d")
    end = end_date.strftime("%Y%m%d")
    # XHR endpoint observed in the browser's developer tools
    main_url = 'https://patents.google.com/xhr/query?url='
    pre = 'before%3Dpublication%3A'
    post = '%26after%3Dpublication%3A'
    p_type = '%26type%3DPATENT'
    closure = '&exp='
    return f'{main_url}{pre}{end}{post}{start}{p_type}{closure}'
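As a quick sanity check, we can decode the percent-encoded query that the generator embeds in the XHR address. This is a minimal sketch using only the standard library; the dates are arbitrary examples:

```python
import datetime
from urllib.parse import unquote

# Build the percent-encoded date-range query the same way the generator does
start = datetime.date(2021, 11, 26).strftime("%Y%m%d")
end = datetime.date(2021, 11, 28).strftime("%Y%m%d")
query = f'before%3Dpublication%3A{end}%26after%3Dpublication%3A{start}%26type%3DPATENT'

# Decoding recovers the human-readable filter the search engine receives
print(unquote(query))
# before=publication:20211128&after=publication:20211126&type=PATENT
```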

With the generator, we can now, for example, traverse the calendar starting 90 days ago, searching for the most relevant patent publications. Note that each request will initially yield only the first ten most relevant results:

date = datetime.date.today() - datetime.timedelta(days=90)
for i in range(90):
    start = date
    end = start + datetime.timedelta(days=2)
    xhr_request = make_request(start, end)
    data = requests.get(xhr_request).text
    # Normalize quotes and slashes so a single regex can extract the IDs
    data = data.replace('"', "-")
    data = data.replace('/', "-")
    found = re.findall(r'-id-:-patent-(.*?)-', data)
    print(f'Found {len(found)} patents between {start} and {end}:\n {found}')
    date = date + datetime.timedelta(days=1)

The patent identification codes can now be stored for further automated processing.
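For instance, a minimal sketch of that storage step, deduplicating the IDs collected across days and writing them to a CSV file (the dictionary, filename, and IDs below are illustrative placeholders, not real search results):

```python
import csv

# Placeholder results keyed by day, as the loop above might collect them
found_by_day = {
    '20211126': ['US0000001B2', 'US0000002A1'],
    '20211127': ['US0000002A1', 'US0000003B1'],
}

# Deduplicate across days and persist for later automated processing
unique_ids = sorted(set(pid for ids in found_by_day.values() for pid in ids))
with open('patent_ids.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['patent_id'])
    writer.writerows([pid] for pid in unique_ids)
```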

There is roughly a one-month delay between an official publication and its logging by Google Patents, so if time is of the essence, official publication services may be better suited. Depending on the rate of data access, official intellectual property institutions offer free-of-charge API access. Sooner or later, Google will block our Colab Notebook's automated access to the patent search engine; make sure your access rates and use are compatible with Google Patents' terms of service.
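One simple way to keep the request rate low is to pause between queries. This is a sketch of a throttling helper with random jitter; the interval values are arbitrary and should be tuned to whatever the terms of service permit:

```python
import random
import time

def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random interval between requests to keep the rate low."""
    time.sleep(random.uniform(min_s, max_s))

# Usage inside the daily loop (sketch):
# for i in range(90):
#     ...fetch and parse one day's results...
#     polite_pause()
```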

Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, automation, or intelligence gathering from satellite, drone, or fixed-point imagery.

The link to the notebook is here.


