Quick Semantic Segmentation From YouTube Streams on Colab

Continuing our series on video streams and Google Colab, we will set up a quick method for semantic segmentation of video streams using YouTube as a source, reusing our code from this post and this other post. This segmentation is nowhere close to being "real-time", as too much video and inference data travels over the network; it serves as a basic demonstration of ad-hoc video segmentation for sources available on our local network.

We use the webcam overlay method to display the segmentation, our fixed YouTube utilities to stream a video feed into Colab, and a modified version of the Sony Nnabla Neural Network Libraries to perform a quick segmentation with a pre-trained model. We patch Nnabla on the fly, as it does not return the segmented frame image; it only saves it to disk. We therefore change the segmentation-saving function into one that produces a segmentation frame:

!pip install -q nnabla
!pip install -q nnabla-ext-cuda110

offender = '/usr/local/lib/python3.7/dist-packages/nnabla/models/semantic_segmentation/'
error = '''imsave(image_path, vis)'''
fix = '''return vis'''

# Read in the file:
with open(offender, 'r') as file:
  erroneous_code = file.read()

# Replace the target string:
correct_code = erroneous_code.replace(error, fix)

# Write the file out again:
with open(offender, 'w') as file:
  file.write(correct_code)
Originally, the last line of the function saved the visualization (the vis variable) to a file path; after the change, the function returns the visualization itself, an image frame containing the segmentation pixels. Even if not highly meaningful on its own, the segmentation image looks something like this; let the search algorithms make sense of this one, while we, mere humans, can say it is the semantic segmentation pixel set for a car:
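The read-replace-write patching pattern used above can be sketched in isolation; here we patch a throwaway temporary file rather than the Nnabla source, with a stand-in function body playing the role of the offending module:

```python
# Minimal sketch of the read-replace-write patching pattern,
# applied to a hypothetical temporary file instead of the Nnabla source.
import os
import tempfile

# Create a stand-in "module" file containing the offending line:
fd, path = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'w') as f:
    f.write('def save(vis, image_path):\n    imsave(image_path, vis)\n')

error = 'imsave(image_path, vis)'
fix = 'return vis'

# Read, replace, write back: the same three steps used on the Nnabla file.
with open(path, 'r') as f:
    code = f.read()
with open(path, 'w') as f:
    f.write(code.replace(error, fix))

with open(path) as f:
    patched = f.read()
print('return vis' in patched)  # the saving call is now a return statement
os.remove(path)
```

The same trick works on any installed package file, with the obvious caveat that a reinstall or runtime restart undoes the patch.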

Semantic segmentation overlay.

In any case, we will not save such images; we will overlay them onto a video feed to identify objects. For demonstration purposes, we will randomly switch between the various YouTube live-cam feeds listed in the livestreams tuple. Of course, switching video feeds repeatedly hurts the detection rate, which is already very low since we inefficiently fetch remote images and produce web-browser representations over and over. Detection performance could be raised to whatever frames-per-second figure we need by using local network video or adding more processors, that is, by increasing hardware usage; there is no hard FPS limit beyond network speed and the available GPU or TPU cores. We will also add the text of the detected elements to our image, even though, in this state, it will not display the correct color for each detection:

import random

import cv2
import pafy

def overlay_segmented_yt(image, output_image):

  # Some video feeds, choose one or let the machine pick one 
  # at random:
  livestreams = ('Be7OPScZz0s', 'St7aTfoIdYQ', 'z4WeAR7tctA',
                 '3qdEMXmwTkQ', 'vvOjJoSEFM0')
  url = random.choice(livestreams)
  video = pafy.new(url)
  best = video.getbest(preftype="mp4")
  stream = cv2.VideoCapture(best.url)
  ret, frame = stream.read()
  size = (600, 800)

  frame = cv2.resize(frame, size)
  frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

  segmented_image, detection_text = segment_image(frame)
  segmented_image = cv2.normalize(segmented_image, None,
                                  alpha = 0, beta = 255,
                                  norm_type = cv2.NORM_MINMAX,
                                  dtype = cv2.CV_32F)

  added_image = cv2.addWeighted(frame, 0.75,
                                segmented_image, 0.25,
                                0.0, dtype=cv2.CV_32F)

  output_image[:, :, 0:3] = added_image
  output_image[:, :, 3] = 1

  # Add the detection text.
  org = (50, 50)
  font = cv2.FONT_HERSHEY_SIMPLEX
  fontScale = 0.5
  color = (255, 0, 0)
  thickness = 1
  for text in detection_text:
    output_image = cv2.putText(output_image, text, org, font,
                      fontScale, color, thickness, cv2.LINE_AA)
    org = (org[0], org[1]+12)

  # Add our logo if present:
  logo_file = '/content/ostirion_logo.jpg'
  img = cv2.imread(logo_file)
  if img is not None:
    new_size = (50, 50)
    img = cv2.resize(img, new_size, interpolation=cv2.INTER_AREA)
    lim = -new_size[0] - 1
    output_image[lim:-1, lim:-1, 0:3] = img
    output_image[lim:-1, lim:-1, 3] = 1

  return output_image
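The addWeighted call above is just a per-pixel convex combination of the camera frame and the segmentation frame; a numpy-only sketch of the same blend, using small hypothetical arrays in place of real frames:

```python
# Numpy-only sketch of the 0.75/0.25 blend performed by cv2.addWeighted.
# The two 2x2 "frames" below are hypothetical stand-ins for real images.
import numpy as np

frame = np.full((2, 2, 3), 200.0)        # stand-in camera frame
segmentation = np.full((2, 2, 3), 40.0)  # stand-in segmentation frame

# addWeighted(src1, alpha, src2, beta, gamma) computes
# alpha * src1 + beta * src2 + gamma, element-wise:
blended = 0.75 * frame + 0.25 * segmentation + 0.0

print(blended[0, 0])  # [160. 160. 160.]
```

Raising the segmentation weight above 0.25 makes the colored pixel masks more prominent at the cost of washing out the underlying video frame.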

We use the previously patched Nnabla module for this example, as it offers a quick way to download a pre-trained model and generate semantic segmentations; any other similar library, module, or tool could be used:

# Import required modules:
import numpy as np

import nnabla as nn
from nnabla.utils.image_utils import imread
from nnabla.models.semantic_segmentation import DeepLabV3plus
from nnabla.models.semantic_segmentation.utils import ProcessImage
from nnabla.ext_utils import get_extension_context

The segmentation function, owing to our modifications, needs to return a list of texts containing the detections; this is required as OpenCV cannot insert new lines when adding text to an image:

target_h = 800
target_w = 600 
nn.set_default_context(get_extension_context('cudnn', device_id='0'))
# Build a Deeplab v3+ network, for example.
deeplabv3 = DeepLabV3plus('voc-coco', output_stride=16)
x = nn.Variable((1, 3, target_h, target_w), need_grad=False)  
y = deeplabv3(x)

def segment_image(input_image):  
  image = input_image
  # preprocess image
  processed_image = ProcessImage(image, target_h, target_w)
  input_array = processed_image.pre_process()

  # Compute inference
  x.d = input_array
  y.forward(clear_buffer=True)
  output = np.argmax(y.d, axis=1)

  # Apply post processing
  post_processed = processed_image.post_process(output[0])

  # Display predicted class names
  predicted_classes = np.unique(post_processed).astype(int)

  # We need several text lines:
  segment_text = ['']
  for i in range(predicted_classes.shape[0]):
    try:
      label = deeplabv3.category_names[predicted_classes[i]]
      segment_text.append(f'Detected: {label}')
    except IndexError:
      segment_text.append('Label error')

  # It will not save any image, we have modified Nnabla on the fly:
  output_image = processed_image.save_segmentation_image("segmented.png")
  return output_image, segment_text
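Because cv2.putText ignores newline characters, a multi-line caption has to arrive as a list of separate strings, which is why segment_text is built line by line; splitting an existing multi-line string works the same way:

```python
# cv2.putText draws a single line of text; a multi-line caption must be
# split into one string per line and drawn in a loop, shifting the text
# origin downwards for each line (as overlay_segmented_yt does).
caption = "Detected: car\nDetected: person"
lines = caption.split("\n")

print(lines)  # ['Detected: car', 'Detected: person']

# Hypothetical origins for each line, stepping 12 pixels down per line:
origins = [(50, 50 + 12 * i) for i in range(len(lines))]
print(origins)  # [(50, 50), (50, 62)]
```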

We are now ready to run our detection loop; the intricacies of JavaScript may result in a white or blank fixed image. Rerunning the cell clears any previous script running in the active cell:

# You may need to restart the Colab environment at this point.
# If you see a blank image, wait a couple of detection cycles, as the
# detection frequency is around 4 seconds.
# If you see a white-only window, rerun this cell.

label_html = 'Capturing Youtube Stream.'
img_data = ''

while True:
    js_reply = take_photo(label_html, img_data)
    if not js_reply:
        break

    image = js_reply_to_image(js_reply)
    drawing_array = get_drawing_array(image, overlay_segmented_yt)
    drawing_bytes = drawing_array_to_bytes(drawing_array)
    img_data = drawing_bytes

The result is a very slow detection similar to this. Nnabla appears to hallucinate airplanes constantly:

These detections could now be saved for analysis or even returned as a new feed for a constant segmentation stream, in this case, with a very low frame rate.
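Saving the detections for later analysis can be as simple as appending each cycle's text lines to a timestamped CSV log; a minimal sketch, assuming a detections list shaped like the segment_text output (the function name and file name are hypothetical):

```python
# Append detection labels with timestamps to a CSV log; the label list
# below stands in for the segment_text output of a real detection cycle.
import csv
import datetime

def log_detections(detections, path='detections_log.csv'):
    """Append one row per detected label, tagged with the current time."""
    now = datetime.datetime.now().isoformat()
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for label in detections:
            if label:  # skip the empty first line used for spacing
                writer.writerow([now, label])

log_detections(['', 'Detected: car', 'Detected: person'])
```

Each run appends to the same file, so a long segmentation session accumulates a time series of detections that can be analyzed offline.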

Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, automation, or intelligence gathering from satellite, drone, or fixed-point imagery.

The demonstration notebook is here.
