VideoDB Documentation
Building Intelligent Machines

Part 3 - Training a Model

Ashutosh Trivedi
In the earlier parts of this series we discussed human decision making and understood the principle of “observe and respond”: humans observe their environment through their senses and respond according to their goals. Through this principle they gain more decision making power and are able to interact with their environment on a larger scale.
We want to create machines with capabilities to interact with the real world. We could apply the principle of observe and respond. To start with this process we can set simple goals for a machine. But before that, we have to solve another problem — How are machines going to observe the real world? They don’t have any of the five senses by which they can gather information. So what are they going to observe? You might have guessed it right – Data.
We write, speak, click pictures, record videos and also leave our digital footprints everywhere on the internet. For example, when you visit an online store, all your actions are stored as data: your purchases, the products you looked at, and even the movement of your cursor on the screen.
To create smart machines, we require dumb machines to gather data for observation. In order to remove human hands from the steering wheel, we need a mechanism to collect data about driving. This data can be gathered through images, videos and sensors such as radars and sonars.
We also need a way to convert this data into a form on which machines can apply the principle of observe and respond. This brings us to the problem of information transfer.


Machines are good at computing and they are going to observe the data. But this is not as simple as it sounds. Most of the data gets generated in the form of written language, oral sounds and moving or static pictures. Machines do not have any knowledge or information to infer language or images.
A machine’s world is a computational world and we will have to transform our real world information into computable units.
Let’s first discuss the computational world, and later we’ll see how we can transform the information.
Computers are computing machines – they can do fast calculations and a range of mathematical operations. They are also capable of handling linear algebra operations. Linear Algebra is a field of mathematics where we can operate on vectors. A vector is nothing but a point in space. Here is a vector “O” in two-dimensional space.
It can be described or uniquely identified by two characteristics. Here these characteristics are the x and y co-ordinates. The values of these characteristics are ( a, b ) for O.
Similarly, we can have vectors in a 3-d space. A point in 3-d space can also be shown on a 2-d computer screen using some tricks. It can be described uniquely by x, y and z co-ordinates. These x, y and z dimensions are not necessarily spatial. They can be any characteristics that describe a point uniquely for a specific domain.
For example, this is a dataset about cherry trees. Each tree is described by three characteristics: Girth, Height and Volume. We can also see this data in 3-dimensional space. The tree at index 1 is a vector, [8.3, 70, 10.3]. Each blue dot in this chart uniquely describes a tree. Each blue dot is a vector in 3-d space.
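To make the idea of a tree as a vector concrete, here is a minimal Python sketch. The first tree uses the values quoted above; the second tree and the choice of Euclidean distance as an example operation are assumptions made purely for illustration.

```python
import math

# Tree at index 1 from the dataset above: [girth, height, volume].
tree_1 = [8.3, 70.0, 10.3]

# A second tree with made-up values, used only for illustration.
tree_2 = [10.0, 75.0, 15.0]


def distance(u, v):
    """Euclidean distance between two vectors with the same number of dimensions."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(len(tree_1))                # number of dimensions of the vector
print(distance(tree_1, tree_2))   # how far apart the two trees are in this space
```

The same function works unchanged for vectors of 7 or 700 dimensions, which is the point: the computer does not need to visualize the space in order to operate on it.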
Computers can easily operate on vectors with more than 3 dimensions. It’s just hard for humans to visualize more than 3 dimensions, but there is nothing to worry about: linear algebra gives us all the power to operate on multi-dimensional vectors. If we have information in numerical form like this example, we don’t have much of a challenge. We can easily represent the information as vectors. But what about sound, text, images and videos? They are a large part of real world information.


So now it’s our responsibility to transform real world information into vectors in such a way that we don’t lose much of it.
To understand this process let’s play a game of training a machine.
These two cute creatures don’t need any introduction.
We have to build a machine with the decision making power to tell a cat from a dog. So, if you show a cat’s picture to this machine, it should be able to tell you “Yeah, it’s a cat” with a certain accuracy, even if the machine has never seen that particular cat image. The same goes for dogs. If you show any other image which is neither a cat nor a dog, the response should be “others”.
So, our decision making machine can decide by looking at the image and produce one of the three responses — cat, dog, others.
To have this decision making power, this machine has to go through the principle of observe and respond by observing lots of cat and dog images. As we discussed earlier in this series, the outcome of this process is “learning”. Once it has learned, it should be able to identify the right response for any image, just like you were able to identify an unknown tree even if you had never seen it. You went through the principle of observe and respond.
Coming back to information transfer – How are we going to transfer the information in the images to the computer?
For this, you will have to think like a computer. Computers can understand numbers and compute with them. So, the ONLY answers it can understand are:
1. Yes/No – e.g. Does this image have a tail? Answer is Yes/No.
2. Numbers – e.g. What is the length of the tail? Answer is 0.4.
3. Categories – e.g. Out of 5 colors (White, Brown, Black, Blue, Green), what is the color of the tail? Answer is Black.
See how specific the questions have to be to get an answer in a form that computers can understand? We, humans, don’t think this way and hence the challenge – how do you give a numeric answer to the question “Is this a dog or a cat?”
Programmers have to create such answers, even the 5 categories of color in the question above.
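As a rough sketch of how the three kinds of answers could be turned into numbers, the Python below encodes Yes/No as 1 or 0, keeps numbers as they are, and uses one-hot encoding for the color category. One-hot encoding is a common choice but an assumption here, and the function names are hypothetical.

```python
# The five color categories from the question above.
COLORS = ["White", "Brown", "Black", "Blue", "Green"]


def encode_yes_no(answer):
    """Yes/No becomes 1.0 or 0.0."""
    return 1.0 if answer else 0.0


def encode_category(value, categories):
    """One-hot encoding: 1.0 in the slot of the matching category, 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1.0 if c == value else 0.0 for c in categories]


def encode_image(has_tail, tail_length, tail_color):
    """Combine the three kinds of answers into one flat list of numbers."""
    return (
        [encode_yes_no(has_tail), float(tail_length)]
        + encode_category(tail_color, COLORS)
    )


vector = encode_image(has_tail=True, tail_length=0.4, tail_color="Black")
print(vector)  # [1.0, 0.4, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Notice that a color outside the five predefined categories cannot be encoded at all, which is exactly why the programmer has to decide on the categories in advance.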
So, what questions would you ask to determine if the image is a cat or a dog?
Let me give a few hints. You can ask things like:
Length of the tail?
Radius of the eyes?
Number of legs?
Can you think of more such questions or attributes which can describe an image for the cat and dog problem? Grab your notepad and write them down.

Okay, if you are done writing down your questions, they must be very close to these questions. These questions are nothing but attributes of an image which you can pass on to the computer.
So, for the computer, an image is described by attributes. These attributes are also called Features. Keep in mind that some of these features are too hard for a programmer to compute in real life, but for now let’s assume that it’s possible. Since we had to have questions with numerical and categorical answers, these features can be used to form a vector.
Using these features we can describe any image in the vector space. In vector space, features are dimensions. Here, I am only able to plot 2 dimensions (f1 & f2). But you can easily imagine that each image is now converted into a point in space (a vector) of 7 dimensions (f1 to f7).
This process is called feature engineering and vectorization. It is one of the most important subdomains of machine learning. It is a vast field and we will explore these topics in coming posts.
So, now we are one step closer to the final phase of this problem – “Learning”. We have (somewhat) solved the problem of information transfer by communicating a reality to a computer.
Our process is not optimal though. You may have realized that information transfer in this way is not accurate and that we are losing a lot of essential information. Imagine yourself looking at these attribute values to predict which vector is a cat and which one is a dog. It’s hard, isn’t it?


Coming back to the problem, we have our observations as vectors. We also have one more important piece of information with us: the label of each image.
In our case the labels are Cat, Dog and Other. We can use these labels to supervise the responses of the machine. Now, finally, our images are vectors. From here on, we are going to do a little bit of computation and, believe it or not, it won’t be anything fancier than plain addition. We can computationally represent this labeled information in the following way:
Let’s take 1000 images to learn from. For each cat, dog and other image, we know the values of the Fs, which are the attributes represented as a vector: ( F1, F2, …, F7 ).
We also have the labels. We represent this information as +1 for cat, 0 for others and -1 to represent the dog. This labeled set is called the training dataset and because the labels are already provided, this process is called Supervised Machine Learning.
We have Fs and Ws in these 1000 equations. Each image is now a simple equation with known Fs and unknown Ws, of the form W1×F1 + W2×F2 + … + W7×F7 = label. Now, to solve these equations so that they give an answer of -1, 0 or +1, we need to find the unknown values of the Ws.
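Assuming each equation is a weighted sum of the form W1×F1 + … + W7×F7 = label, a single image’s equation can be sketched in Python as follows. The feature values, the weights and the `nearest_label` helper are all hypothetical and chosen only to show the arithmetic.

```python
# Labels as defined above: +1 for cat, 0 for others, -1 for dog.
LABELS = [1, 0, -1]


def weighted_sum(weights, features):
    """Left-hand side of one image's equation: W1*F1 + W2*F2 + ... + Wn*Fn."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def nearest_label(score):
    """Snap a raw score to the closest of the three labels (an illustrative choice)."""
    return min(LABELS, key=lambda label: abs(score - label))


# Hypothetical 7-feature vector for one image, and a hypothetical guess for the Ws.
features = [1.0, 0.4, 0.0, 0.0, 1.0, 0.0, 0.0]
weights = [0.5, 1.0, 0.0, 0.0, 0.1, 0.0, 0.0]

score = weighted_sum(weights, features)
print(score, nearest_label(score))
```

With known Fs, everything about the answer depends on the Ws, which is why finding good Ws is the whole problem.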
Finding out all the 7000 Ws is not possible, and we don’t need to. What if we find the optimal values of the Ws which fit all 1000 equations and give the answer on the right-hand side most of the time?
These optimal values of Ws should not only fit the 1000 equations we have, but also the unknown images (or equations) it has never seen. These optimal values of the Ws — ( W1, W2, …W7 ) are called the Machine Learning model.
For example, if we pass any other image which we have never provided earlier for training, it should be able to tell the answer ( +1, -1, 0 ) with very high accuracy by using the model ( optimal value of Ws ).
These optimal values of the Ws are found by special optimization algorithms. These algorithms aim to minimize not only the current, known error, but also the error on unseen images. As a principle, the error of the unseen future is called risk. More precisely, these algorithms are minimising the risk of misidentifying a cat or a dog.
These risk-minimizing algorithms are one type of Machine Learning algorithm, and they come under the principle of Empirical Risk Minimization (ERM).
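To give a feel for how one shared set of Ws might be found, here is a small Python sketch that uses plain gradient descent on the mean squared error. This is only one possible optimization algorithm, not necessarily the one a real system would use. The five training examples are invented, and they have 2 features instead of 7 to keep the example short.

```python
def predict(weights, features):
    """Weighted sum of the features, i.e. the left-hand side of an image's equation."""
    return sum(w * f for w, f in zip(weights, features))


def mean_squared_error(weights, dataset):
    """Average squared gap between predictions and labels on the given examples."""
    return sum((predict(weights, f) - y) ** 2 for f, y in dataset) / len(dataset)


def train(dataset, steps=500, learning_rate=0.1):
    """Find one shared set of weights by gradient descent on the training error."""
    n_features = len(dataset[0][0])
    weights = [0.0] * n_features
    for _ in range(steps):
        gradient = [0.0] * n_features
        for features, label in dataset:
            error = predict(weights, features) - label
            for i, f in enumerate(features):
                gradient[i] += 2 * error * f / len(dataset)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Invented training set: (features, label) with +1 = cat, -1 = dog, 0 = others.
training_set = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], -1),
    ([0.1, 0.9], -1),
    ([0.5, 0.5], 0),
]

model = train(training_set)
print(model)
print(mean_squared_error(model, training_set))

# An invented image that was not in the training set.
unseen = [0.8, 0.2]
print(predict(model, unseen))
```

Note that this sketch only minimizes the error on the examples it was given. Whether the resulting Ws also work on unseen images has to be checked separately, for instance by holding back some labeled images and measuring the error on them.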
If you have survived this far, congratulations! You have just trained a Machine Learning (ML) model, theoretically. In upcoming posts of this series, I will explain a few of these Machine Learning algorithms. It might require a little bit of mathematical background, but I will try my best to explain it in an intuitive way.