First steps with ML.NET — object detection realm

Max Pavlov
14 min read · Aug 17, 2021

ML.NET is a free machine learning library from Microsoft that powers the .NET ecosystem with machine learning capabilities. Not only does it allow for easy integration with existing machine learning models, but it is also a powerful toolset for creating your own reusable models.

What will you learn by reading this article?

As a .NET developer, you will be able to write C# code that uses ML.NET to perform specific machine learning tasks — text sentiment analysis and object detection. You will learn how to use an existing, pre-trained machine learning model (YOLO v4) that is capable of detecting objects of certain types (types are called “classes” in ML terms) in an arbitrary image.

Some theory

Machine learning is a subset of artificial intelligence (AI), an area of scientific study. While AI deals with making computers mimic human capabilities, machine learning enables computers to improve at the tasks they are given without being explicitly programmed for each improvement. The machine “learns” as it does its work. Deep learning, in turn, is a subset of machine learning that deals with the creation of deep learning models, which use multiple computational layers in the network architecture. As we won’t go deeper into ML theory in this article, I suggest you read up on these terms elsewhere for a deeper understanding. However, this is not required to start practical experiments with deep learning, as you will see in the following paragraphs.

Machine learning models use historical data to predict correct outputs for new inputs the model “has never seen before”. If a model is trained on a lot of photos of cats while being explicitly told “this is a cat”, it can then be given any photo it wasn’t trained on and, with relative confidence, it will try to guess whether there is a cat in that photo or not.

One final boring (read — non-practical) piece: the difference between a machine learning algorithm and a machine learning model:

  • An ML algorithm is an algorithm that is applied to data (typically during training).
  • An ML model is a program that takes inputs and predicts outputs.

ML algorithms + Data = ML Model.

If a model is a program, then what is its format?

Machine learning models can be created in different formats. Usually, the format is dictated by the software used to create the model, with the most notable examples being TensorFlow, Keras, Caffe, and Torch.

ONNX ecosystem logo

There is a model format created specifically for cross-tool compatibility — ONNX (pronounced “Onyx”). ONNX is a common platform: most ML model formats can be converted to and from ONNX, and one can also train a model into the ONNX format directly. For our computer vision demos, we will be consuming a popular model called YOLOv4, which its developers also provide in ONNX format.

A simple enough example of ML.NET

I suggest going through the simple ML.NET tutorial by Microsoft that can be found here. It will walk you through the installation of the dependencies and the provisioning of your first ML.NET-enabled project. At the end of the tutorial, you have an auto-generated solution with two projects — a class library and a console application. In my case, those are:

A simple auto-generated solution with classification task implementation sample

The generated solution provides the functionality of guessing the sentiment of a text comment, which is given as an input parameter. This type of ML task is called a classification task — the program classifies whether the given text represents a positive or a negative comment.

By the way, if you, for some reason, prefer not to follow the tutorial, you can simply get the source code being reviewed from GitHub.

There are several generated classes here that are of interest to us; however, as this article isn’t intended to go into the details of how one trains or creates a model, we will focus on model consumption. I am convinced that the best way to embrace a new technology is to use it first, and only then inspect how it is engineered. ModelBuilder.cs in SampleClassification.ConsoleApp is a generated class that shows how a model can be created and saved based on an automatically selected set of ML algorithms. We will cover this aspect in depth in subsequent articles, so we will not review the model creation logic here.

The SampleClassification.Model project hosts another class called ConsumeModel. This class exposes the functionality that other classes can “use” to work with the generated model. Let’s break it down to see how it works:

  1. ConsumeModel itself is just a wrapper class that exposes the Predict method:
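Here is a sketch of that class, close to what the tutorial generates (the model file path and namespace may differ in your run):

```csharp
using System;
using Microsoft.ML;

public class ConsumeModel
{
    // Lazily-created engine: built on first use, reused afterwards.
    private static readonly Lazy<PredictionEngine<ModelInput, ModelOutput>> PredictionEngine =
        new Lazy<PredictionEngine<ModelInput, ModelOutput>>(CreatePredictionEngine);

    // The single public entry point: an input POCO in, an output POCO out.
    public static ModelOutput Predict(ModelInput input)
    {
        return PredictionEngine.Value.Predict(input);
    }

    // Loads the trained model from disk and wraps it into a prediction engine.
    private static PredictionEngine<ModelInput, ModelOutput> CreatePredictionEngine()
    {
        MLContext mlContext = new MLContext();

        // "MLModel.zip" is the file the tutorial's trainer saved; adjust to your path.
        ITransformer mlModel = mlContext.Model.Load("MLModel.zip", out DataViewSchema inputSchema);

        return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
    }
}
```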

As we will learn along the way, typical consumption of an ML model in code involves defining input and output types that represent the parameters passed into the model and the format of the output the model produces after processing the inputs. The POCO classes ModelInput and ModelOutput in the generated solution are exactly those types in our demo case.

2. The Predict method takes an instance of the ModelInput class as a parameter, does its “ML magic”, and returns an instance of the ModelOutput class.

ModelInput class looks like this:
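A sketch of the generated input class; I am assuming a headerless dataset here, which is why Model Builder named the columns col0 and col1 (the property types may differ depending on how your label column was inferred):

```csharp
using Microsoft.ML.Data;

public class ModelInput
{
    // The text of the comment, loaded from the first column of the dataset.
    [ColumnName("col0"), LoadColumn(0)]
    public string Col0 { get; set; }

    // The known sentiment label used during training, loaded from the second column.
    [ColumnName("col1"), LoadColumn(1)]
    public string Col1 { get; set; }
}
```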

Before we proceed, let’s quickly review what the ColumnName and LoadColumn attributes are for. When ML.NET steps through the data transformation pipeline (it does so both while training a new model and while using one), it uses a generic container to store the data being processed and passed from one step of the pipeline to another. An implementation of an interface called IDataView is used in ML.NET to store intermediary and resultant data. Think of it as an in-memory data table that implements IDataView. A table operates with columns and rows (the classical tabular data model), and ML.NET operates with column names. Whenever we need ML.NET to map something between an IDataView and our POCO objects, we use attributes to “tell ML.NET” which column to map to. [ColumnName("col0"), LoadColumn(0)] basically means “take the value for this property from column 0 (the first one) and name it ‘col0’ for future reference by ML.NET mechanisms”. Since it’s an input model, the attributes tell ML.NET how to “parse” these POCO models onto a standard IDataView construct.

ModelOutput class looks like this:
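Roughly, it is this (a sketch following the Model Builder convention):

```csharp
using Microsoft.ML.Data;

public class ModelOutput
{
    // The model writes its verdict into the "PredictedLabel" column.
    [ColumnName("PredictedLabel")]
    public string Prediction { get; set; }

    // Raw per-class scores, useful for gauging how confident the prediction is.
    public float[] Score { get; set; }
}
```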

In this context, [ColumnName("PredictedLabel")] means that the result we are interested in mapping to the Prediction property is located in the PredictedLabel column of the output the model was configured to produce. If we sneak a peek into ModelBuilder, which I decided to omit in this article, we see exactly this in the model configuration:
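A hypothetical excerpt illustrating the point (dataProcessPipeline and trainer stand in for the generated steps, which vary from run to run):

```csharp
// The trainer emits its prediction into a key-typed "PredictedLabel" column;
// MapKeyToValue converts it back to the original label values, i.e. the column
// our ModelOutput.Prediction property binds to.
var trainingPipeline = dataProcessPipeline
    .Append(trainer)
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
```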

Now that we have reviewed the input and output models, we can finish the review of ConsumeModel.cs.

At the top of the class, we have a property called PredictionEngine. It is a lazy-loaded object of type PredictionEngine<ModelInput, ModelOutput>, used in the Predict method we’ve seen earlier to perform the actual prediction when called from external application code. As you progress in learning ML.NET, you will see that wrapping the ML.NET pipeline into an instance of the generic PredictionEngine class is not mandatory, but it is a very good practice. It serves as an encapsulating class for all the plumbing required to actually get the output after providing the input.

Just like with any other lazy-loaded object, when it is called for the first time, the initializer method (in our case, CreatePredictionEngine) is invoked. We saw its contents in the sketch above; let’s quickly inspect them:

One of the core classes in ML.NET — MLContext — appears here. The Microsoft documentation describes it concisely as the common context for all ML.NET operations: once instantiated, it provides a way to create components for data preparation, feature engineering, training, prediction, and model evaluation.

MLContext is like a RequestContext in ASP.NET where “everything we need to process the web request” is “mounted” onto a defined object and we interact with it. In ML.NET, MLContext is that class.

Now that we have a “home where everything can happen” (an MLContext instance), a pre-built model is loaded from a file (the mlContext.Model.Load call) and, finally, the prediction engine is created and returned (the mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput> call).

Pretty simple: just three steps, and we have a wrapper that takes input and gives us the desired output back. Well, there is obviously much more to this, like probabilities (how confident the model is in a given prediction). But more on that later. Now, let’s review how this works with visual data, not just text.

Object detection with ML.NET

A computer program being able to take any text and identify whether it is positive or negative, while also providing a piece of information on “how sure” it is about the prediction, is cool enough already, but let’s see how we can use ML.NET to inspect a camera input and detect objects on the fly.

In our simple text-to-sentiment prediction example above, we relied on an instance of the PredictionEngine<ModelInput, ModelOutput> class to make our predictions. For a computer vision task where we want to detect objects, we could do the same, but logically, ModelInput would be an image, and ModelOutput needs to contain information about the objects detected by the algorithm in the provided image. Well, it turns out we can do just that.

Here is a wider diagram that includes .NET classes involved in a “plug-and-play” ML.NET implementation we could use as an example:

Before we break down the proposed class hierarchy, let me briefly explain the term YOLO and which computer vision model we will use in our practical example.

In machine learning, YOLO is a popular abbreviation of “You Only Look Once”. It is the name of an object detection algorithm that has gained fame due to its performance and simplicity. The “look once” part refers to the essential aspect of the algorithm: object detection is performed via a single forward propagation through a neural network. Predictions for the entire image happen during a single algorithm run — thus the performance gain. A convolutional neural network is used to predict the various class probabilities and bounding boxes simultaneously. YOLO is also known to have great learning capabilities, which we will review in a subsequent article.

YOLO has seen several implementation versions published over recent years. We will now focus on the “already outdated” YOLOv4 by Alexey Bochkovskiy. One can easily download the ONNX version of this model from the ONNX GitHub repo here (you will need it if you want to run my demo project locally).

But let’s get back to the .NET code!

Our consuming code can be anything — a web app, a desktop app, or an API implementation. As long as the “consuming code” can feed our object detector an image, we should be able to detect and classify objects on it.

In my diagram above, IPredictionProvider and YOLOPredictionProvider are generic wrapper classes — syntactic sugar to nicely wrap the actual “meat” of the ML.NET implementation, which, similarly to our first demo project, relies on MLContext to do all of the work. Let’s inspect them more closely:

IPredictionProvider is an interface that our “consuming code” will be using:
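Its shape is roughly this (a sketch based on the class diagram; the exact member name in the repo may differ):

```csharp
using System.Collections.Generic;
using System.Drawing;

public interface IPredictionProvider
{
    // Takes an image and returns the list of objects detected in it.
    IReadOnlyList<YoloV4Result> Predict(Bitmap image);
}
```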

YOLOPredictionProvider is the implementation of IPredictionProvider. The essential part of the “setting up” happens in the class constructor. Please refer to the source code for proper inspection; the file name is YoloV4PredictionProvider.cs, lines 33–63.

Yeah, I know you can’t read the screenshot well; it’s here to outline where to look in the code.

Let’s dissect the code that implements the provider initialization and configuration. An MLContext instance is created first. The next line assigns an object of type EstimatorChain<OnnxTransformer>. As the name suggests, we are creating a chain of transformers that will be processing our input data in order to arrive at an output, which is our prediction.
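Here is a condensed sketch of that code, assuming the column names and shapes we will review below (refer to YoloV4PredictionProvider.cs for the authoritative version):

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Image;

var mlContext = new MLContext();

// Resize the bitmap -> extract its pixels into a vector -> run the ONNX model.
var pipeline = mlContext.Transforms.ResizeImages(
        inputColumnName: "bitmap",
        outputColumnName: "input_1:0",
        imageWidth: 416,
        imageHeight: 416,
        resizing: ImageResizingEstimator.ResizingKind.IsoPad)
    .Append(mlContext.Transforms.ExtractPixels(
        outputColumnName: "input_1:0",
        scaleImage: 1f / 255f,        // normalize pixel values to 0..1
        interleavePixelColors: true))
    .Append(mlContext.Transforms.ApplyOnnxModel(
        shapeDictionary: new Dictionary<string, int[]>
        {
            { "input_1:0",    new[] { 1, 416, 416, 3 } },
            { "Identity:0",   new[] { 1, 52, 52, 3, 85 } },
            { "Identity_1:0", new[] { 1, 26, 26, 3, 85 } },
            { "Identity_2:0", new[] { 1, 13, 13, 3, 85 } },
        },
        inputColumnNames: new[] { "input_1:0" },
        outputColumnNames: new[] { "Identity:0", "Identity_1:0", "Identity_2:0" },
        modelFile: "yolov4.onnx", // path to the downloaded ONNX model
        gpuDeviceId: null,        // no GPU
        fallbackToCpu: true));
```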

The first estimator in the chain is the ImageResizingEstimator, a class that takes the image from the IDataView (remember, we met this guy in the previous example?), resizes it, and places the resized image into a column with the name specified in the outputColumnName parameter — “input_1:0” in our example. So it takes an image and returns an image (a bitmap).

Next in line is the ImagePixelExtractingEstimator, added via the .Append(mlContext.Transforms.ExtractPixels…) call. This is an estimator that takes an image and converts it into a̶n̶ ̶a̶r̶r̶a̶y̶… sorry, a vector.

The final and most complex .Append statement adds an OnnxScoringEstimator, the one that actually does the main workload: it attempts to identify objects on an input image. Remember, at this point the image has already been converted into a numeric vector.

Let’s review how we configure the OnnxScoringEstimator with the parameters:

shapeDictionary is a configuration parameter that allows model users (us, developers, in this case) to specify the “dimensional configuration” of the model’s inputs and outputs. For simplicity’s sake, I suggest you just learn which shapes fit which model from the model’s documentation.

The inputColumnNames and outputColumnNames parameters allow us to specify the names of the input and output(s) after the transformation is performed. Basically, we define the naming to be used in the IDataView. It is normal practice to investigate the ONNX model’s “signature” (i.e., what it takes as an input and what it produces) via a visual tool like Netron, or via code. Netron renders the YOLOv4 ONNX model we are using like this:

It tells us the names of the inputs and outputs the model expects and produces. Those names must correspond in the IDataView mapping. Netron also reveals the format of the tensor shapes, with unk_ representing an unknown token which the decoder couldn’t process. We must acquire supplementary documentation and usage examples from the model authors since, as the example shows, not all of the parameters are “auto-discoverable” via utilities like Netron.

The modelFile parameter is self-explanatory — it allows us to specify where the .onnx model is on the volume, so that the ML.NET runtime can find it and interact with it.

gpuDeviceId defines which GPU within the system to use when working with the model; null in our case means that we do not intend to use a GPU.

fallbackToCpu controls whether the ONNX runtime will fall back to doing the work on the CPU if a GPU was requested but is unavailable.

Once the EstimatorChain is created, we need to create a TransformerChain. Remember, our ultimate goal is to create a correctly pre-configured PredictionEngine instance. We are almost there.

The transformer is obtained by providing an input data view (built from our POCO input class) to the EstimatorChain within a .Fit(…) method call.

We are “fitting” the input we are to supply to the estimator chain in order to “prepare” the chain and acquire a TransformerChain. Estimators are algorithms which, when “fitted” with a DataView, produce a transformer. Transformers are algorithms capable of producing a new DataView as an output when provided a DataView as an input. For instance, a transformer takes a DataView with an image and outputs a DataView with predictions. YoloV4BitmapData is a C# class that will be part of the IDataView when it is supplied for prediction.

One final step to generate a PredictionEngine is to call MLContext.Model.CreatePredictionEngine<TSrc, TDst> and supply both the input and expected output structures — again, pure C# classes in ML.NET’s case.
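Putting the last two steps together, a sketch of what this looks like (pipeline and mlContext come from the snippet above):

```csharp
// No actual training happens for a pre-trained ONNX model: fitting on an empty
// list merely materializes the TransformerChain with the correct schema.
var model = pipeline.Fit(mlContext.Data.LoadFromEnumerable(new List<YoloV4BitmapData>()));

// A strongly-typed engine: YoloV4BitmapData in, YoloV4Prediction out.
var predictionEngine = mlContext.Model
    .CreatePredictionEngine<YoloV4BitmapData, YoloV4Prediction>(model);
```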

Let’s discuss the POCO classes we use as inputs and outputs in the PredictionEngine. YoloV4BitmapData is trivial:
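A sketch of it, following the conventions of the YOLOv4 ML.NET sample:

```csharp
using System.Drawing;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Image;

public class YoloV4BitmapData
{
    // The raw input image; "bitmap" is the column our pipeline starts from.
    [ColumnName("bitmap")]
    [ImageType(416, 416)]
    public Bitmap Image { get; set; }
}
```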

However, the output class, YoloV4Prediction, is not.
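Here is an abridged sketch of it: the three output tensor properties plus the YOLOv4 constants discussed below (the full class in the repo also contains the GetResults parsing logic):

```csharp
using Microsoft.ML.Data;

public class YoloV4Prediction
{
    // Anchors: predefined box shapes (width, height), three per detection scale.
    static readonly float[][][] ANCHORS =
    {
        new[] { new float[] { 12, 16 },   new float[] { 19, 36 },   new float[] { 40, 28 } },
        new[] { new float[] { 36, 75 },   new float[] { 76, 55 },   new float[] { 72, 146 } },
        new[] { new float[] { 142, 110 }, new float[] { 192, 243 }, new float[] { 459, 401 } },
    };

    // Strides: the downsampling factor of each of the three output layers.
    static readonly float[] STRIDES = { 8, 16, 32 };

    // XYScale: per-layer factor used when decoding box centers back to image space.
    static readonly float[] XYSCALE = { 1.2f, 1.1f, 1.05f };

    // Grid sizes of the three outputs (416 / stride).
    static readonly int[] shapes = { 52, 26, 13 };

    // The three raw output tensors of YOLOv4, one per detection scale.
    [ColumnName("Identity:0")]
    public float[] Identity { get; set; }

    [ColumnName("Identity_1:0")]
    public float[] Identity1 { get; set; }

    [ColumnName("Identity_2:0")]
    public float[] Identity2 { get; set; }
}
```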

A few short comments on the most notable elements of the prediction class.

Anchors are a common aspect of object detection; their use is not limited to the YOLO family of algorithms. Anchors define the scale and form with which the ML algorithm will try to identify objects in the input image. Fine-tuning the anchors enables a model to be more effective at predicting objects. In our case, we just know which anchors are used with YOLOv4 and map those into .NET types.

Strides are also a common concept, defining the “step” between bounding boxes. Since object detection in YOLOv4 relies on the concentration of potential bounding boxes determined by each “predictor” within the model, the stride configuration also influences whether some objects are detected or skipped during detection. In our results-parsing class, strides are required to derive a bounding box that can be overlaid onto the original input image.

The XYScale parameter is used to determine the real bounding boxes of a detected object, given that the prediction output from the model originally can’t be mapped 1-to-1 onto the input image. Since we want to draw a detected object’s box on the input image, we need to know this value for the model.

The Shapes variable is also predefined for YOLOv4 and is used during results parsing.

The GetResults method of the YoloV4Prediction class is where the actual “parsing” of the model output happens, so that in the end we can make “normal C# bounding boxes” for predicted objects available to the calling application code.
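Consuming it looks roughly like this (a hypothetical call, assuming the sample’s GetResults signature; classesNames is the array of the 80 COCO class labels the model was trained on):

```csharp
var prediction = predictionEngine.Predict(new YoloV4BitmapData { Image = bitmap });

// Filter detections: keep boxes scoring above 0.3 and apply non-maximum
// suppression with an IoU threshold of 0.7.
var results = prediction.GetResults(classesNames, 0.3f, 0.7f);
```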

After our code in GetResults parses the “model-native” prediction information, we get a rather clean-looking POCO class instance — an instance of the YoloV4Result class:
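A sketch of it:

```csharp
public class YoloV4Result
{
    // Bounding box in original-image coordinates: x1, y1, x2, y2.
    public float[] BBox { get; }

    // The class label of the detected object, e.g. "dog".
    public string Label { get; }

    // How confident the model is about this detection (0..1).
    public float Confidence { get; }

    public YoloV4Result(float[] bbox, string label, float confidence)
    {
        BBox = bbox;
        Label = label;
        Confidence = confidence;
    }
}
```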

With the properties of that class, we can easily “use” the data in the application: it basically tells us where the detected objects are, what the class label is (what type of object was detected), and how confident the model was about the prediction.

While there is still a part of the demo application that we haven’t broken down in detail, I believe it is a rather trivial part of object and data manipulation that will be clear to a .NET engineer. As for the ML part, I think we have covered the practical aspects of using a third-party ML model for object detection and how it maps to ML.NET primitives.

Conclusions

ML.NET 1.6 is already a rather mature ML framework. It allows us to natively automate a variety of ML tasks, stick to managed code, and, in the recent version, even choose whether we want to bind our work to the GPU or the CPU.

In a future article, I will try to focus on enhancing an object detection model with our own training data using ML.NET.

Links

Source code for the reviewed examples:

