29 January 2019

Labeling Toy Aircraft in 3D space using an ONNX model and Windows ML on a HoloLens

Intro

Back in November I wrote about a POC I built to recognize and label objects in 3D space, using a Custom Vision Object Recognition project for that. Back then, as I wrote in my previous post, you could only use this kind of project by uploading the images you wanted analyzed to the model in the cloud. In the meantime, Custom Vision Object Recognition models can be downloaded in various formats - and one of them is ONNX, which can be used in Windows ML. And thus it can run on a HoloLens to do AI-powered object recognition.

Which is exactly what I am going to show you. In essence, the app still does the same as in November, but now it does not use the cloud anymore - the model is trained and created in the cloud, but can be executed on an edge device (in this case a HoloLens).

The main actors

These are basically still the same:

  • CameraCapture watches for an air tap, and takes a picture of where you look
  • ObjectRecognizer receives the picture and feeds it to the 'AI', which is now a local process
  • ObjectLabeler shoots for the spatial map and places labels.

As I said - the app is basically still the same as the previous version, only now it uses a local ONNX file.

Setting up the project

Basically you create a standard empty HoloLens project with the MRTK and configure it as you always do. Be sure to enable the camera capability (WebCam), of course.

Then you simply download the ONNX file from your model; the procedure is described in my previous post. Next, you need to place the model file (model.onnx) in a folder called "StreamingAssets" in the Unity project. This procedure is described in more detail in this post by Sebastian Bovo of the AppConsult team. He uses a different kind of model, but the workflow is exactly the same.

Be sure to adapt the ObjectDetection.cs file as I described in my previous post.

Functional changes to the original project

Like I said, the differences between this project and the online version are for the most part inconsequential. Functionally, only one thing changed: instead of showing the picture it took before calling the (online) model, the app now plays a click sound when you air tap to start the recognition process, followed by either a 'pringg' sound or a buzz sound, indicating that the recognition process succeeded (i.e. found at least one toy aircraft) or failed (i.e. did not find any toy aircraft) respectively.
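
To give an idea of what that feedback looks like in code, here is a minimal sketch - not the actual project code, in which CameraCapture and ObjectLabeler each have their own sound properties (see the technical changes below). The field names and the single helper method are assumptions for illustration:

using UnityEngine;

public class RecognitionFeedback : MonoBehaviour
{
    // Assumed serialized fields - assign the clips and source in the Unity editor
    [SerializeField] private AudioClip _successSound;
    [SerializeField] private AudioClip _failureSound;
    [SerializeField] private AudioSource _audioSource;

    // Play a 'pringg' when at least one aircraft was found, a buzz otherwise
    public void PlayRecognitionResult(bool foundAircraft)
    {
        _audioSource.PlayOneShot(foundAircraft ? _successSound : _failureSound);
    }
}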

Technical changes to the original project

  • The ObjectDetection file, downloaded from CustomVision.ai and adapted for use in Unity, has been added to the project
  • CustomVisonResult, containing all the JSON serialization code needed to deal with the online model, has been deleted. The ObjectDetection file contains all the classes we need
  • In all classes I have adapted the namespace from "CustomVison" *cough* to "CustomVision" (sorry, typo ;) ).
  • ObjectDetection uses the root class PredictionModel instead of Prediction, so that has been adapted in all files that use it (see the sketch after this list). The affected classes are:
    • ObjectRecognitionResultMessage
    • ObjectLabeler
    • ObjectRecognizer
    • PredictionExtensions
  • Both CameraCapture and ObjectLabeler now have sound properties and play sounds on the appropriate events
  • ObjectRecognizer has been extensively changed to use the local model. I will describe this in detail below
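
For context: the PredictionModel class in the downloaded ObjectDetection.cs looks roughly like the sketch below (from memory - check the actual generated file). Note that the bounding box values are relative to the image size:

public sealed class PredictionModel
{
    public double Probability { get; set; }
    public string TagName { get; set; }
    public BoundingBox BoundingBox { get; set; }
}

public sealed class BoundingBox
{
    // All values are normalized (0..1) relative to the image dimensions
    public double Left { get; set; }
    public double Top { get; set; }
    public double Width { get; set; }
    public double Height { get; set; }
}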

Object recognition - the Windows ML way

The first part of the ObjectRecognizer initializes the model:

using UnityEngine;
#if UNITY_WSA && !UNITY_EDITOR
using System.Threading.Tasks;
using Windows.Graphics.Imaging;
using Windows.Media;
#endif

public class ObjectRecognizer : MonoBehaviour
{
#if UNITY_WSA && !UNITY_EDITOR
    private ObjectDetection _objectDetection;
#endif

    private bool _isInitialized;

    private void Start()
    {
        Messenger.Instance.AddListener<PhotoCaptureMessage>(
          p=> RecognizeObjects(p.Image, p.CameraResolution, p.CameraTransform));
#if UNITY_WSA && !UNITY_EDITOR
        _objectDetection = new ObjectDetection(new[] {"aircraft"}, 20, 0.5f, 0.3f);
        Debug.Log("Initializing...");
        _objectDetection.Init("ms-appx:///Data/StreamingAssets/model.onnx").ContinueWith(
            p =>
            {
                Debug.Log("Initializing ready");
                _isInitialized = true;
            });
#endif
    }

Notice here, too, the liberal use of preprocessor directives, just like in my previous post. In the Start method we create a model from the ONNX file that's in StreamingAssets, using the Init method I added to ObjectDetection. Since we can't make the Start method awaitable, the ContinueWith needs to finish the initialization.
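
For reference, that Init method looks roughly like the sketch below (the real version is described in my previous post); it resolves the ONNX file that Unity packages into Data/StreamingAssets and creates a Windows ML session from it. The model and session field names are assumptions here:

// Requires: using System; using Windows.AI.MachineLearning; using Windows.Storage;
public async Task Init(string modelFileName)
{
    // Resolve the ONNX file Unity deployed into Data/StreamingAssets
    var modelFile = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri(modelFileName));

    // Load the model and create a session to evaluate images with
    model = await LearningModel.LoadFromStorageFileAsync(modelFile);
    session = new LearningModelSession(model);
}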

As you can see, the arrival of a PhotoCaptureMessage from the CameraCapture behavior fires off RecognizeObjects, just like in the previous app.

public virtual void RecognizeObjects(IList<byte> image, 
                                     Resolution cameraResolution, 
                                     Transform cameraTransform)
{
    if (_isInitialized)
    {
#if UNITY_WSA && !UNITY_EDITOR
        RecognizeObjectsAsync(image, cameraResolution, cameraTransform);
#endif

    }
}

But unlike in the previous app, this time it does not fire off a Unity coroutine, but a private async method:

#if UNITY_WSA && !UNITY_EDITOR
private async Task RecognizeObjectsAsync(IList<byte> image, Resolution cameraResolution, Transform cameraTransform)
{
    using (var stream = new MemoryStream(image.ToArray()))
    {
        var decoder = await BitmapDecoder.CreateAsync(stream.AsRandomAccessStream());
        var sfbmp = await decoder.GetSoftwareBitmapAsync();
        sfbmp = SoftwareBitmap.Convert(sfbmp, BitmapPixelFormat.Bgra8,
                                       BitmapAlphaMode.Premultiplied);
        var picture = VideoFrame.CreateWithSoftwareBitmap(sfbmp);
        var prediction = await _objectDetection.PredictImageAsync(picture);
        ProcessPredictions(prediction, cameraResolution, cameraTransform);
    }
}
#endif

This method is basically 70% converting the raw bytes of the image into something the ObjectDetection class's PredictImageAsync can handle. I have this post in the Unity forums and this post on the MSDN blog site by my friend Matteo Pagani to thank for helping me piece this together. All this converting is necessary because I am a stubborn idiot - I wanted to take a picture instead of using a frame from the video stream, but then you have to convert the photo into a video frame.

The second-to-last line of code actually calls PredictImageAsync - essentially a black box for the app - and then the predictions are processed more or less like before:

#if UNITY_WSA && !UNITY_EDITOR
private void ProcessPredictions(IList<PredictionModel> predictions,
                                Resolution cameraResolution, Transform cameraTransform)
{
    var acceptablePredictions = predictions.Where(p => p.Probability >= 0.7).ToList();
    Messenger.Instance.Broadcast(
       new ObjectRecognitionResultMessage(acceptablePredictions, cameraResolution, 
                                          cameraTransform));
}
#endif

Everything with a probability lower than 70% is culled, and the rest is sent along to the messenger, where the ObjectLabeler picks it up again and starts shooting for the Spatial Map at the center of each rectangle in the predictions, to find out where the actual object may be in space.
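
To make that 'shooting' a bit more concrete, here is a simplified, hypothetical sketch - the real ObjectLabeler is in the project repo and uses the camera's projection to calculate the ray. GetRayForPixel, _labelPrefab and the "Spatial Mapping" layer name are assumptions for illustration:

// Hypothetical sketch - not the actual ObjectLabeler code
private void ShootForSpatialMap(PredictionModel prediction,
                                Resolution cameraResolution, Transform cameraTransform)
{
    // The bounding box values are normalized (0..1), so scale to pixels first
    var box = prediction.BoundingBox;
    var centerPixel = new Vector2(
        (float)(box.Left + box.Width / 2) * cameraResolution.width,
        (float)(box.Top + box.Height / 2) * cameraResolution.height);

    // Turn that pixel into a world space ray (assumed helper method)
    var ray = GetRayForPixel(centerPixel, cameraResolution, cameraTransform);

    // Shoot at the spatial map; where the ray hits is where the label goes
    RaycastHit hitInfo;
    if (Physics.Raycast(ray, out hitInfo, 10f, LayerMask.GetMask("Spatial Mapping")))
    {
        Instantiate(_labelPrefab, hitInfo.point, Quaternion.identity);
    }
}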

Conclusion

I have had some fun experimenting with this, and the conclusions are clear:

  • For a simple model like this, even with a fast internet connection, using a local model instead of a cloud-based model is way faster
  • Yet the hit rate is notably lower - the cloud model is definitely more 'intelligent'. I suppose improvements to Windows ML will fix that in the near future. Also, the AI coprocessor in the next release of the HoloLens will undoubtedly contribute to both speed and accuracy.
  • With 74 pictures of a few model airplanes, almost all on the same background, my model is nowhere near equipped to recognize random planes in random environments. This highlights the crux of machine learning - you need data, more data, and even more than that.
  • This method of training models in the cloud and executing them locally provides exciting new - and very usable - features for Mixed Reality devices.

Using Windows ML on edge devices is not hard, and on a HoloLens it is only marginally harder, because you have to circumvent a few differences between full UWP and Unity, and be aware of the differences between C# 4.0 and C# 7.0. This can easily be addressed, as I showed before.

The complete project can be found here (branch WinML) - since it now operates without a cloud model, it can actually be run by everyone. I wonder if you can get it to recognize model planes you may have around. I've gotten it to recognize model planes up to about 1.5 meters away.

2 comments:

Unknown said...

Dear Joost,

First of all, thank you very much for your tutorial, which is very comprehensive and easy to work with. I noticed that you mentioned that you could answer questions on our specific tasks. So, here it is: I am working on a project to enable the HoloLens to perform object detection using tiny-yolov3 from darknet. I have found the ONNX model, but the versions available are either 1.3 or 1.5. Would it be compatible with your code?
Also, if I am using a different model, which scripts should I adapt?

Hope to hear from you
bests
Cyprien

Joost van Schaik said...

Hi Cyprien,
Tbh, I am not sure. As you might have seen in a previous blog post http://dotnetbyexample.blogspot.com/2019/01/adapting-custom-vision-object.html, I have used models created by Custom Vision specifically. They come with an accompanying file, ObjectDetection.cs, that basically provides the interface to the model. If you don't have that interface file, you have to write or adapt it yourself. More specifically, you will have to look into this method:
public async Task<IList<PredictionModel>> PredictImageAsync(VideoFrame image)
{
    var imageFeature = ImageFeatureValue.CreateFromVideoFrame(image);
    var bindings = new LearningModelBinding(this.session);
    bindings.Bind("data", imageFeature);
    var result = await this.session.EvaluateAsync(bindings, "");
    return Postprocess(result.Outputs["model_outputs0"] as TensorFloat);
}
Here the model is apparently "bound" to an input parameter "data" and the result comes back as a TensorFloat called "model_outputs0". I have no idea what your model wants and what it outputs.
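
For what it is worth: once a LearningModel is loaded, Windows ML can list its input and output features, which may help you find the names and types to bind. A quick sketch, assuming a loaded model instance:

// List what a loaded LearningModel expects and produces
foreach (var input in model.InputFeatures)
{
    Debug.Log("Input: " + input.Name + " (" + input.Kind + ")");
}
foreach (var output in model.OutputFeatures)
{
    Debug.Log("Output: " + output.Name + " (" + output.Kind + ")");
}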

This is where you have to look. What exactly you should do with it, I have no idea. Without more details about the model I can't help you further than this, I am afraid.