29 January 2019

Labeling Toy Aircraft in 3D space using an ONNX model and Windows ML on a HoloLens

Intro

Back in November I wrote about a POC to recognize and label objects in 3D space, using a Custom Vision Object Recognition project. Back then, as I wrote in my previous post, you could only use this kind of project by uploading the images you wanted analyzed to the model in the cloud. In the meantime, Custom Vision Object Recognition models can be downloaded in various formats - one of them being ONNX, which can be used in Windows ML. And thus, it can be used to run on a HoloLens to do AI-powered object recognition.

Which is exactly what I am going to show you. In essence, the app still does the same as in November, but now it does not use the cloud anymore - the model is trained and created in the cloud, but can be executed on an edge device (in this case a HoloLens).

The main actors

These are basically still the same:

  • CameraCapture watches for an air tap, and takes a picture of where you look
  • ObjectRecognizer receives the picture and feeds it to the 'AI', which is now a local process
  • ObjectLabeler shoots for the spatial map and places labels.

As I said - the app is basically still the same as the previous version, only now it uses a local ONNX file.

Setting up the project

Basically you create a standard empty HoloLens project with the MRTK and configure it as you always do. Be sure to enable Camera capabilities, of course.

Then you simply download the ONNX file from your model. The procedure is described in my previous post. You then need to place the model file (model.onnx) into a folder "StreamingAssets" in the Unity project. This procedure is described in more detail in this post by Sebastian Bovo of the AppConsult team. He uses a different kind of model, but the workflow is exactly the same.

Be sure to adapt the ObjectDetection.cs file as I described in my previous post.

Functional changes to the original project

Like I said, the differences between this project and the online version are for the most part inconsequential. Functionally only one thing changed: instead of the app showing the picture it took prior to calling the (online) model, it now plays a click sound when you air tap to start the recognition process, and then plays either a 'pringg' sound or a buzz sound, indicating the recognition process respectively succeeded (i.e. found at least one toy aircraft) or failed (i.e. did not find any toy aircraft).
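
To give an idea of what that looks like in code, here is a minimal sketch of such audio feedback - my own illustration, not the project's actual code; the AudioSource and clips are assumed to be assigned in the Unity editor:

using UnityEngine;

// Minimal sketch of click / success / failure sound feedback.
public class RecognitionSoundFeedback : MonoBehaviour
{
    [SerializeField]
    private AudioSource _audioSource;

    [SerializeField]
    private AudioClip _clickSound;    // played when the air tap starts the recognition

    [SerializeField]
    private AudioClip _successSound;  // the 'pringg' - at least one toy aircraft found

    [SerializeField]
    private AudioClip _failureSound;  // the buzz - no toy aircraft found

    public void PlayClick()
    {
        _audioSource.PlayOneShot(_clickSound);
    }

    public void PlayResult(bool foundAnything)
    {
        _audioSource.PlayOneShot(foundAnything ? _successSound : _failureSound);
    }
}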

Technical changes to the original project

  • The ObjectDetection file, downloaded from CustomVision.ai and adapted for use in Unity, has been added to the project
  • CustomVisionResult, containing all the JSON serialization code to deal with the online model, has been deleted. The ObjectDetection file contains all the classes we need
  • In all classes I have adapted the namespace from "CustomVison" *cough* to "CustomVision" (sorry, typo ;) ).
  • ObjectDetection uses the root class PredictionModel instead of Prediction, so that has been adapted in all files that use it. The affected classes are:
    • ObjectRecognitionResultMessage
    • ObjectLabeler
    • ObjectRecognizer
    • PredictionExtensions
  • Both CameraCapture and ObjectLabeler have sound properties and play sound on appropriate events
  • ObjectRecognizer has been extensively changed to use the local model. This I will describe in detail below

Object recognition - the Windows ML way

The first part of the ObjectRecognizer initializes the model:

using UnityEngine;
#if UNITY_WSA && !UNITY_EDITOR
using System.Threading.Tasks;
using Windows.Graphics.Imaging;
using Windows.Media;
#endif

public class ObjectRecognizer : MonoBehaviour
{
#if UNITY_WSA && !UNITY_EDITOR
    private ObjectDetection _objectDetection;
#endif

    private bool _isInitialized;

    private void Start()
    {
        Messenger.Instance.AddListener<PhotoCaptureMessage>(
          p=> RecognizeObjects(p.Image, p.CameraResolution, p.CameraTransform));
#if UNITY_WSA && !UNITY_EDITOR
        _objectDetection = new ObjectDetection(new[] {"aircraft"}, 20, 0.5f, 0.3f);
        Debug.Log("Initializing...");
        _objectDetection.Init("ms-appx:///Data/StreamingAssets/model.onnx").ContinueWith(
            p =>
            {
                Debug.Log("Initializing ready");
                _isInitialized = true;
            });
#endif
    }

Notice, here too, the liberal use of preprocessor directives, just like in my previous post. In the Start method we create a model from the ONNX file that's in StreamingAssets, using the Init overload I added to ObjectDetection. Since we can't make the Start method awaitable, the ContinueWith needs to finish the initialization.

As you can see, the arrival of a PhotoCapture message from the CameraCapture behavior fires off RecognizeObjects, just like in the previous app.

public virtual void RecognizeObjects(IList<byte> image, 
                                     Resolution cameraResolution, 
                                     Transform cameraTransform)
{
    if (_isInitialized)
    {
#if UNITY_WSA && !UNITY_EDITOR
        RecognizeObjectsAsync(image, cameraResolution, cameraTransform);
#endif

    }
}

But unlike the previous app, it does not fire off a Unity coroutine, but a private async method:

#if UNITY_WSA && !UNITY_EDITOR
private async Task RecognizeObjectsAsync(IList<byte> image, Resolution cameraResolution, Transform cameraTransform)
{
    using (var stream = new MemoryStream(image.ToArray()))
    {
        var decoder = await BitmapDecoder.CreateAsync(stream.AsRandomAccessStream());
        var sfbmp = await decoder.GetSoftwareBitmapAsync();
        sfbmp = SoftwareBitmap.Convert(sfbmp, BitmapPixelFormat.Bgra8,
            BitmapAlphaMode.Premultiplied);
        var picture = VideoFrame.CreateWithSoftwareBitmap(sfbmp);
        var prediction = await _objectDetection.PredictImageAsync(picture);
        ProcessPredictions(prediction, cameraResolution, cameraTransform);
    }
}
#endif

This method is basically 70% converting the raw bytes of the image into something the ObjectDetection class's PredictImageAsync can handle. I have this post in the Unity forums and this post on the MSDN blog site by my friend Matteo Pagani to thank for piecing this together. This is because I am a stubborn idiot - I want to take a picture instead of using a frame of the video recorder, but then you have to convert the photo to a video frame.

The second-to-last line of code actually calls PredictImageAsync - essentially a black box for the app - and then the predictions are processed more or less like before:

#if UNITY_WSA && !UNITY_EDITOR
private void ProcessPredictions(IList<PredictionModel> predictions, 
                                Resolution cameraResolution, Transform cameraTransform)
{
    var acceptablePredictions = predictions.Where(p => p.Probability >= 0.7).ToList();
    Messenger.Instance.Broadcast(
       new ObjectRecognitionResultMessage(acceptablePredictions, cameraResolution, 
                                          cameraTransform));
}
#endif

Everything with a probability lower than 70% is culled, and the rest is sent along to the messenger, where the ObjectLabeler picks it up again and starts shooting rays at the Spatial Map through the center of each prediction's rectangle to find out where the actual object may be in space.

Conclusion

I have had some fun experimenting with this, and the conclusions are clear:

  • For a simple model like this, even with a fast internet connection, using a local model instead of a cloud-based model is way faster
  • Yet the hit rate is notably lower - the cloud model is definitely more 'intelligent'. I suppose improvements to Windows ML will fix that in the near future. Also, the AI coprocessor in the next release of HoloLens will undoubtedly contribute to both speed and accuracy.
  • With 74 pictures of a few model airplanes, almost all on the same background, my model is nowhere near equipped to recognize random planes in random environments. This highlights a bit the crux of machine learning - you will need data, more data, and even more than that.
  • This method of training models in the cloud and executing them locally provides exciting new - and very usable - features for Mixed Reality devices.

Using Windows ML on edge devices is not hard, and on a HoloLens it is only marginally harder because you have to circumvent a few differences between full UWP and Unity, and be aware of the differences between C# 4.0 and C# 7.0. This can easily be addressed, as I showed before.

The complete project can be found here (branch WinML) - since it now operates without a cloud model it is actually runnable by everyone. I wonder if you can actually get it to recognize model planes you may have around. I've got it to recognize model planes up to about 1.5 meters away.

27 January 2019

Adapting Custom Vision Object Recognition Windows ML code for use in Mixed Reality applications

Intro

In November I wrote about a Custom Vision Object Detection experiment that I did, which allowed the HoloLens I was wearing to recognize not only what objects were in view, but also approximately where they were in space. You might remember this picture:

You might also remember this one:

Apart from being a very cool new project type, it also showed a great limitation. You could only use an online model. You could not download it in the form of, for instance, an ONNX model to use with Windows ML. It worked pretty well, don't get me wrong, but maybe you are out and about and your device can't always reach the online model. Well guess what recently changed:

Yay! Custom Vision Object Detection now supports downloadable models that can be used in Windows ML.

Download model and code

After you have changed the type from "General" to "General (compact)" and saved that change, hit the "Performance" tab and you will see the "Export" option appear (no idea why this is at "Performance", but what the heck):

So if you click that, you get a bit of an unwieldy screen that looks like this:

We are going to select the ONNX standard because that is what we can use in Windows Machine Learning - inside a UWP app running on the HoloLens. Please select version 1.2:

The result is a ZIP file containing the following folders and files:

We are only going to need the model.onnx file (in the next blog post). For now I want to concentrate on the file that is inside the CSharp folder - ObjectDetection.cs. That file is perfectly fine for use in a regular UWP app. However, although they run on top of UWP, HoloLens apps are anything but regular UWP apps.

Challenges in incorporating the C# code in a Unity project

Some interesting challenges lay ahead:

  • Unity for HoloLens has this unusual concept of having two Visual Studio solutions: one for use in the Unity editor, and a second one that is generated from the first. But the first one, the Unity solution, needs to be able to swallow all the code, even if it's UWP-only and will never run in the editor. To make that possible, we will have to put some stuff into preprocessor directives to be able to generate the deployment project at all
  • The code uses C# 7.0 concepts - tuples - that are not supported by the C# version (4.0) available in all but the newest versions of Unity, which I am not using here for various reasons
  • I also found a pretty subtle bug in the code that only happens in a Unity runtime

I will address all three things.

Testing in a bare bones project - here come the errors

So, I created an empty HoloLens project basically doing nothing: just imported the Mixed Reality Toolkit and hit all three configuration options in the Mixed Reality Toolkit/config menu. Then I added the ObjectDetection.cs to the project and immediately Unity started to balk:

Round 1 - preprocessor directives

The first round of fixing is pretty simple - just put everything the editor balks about between preprocessor directives:

#if !UNITY_EDITOR 
#endif

You can do this the rough way - by basically putting the whole file in these directives - or only put the minimum stuff in directives. I usually opt for the second way. So we need to put the following parts between these preprocessor directives.

First, this part in the using section of the start of the file:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
#if !UNITY_EDITOR
    using System.Threading.Tasks;
    using Windows.AI.MachineLearning;
    using Windows.Media;
    using Windows.Storage;
#endif

Then this part, at the start of the ObjectDetection class:

    public class ObjectDetection
    {
        private static readonly float[] Anchors = ....

        private readonly IList<string> labels;
        private readonly int maxDetections;
        private readonly float probabilityThreshold;
        private readonly float iouThreshold;
#if !UNITY_EDITOR
        private LearningModel model;
        private LearningModelSession session;
#endif

Then the following methods need to be put entirely between these preprocessor directives:

  • Init
  • ExtractBoxes
  • Postprocess

And then both Unity and Visual Studio stop complaining about errors. So let's build the UWP solution...

Oops. Well I already warned you about this.

Round 2 - Tuples are a no-no

Although the very newest versions of Unity support C# 7.0, the majority of the versions that are used today for various reasons (mainly hologram stability) do not. But the code generated by Custom Vision has some tuples in it. The culprit is ExtractBoxes:

private (IList<BoundingBox>, IList<float[]>) ExtractBoxes(TensorFloat predictionOutput,
    float[] anchors)

So we need to refactor this to C# 4 style code. Fortunately, this is not quite rocket science.

First of all, we define a class with the same properties as the tuple:

internal class ExtractedBoxes
{
    public IList<BoundingBox> Boxes { get; private set; }
    public IList<float[]> Probabilities { get; private set; }

    public ExtractedBoxes(IList<BoundingBox> boxes, IList<float[]> probs)
    {
        Boxes = boxes;
        Probabilities = probs;
    }
}

I have added this to the ObjectDetection.cs file, just behind the end of the ObjectDetection class definition. Then we only need to change the return type of the method ExtractBoxes from

private (IList<BoundingBox>, IList<float[]>) ExtractBoxes(TensorFloat predictionOutput, float[] anchors)

to

private ExtractedBoxes ExtractBoxes(TensorFloat predictionOutput, float[] anchors)

and have its final return statement return the new class:

return new ExtractedBoxes(boxes, probs);

We also have to change the method Postprocess, the place where ExtractBoxes is used:

private IList<PredictionModel> Postprocess(TensorFloat predictionOutputs)
{
    var (boxes, probs) = this.ExtractBoxes(predictionOutputs, ObjectDetection.Anchors);
    return this.SuppressNonMaximum(boxes, probs);
}

needs to become

private IList<PredictionModel> Postprocess(TensorFloat predictionOutputs)
{
    var extractedBoxes = this.ExtractBoxes(predictionOutputs, ObjectDetection.Anchors);
    return this.SuppressNonMaximum(extractedBoxes.Boxes, extractedBoxes.Probabilities);
}

and then, dear reader, Unity will finally build the deployment UWP solution. But there is still more to do.

Round 3 - fix a weird crashing bug

When I tried this in my app - and you will have to take my word for it - my app randomly crashed. The culprit, after long debugging, turned out to be this line:

private IList<PredictionModel> SuppressNonMaximum(IList<BoundingBox> boxes, 
    IList<float[]> probs)
{
    var predictions = new List<PredictionModel>();
    var maxProbs = probs.Select(x => x.Max()).ToArray();
    while (predictions.Count < this.maxDetections)
    {
        var max = maxProbs.Max();

I know, it doesn't make sense. I have not checked this in plain UWP, but apparently the implementation of Max() in the Unity player on top of UWP doesn't like to calculate the Max of an empty list. My app worked fine as long as there were recognizable objects in view. If there were none, it crashed. So, I changed that piece to check for probs not being empty first:

private IList<PredictionModel> SuppressNonMaximum(IList<BoundingBox> boxes, IList<float[]> probs)
{
    var predictions = new List<PredictionModel>();
    // Added JvS
    if (probs.Any())
    {
        var maxProbs = probs.Select(x => x.Max()).ToArray();

        while (predictions.Count < this.maxDetections)
        {
            var max = maxProbs.Max();

And then your app will still be running when there are no predictions.

Round 4 - some minor fit & finish

Because I am lazy and it makes life easier when using this from a Unity app, I added this little overload of the Init method:

public async Task Init(string fileName)
{
    var file = await StorageFile.GetFileFromApplicationUriAsync(new Uri(fileName));
    await Init(file);
}

This will need to be in an #if !UNITY_EDITOR preprocessor directive as well. This overload allows me to call the method like this without first getting a StorageFile:

_objectDetection.Init("ms-appx:///Data/StreamingAssets/model.onnx");
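
Wrapped in that directive, the overload ends up like this (the same method as above, with just the guard added):

#if !UNITY_EDITOR
public async Task Init(string fileName)
{
    var file = await StorageFile.GetFileFromApplicationUriAsync(new Uri(fileName));
    await Init(file);
}
#endif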

Conclusion

With these adaptations you have a C# file that will allow you to use Windows ML from both Unity and regular UWP apps. In a following blog post I will show a refactored version of the Toy Aircraft Finder to demonstrate how things work IRL.

There is no real demo project this time (yet) but if you want to download the finished file already, you can do so here.

17 January 2019

Making lines selectable in your HoloLens or Windows Mixed Reality application

Intro

Using the Mixed Reality Toolkit, it's so easy to make an object selectable. You just add a behavior to your object that implements IInputClickHandler, fill in some code in the OnInputClicked method, and you are done. Consider for instance this rather naïve implementation of a behavior that toggles the color from the original to red and back when clicked:

using HoloToolkit.Unity.InputModule;
using UnityEngine;

public class ColorToggler : MonoBehaviour, IInputClickHandler
{
    [SerializeField]
    private Color _toggleColor = Color.red;

    private Color _originalColor;

    private Material _material;
	void Start ()
	{
	    _material = GetComponent<Renderer>().material;
        _originalColor = _material.color;
	}
	
    public void OnInputClicked(InputClickedEventData eventData)
    {
        _material.color = _material.color == _originalColor ? _toggleColor : _originalColor;
    }
}

If you add this behavior to, for instance, a simple Cube, the color will flip from whatever the original color was (in my case blue) to red and back when you tap it. But add this behavior to a line and attempt to tap it - and nothing will happen.

So what's a line, then?

In Unity, a line is basically an empty game object containing a LineRenderer component. You can access the LineRenderer using the standard GetComponent, then use its SetPosition method to actually set the points. You can see how it's done in the demo project, in which I created a class LineController to make drawing the line a bit easier:

public class LineController : MonoBehaviour
{
    public void SetPoints(Vector3[] points)
    {
        var lineRenderer = GetComponent<LineRenderer>();
        lineRenderer.positionCount = points.Length;
        for (var i = 0; i < points.Length; i++)
        {
            lineRenderer.SetPosition(i, points[i]);
        }
        //Stuff omitted
    }
}
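
For context, here is how a caller might use this - a hypothetical sketch of my own, not code from the demo project, where _linePrefab is an assumed reference to the "Line" prefab:

using UnityEngine;

// Hypothetical usage sketch: instantiate the Line prefab and feed its
// LineController two points to draw a single line segment.
public class LineDrawingExample : MonoBehaviour
{
    [SerializeField]
    private GameObject _linePrefab; // assumed to be set to the "Line" prefab in the editor

    private void Start()
    {
        var line = Instantiate(_linePrefab);
        line.GetComponent<LineController>().SetPoints(new[]
        {
            new Vector3(0f, 0f, 1f),
            new Vector3(0.5f, 0f, 1.5f)
        });
    }
}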

This is embedded in a prefab "Line". Here you might see the root cause of the problem. The difference between a line and, for instance, a cube is simple: there is no mesh, but more importantly - there is no collider. Compare this with the cube next to it:

So... how do we add a collider, then?

That is not very hard. Find the prefab "Line", and add a "Line Collider Drawer" component. This is sitting in "HoloToolkitExtensions/Utilities/Scripts".

Once you have done that, try to click the line again.

And hey presto - the line is not only selectable, but even the MRTK Standard Shader Hover Light option, which I selected when creating the line material, actually works.

And in code, it works like this:

First of all, in the LineController, I wrote "//Stuff omitted". That stuff actually calls the LineColliderDrawer (or at least, it tries to):

public class LineController : MonoBehaviour
{
    public void SetPoints(Vector3[] points)
    {
        var lineRenderer = GetComponent<LineRenderer>();
        lineRenderer.positionCount = points.Length;
        for (var i = 0; i < points.Length; i++)
        {
            lineRenderer.SetPosition(i, points[i]);
        }
        
        var colliderDrawer = GetComponent<LineColliderDrawer>();
        if (colliderDrawer != null)
        {
            colliderDrawer.AddColliderToLine(lineRenderer);
        }
    }
}

The main part of LineColliderDrawer is this method:

private void AddColliderToLine(LineRenderer lineRenderer, 
    Vector3 startPoint, Vector3 endPoint)
{
    var lineCollider = new GameObject(LineColliderName).AddComponent<CapsuleCollider>();
    lineCollider.transform.parent = lineRenderer.transform;
    lineCollider.radius = lineRenderer.endWidth;
    var midPoint = (startPoint + endPoint) / 2f;
    lineCollider.transform.position = midPoint;

    lineCollider.transform.LookAt(endPoint);
    var rotationEulerAngles = lineCollider.transform.rotation.eulerAngles;
    lineCollider.transform.rotation =
        Quaternion.Euler(rotationEulerAngles.x + 90f, 
        rotationEulerAngles.y, rotationEulerAngles.z);

    lineCollider.height = Vector3.Distance(startPoint, endPoint);
}

This is partially inspired by this post in the Unity forums, and partially by this one. Although I think neither is entirely correct, they certainly put me on the right track.

Basically it creates an empty game object and adds a capsule collider to that. The collider's radius is set to the end width of the line, which is assumed to be of constant width. Its midpoint is set exactly halfway along the line (segment) and it is then rotated to look at the end point. Oddly enough, it is then at 90 degrees to the actual line segment, so the collider is rotated 90 degrees around its X axis. Finally, it is stretched to cover the whole line segment.

The rest of the class is basically a support act:

public class LineColliderDrawer : MonoBehaviour
{
    private const string LineColliderName = "LineCollider";

    public void AddColliderToLine(LineRenderer lineRenderer)
    {
        RemoveExistingColliders(lineRenderer);

        for (var p = 0; p < lineRenderer.positionCount; p++)
        {
            if (p < lineRenderer.positionCount - 1)
            {
                AddColliderToLine(lineRenderer, 
                    lineRenderer.GetPosition(p), 
                    lineRenderer.GetPosition(p + 1));
            }
        }
    }

    private void RemoveExistingColliders(LineRenderer lineRenderer)
    {
        for (var i = lineRenderer.gameObject.transform.childCount - 1; i >= 0; i--)
        {
            var child = lineRenderer.gameObject.transform.GetChild(i);
            if (child.name == LineColliderName)
            {
                Destroy(child.gameObject);
            }
        }
    }
  }

This first removes any existing colliders, then adds colliders to the line for every segment - so a line of n points gets n-1 colliders.

Concluding words

And that's basically it. Now lines can be selected as well. Thanks to both Unity forum posters who gave me two half-way parts that allowed me to combine this into one working solution.

19 December 2018

Improving Azure Custom Vision Object Recognition by using and correcting the prediction pictures

Intro

Last month I wrote about integrating Azure Custom Vision Object Recognition with HoloLens to recognize and label objects in 3D space. I wrote that the prediction went pretty well, although I used only 35 pictures. I also wrote that the process of taking, uploading and labeling pictures is quite tedious.

Improving on the go

It turns out Custom Vision retained all the pictures I uploaded in the course of testing. So every time I used my HoloLens and asked Custom Vision to locate toy aircraft, it stored the picture in the cloud together with its prediction. And the fun thing is, you can use those pictures to actually improve your model again.

After some playing around with my model (for the previous blog post about this subject), I clicked the Predictions tab and found about 30 pictures - one for every time I used the model from my HoloLens. I could use those to improve my model. After that, I did some more testing using the HoloLens to show you how it's done. So, I clicked the Predictions tab again and there were a couple more pictures:

image

If we select the first picture, we see this:

image

The model has already annotated, in red, the areas where it thinks there is an airplane. Interestingly, the model is now a lot better than it originally was (when it only featured my pre-loaded images), as it now recognizes the DC-3 Dakota on top - which it has never seen before - as an airplane! And even the X-15 (the black thing on the left) is recognized. Although the X-15 had a few entries in the training images, it barely looks like an airplane (for all intents and purposes it was more a spaceship with wings to facilitate a landing).

I digress. You need to click every area you want to confirm:

image

And when you are done, and all relevant areas are white:

image

Simply click the X top right. The image will now disappear from the "Predictions" list and end up in the "Training images" list.

Some interesting things to note

The model really improved from adding the new images. Not only did it recognize the DC-3 'Dakota' that had not been in the training images, but also this Tiger Moth model (the bright yellow one) that it had never seen before:

image

Also, it stopped recognizing or doubting things like the HoloLens pouch that's lying there, and my headphones and hand were also recognized as 'definitely not an airplane'.

image

Yet, I also learned it's dangerous to use the same background over and over again. Apparently the model starts to rely on that. If I put the Tiger Moth on a dark blue desk chair instead of a light blue bed cover:

image

Yes... the model is quite confident there is an airplane in the picture, but it's not very good at pinpointing it.

image

And as far as the Curtiss P-40 'Kittyhawk' goes - even though it has been featured extensively in both the original training pictures and the ones I added from the Predictions, this is no success either. The model is better at pinpointing the aircraft, but considerably less sure it is an aircraft. And the outer box, which includes the chair, gives a 30.5%. So it looks like, to make this model even more reliable, I still need more pictures - but then on other backgrounds, with more varied lighting, etc.

Conclusion

You don't have to take very many pictures up front to incrementally improve a Custom Vision Object Recognition model - you can just iterate on its predictions and improve them. It feels a bit like teaching a toddler how to build something from Legos - you first show the principle, then let them muck around, and every time things go wrong, you show how it should have been done. Gradually they get the message. Or at least, that's what you hope. ;)

No (new) code this time, as the code from last time is unchanged.

Disclaimer - I have no idea how many prediction pictures are stored and for how long - I can imagine not indefinitely, and not an unlimited amount. But I can't attach numbers to that.

08 December 2018

Mixed Reality Toolkit vNext–dependency injection with extension services

Intro

The Mixed Reality Toolkit vNext comes with an awesome mechanism for dependency injection. This also takes away a major pain point – all kinds of behaviors that are singletons and are called from everywhere, leading to all kinds of interesting timing issues - and tightly coupled classes. This all ends with extension services, which piggyback on the plugin structure of the MRTK-vNext. In this post I will describe how you make, configure and use such an extension service.

Creating an extension service

A service that can be used by the extension service framework (and be found by the inspector dropdown that I will show later) needs to implement IMixedRealityExtensionService at the very least. But of course we want to have the service do something useful, so I made a child interface:

using Microsoft.MixedReality.Toolkit.Core.Interfaces;

namespace Assets.App.Scripts
{
    public interface ITestDataService : IMixedRealityExtensionService
    {
        string GetTestData();
    }
}

The method GetTestData is the method we want to use.

Any class implementing IMixedRealityExtensionService needs to implement six methods and two properties. And to be usable by the framework, it needs to have this constructor:

<ClassName>(string name, uint priority)

To make this a little simpler, the MRTK-vNext contains a base class BaseExtensionService that provides a default implementation for all the required stuff. And thus we can make a TestDataService very simple, as the base class a) implements all the required properties and b) forces us to provide the necessary constructor:

using Microsoft.MixedReality.Toolkit.Core.Services;
using UnityEngine;

namespace Assets.App.Scripts
{
    public class TestDataService : BaseExtensionService, ITestDataService
    {
        public TestDataService(string name, uint priority) : base(name, priority)
        {
        }

        public string GetTestData()
        {
            Debug.Log("GetTestData called");
            return "Hello";
        }
    }
}

Registering the service in the framework

Check if a custom profile has been selected. Assuming you have followed the procedure I described in my previous post, you can do this by selecting the MixedRealityToolkit game object in your scene and then double-clicking the “Active Profile” field:

image

If the UI is read-only, there's no active custom profile. Check if there's a profile in MixedRealityToolkit-Generated/CustomProfiles and drag that on top of the Active Profile field of the MixedRealityToolkit object. If there's no custom profile at all, please refer to my previous blog post.

Scroll all the way down to Additional Service Providers.

image

Click the </> button. This creates a MixedRealityRegisteredServiceProvidersProfile in
MixedRealityToolkit-Generated/CustomProfiles and shows this editor.

image

Click "+ Register a new Service Provider". This results in a "New Configuration 8" that, if you expand it, looks like this:

image

If you click the "Component Type" dropdown you should be able to select "Assets.App.Scripts" and then "TestDataService".

image

I also tend to give this component a bit more understandable name so the final result looks like this:

image

Calling the service from code

A very simple piece of code shows how you can then retrieve and use the service from the MixedRealityToolkit:

using Microsoft.MixedReality.Toolkit.Core.Services;
using UnityEngine;

namespace Assets.App.Scripts
{
    public class TestCaller : MonoBehaviour
    {
        private void Start()
        {
            var service  = MixedRealityToolkit.Instance.GetService<ITestDataService>();
            Debug.Log("Service returned " + service.GetTestData());
        }
    }
}

Notice I can retrieve the implementation using my own interface type. This is very similar to what we are used to in ‘normal’ IoC containers like Unity (the other one), AutoFac, or SimpleIoC. If you attach this behaviour to any game object in the hierarchy (I created an empty object "Managers" for this purpose) and run this project, you will simply see this in the console:

image

It’s not spectacular, but it proves the point that this is working as expected.

Conclusion

MRTK-vNext provides a very neat visual selection mechanism for wiring up dependency injection that is very easy to use. I can also easily retrieve implementations of the service using an interface, just like with any other IoC platform. The usage of profiles makes it very flexible and reusable. This alone makes it a great framework, and I have not even looked into the cross-platform stuff yet. That I will do soon. Stay tuned.

In the meantime, the demo project can be found here.

07 December 2018

Mixed Reality Toolkit vNext–setting up a project

Intro

You might have heard it – the Mixed Reality Toolkit folks are busy with a major rewrite. The original MRTK was intended to accelerate applications targeted toward HoloLens and (later) Windows Mixed Reality immersive headsets. The new version “aims to further extend the capabilities of the toolkit and also introduce new features, including the capability to support more VR/AR/XR platforms beyond Microsoft's own Mixed Reality setup”.

I’ve been quite busy but finally found some time to play with it. And while I am doing that, I am going to shoot off some smaller and bigger blog posts about things I learned – both for myself to remember and for you to enjoy.

Be advised: it’s still heavy in development. The first beta release explicitly states:

This is a pre-release version and is not feature complete.

  • This release is not intended for production use.
  • This release contains many breaking changes from previous HoloToolkit and Mixed Reality Toolkit vNext releases.

So let’s dive in. In this first blog post on this subject, I will simply describe how to set up an empty project.

Cloning the latest MRTK-vNext from GitHub

This is pretty standard procedure. I prefer to use TortoiseGit for this, as I seem to be of the dying breed that’s not particularly fond of command lines. The repo is here (as it has been for a while).

After you have cloned the project, check out branch mrtk_development. This is the bleeding edge. This is where the most things happen (and the most things break ;) ).

Creating a new Unity project

Also fairly standard. You will need Unity 2018.2.18f1 for that. Or, by the time you read this, probably an even newer version. After you have created the project, close Unity again.

Adding the MRTK to your project

From the MRTK repo, copy the following folders and files to the Assets folder of your project:

  • MixedRealityToolkit
  • MixedRealityToolkit-SDK
  • MixedRealityToolkit.meta
  • MixedRealityToolkit-SDK.meta

Configuring the MRTK-vNext components in your scene

Configuring the MRTK has become a whole lot easier.

  • Open an existing scene or create one.
  • Click Mixed Reality Toolkit/Configure
  • You will see two game objects appear in your scene
    image

Now Unity starts to complain about there being no camera tagged as MainCamera. This can easily be fixed by tagging it manually:

image

Basically you now have an empty project.

Preparing a custom configuration

If you click the MixedRealityToolkit game object, and then the DefaultMixedRealityToolkitConfigurationProfile field:

image

you will see this appear in the inspector:

image

If you then click the "Copy & Customize" button, it will create a new folder MixedRealityToolkit-Generated, and in that a CustomProfiles folder. And in that, a MixedRealityToolkitConfigurationProfile:

image

Make sure to check that the new profile is actually applied:

image

You will now see that the settings are no longer greyed out, and you can change and swap out components.

image

The fun thing is, these are no longer all MonoBehaviours and (most of all) no longer singletons. The MixedRealityToolkit class is the only ‘singleton’ left. The MixedRealityToolkitConfigurationProfile is a so-called ‘scriptable object’ that can hold the configuration of the whole MRTK. But the MixedRealityToolkitConfigurationProfile is more or less a hat stand for all kinds of other partial configurations, all of which will end up in a profile as well.
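
For readers less familiar with Unity, a scriptable object is roughly this kind of thing - a generic Unity example of my own making, not an actual MRTK class:

using UnityEngine;

// Generic illustration of a Unity ScriptableObject: a data asset that lives in
// the project rather than in a scene, and that components can reference - which
// is how the MRTK profiles are used.
[CreateAssetMenu(fileName = "ExampleProfile", menuName = "Example/Example Profile")]
public class ExampleProfile : ScriptableObject
{
    public string SomeSetting;
    public float SomeValue = 1.0f;
}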

Concluding words

We have taken the first (very small) baby steps into configuring the MRTK-vNext. We actually did not write any code and therefore, unlike most of my other posts, this does not come with a sample project. The next one will, though.

24 November 2018

Using Azure Custom Vision Object Recognition and HoloLens to identify and label objects in 3D space

Intro

HoloLens is cool, Machine Learning is cool - what's more fun than combining these two great technologies? Very recently you could read "Back to the future now: Execute your Azure trained Machine Learning models on HoloLens!" on the AppConsult blog, and as early as last May my good friend Matteo Pagani wrote on the same blog about his very first experiments with Windows ML - as the technology to run machine learning models on your Windows ('edge') devices is called. Both of those blog posts use an Image Classification algorithm, which basically tells you whether or not an object is in the image, and what the confidence level of this recognition is.

And then this happened:

image

"Object Detection finds the location of content within an image" is the definition that pops up if you hover your mouse over the (i) symbol behind "Project Types". So not only do you get a hit and a confidence level, but also the location in the image where the object is.

Now things are getting interesting. I wondered if I could use this technique to detect objects in the picture and then use the HoloLens' depth camera to actually guesstimate where those objects were in 3D space.

The short answer: yes. It works surprisingly well.

20181114_131716_HoloLens

The global idea

  • User air taps to initiate the process
  • The HoloLens takes a quick picture and uploads the picture to the Custom Vision API
  • HoloLens gets the recognized areas back
  • Calculates the center of each area with a confidence level of at least 0.7
  • 'Projects' these centers on a plane 1 m wide and 0.56 m high that's 1 meter in front of the Camera (i.e. the user's viewpoint)
  • 'Shoots' rays from the Camera through the projected center points and checks if and where they strike the Spatial Map
  • Places labels on the detected points (if any).

Part 1: creating and training the model

Matteo already wrote about how simple it actually is to create an empty model in CustomVision.ai, so I skip that part. Inspired by his article I wanted to recognize airplanes as well, but I opted for model airplanes - much easier to test with than actual airplanes. So I dusted off all the plastic airplane models I had built during my late teens - this was a thing shy adolescent geeks like me sometimes did, back in the Jurassic when I grew up ;) - it helped that we did not have to spend 4 hours per day on social media ;). But I digress. I took a bunch of pictures of them:

image

And then, picture by picture, I had to mark and label the areas which contain the desired objects. This is what is different from training a model for 'mere' object classification: you have to mark every occurrence of your desired object.

image

This is very easy to do - it's a bit boring and repetitive, but learning stuff takes sacrifices, and in the end I had quite an OK model. You train it just the same way as Matteo already wrote about - by hitting the big green 'Train' button that's kind of hard to miss on the top right.

When you are done, you will need two things:

  • The Prediction URL
  • The Prediction key.

You can get those by clicking the "Performance" tab on top:

image

Then click the "Prediction URL" tab

image

And this will make this popup appear with the necessary information

image

Part 2: Building the HoloLens app to use the model

Overview

The app is basically using three main components:

  • CameraCapture
  • ObjectRecognizer
  • ObjectLabeler

They sit in the Managers object and communicate using the Messenger that I wrote about earlier.

Part 2a: CameraCapture gets a picture - when you air tap

image

It's not exactly clear who originally came up with a saying like "great artists steal", but although I don't claim any greatness, I do steal. CameraCapture is a slightly adapted version of the code in this article in the Unity documentation. There are only a few changes. The original always captures the image in the "BGRA32" format, as this can be used as a texture on a plane or quad. Unfortunately that is not a format Custom Vision accepts. The app does show the picture it takes before the user's eye if the DebugPane property is set to a game object (in the demo project it is). Should you not desire this, simply clear the "Debug Pane" field in the "Camera Capture" script in the Unity editor.



So what you basically see is that CameraCapture takes a picture in a format based upon whether or not the DebugPane is present:

 pixelFormat = _debugPane != null ? CapturePixelFormat.BGRA32 : CapturePixelFormat.JPEG

and then either directly copies the captured (JPEG) photo into the photoBuffer, or shows it on the DebugPane as BGRA32 and converts it to JPEG from there:

void OnCapturedPhotoToMemory(PhotoCapture.PhotoCaptureResult result, 
    PhotoCaptureFrame photoCaptureFrame)
{
    var photoBuffer = new List<byte>();
    if (photoCaptureFrame.pixelFormat == CapturePixelFormat.JPEG)
    {
        photoCaptureFrame.CopyRawImageDataIntoBuffer(photoBuffer);
    }
    else
    {
        photoBuffer = ConvertAndShowOnDebugPane(photoCaptureFrame);
    }
    Messenger.Instance.Broadcast(
        new PhotoCaptureMessage(photoBuffer, _cameraResolution, CopyCameraTransForm()));

    // Deactivate our camera
    _photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
}

The display and conversion is done this way:

private List<byte> ConvertAndShowOnDebugPane(PhotoCaptureFrame photoCaptureFrame)
{
    var targetTexture = new Texture2D(_cameraResolution.width, 
      _cameraResolution.height);
    photoCaptureFrame.UploadImageDataToTexture(targetTexture);
    Destroy(_debugPane.GetComponent<Renderer>().material.mainTexture);

    _debugPane.GetComponent<Renderer>().material.mainTexture = targetTexture;
    _debugPane.transform.parent.gameObject.SetActive(true);
    return new List<byte>(targetTexture.EncodeToJPG());
}

It creates a texture, uploads the buffer into it, destroys the current texture and sets the new texture. Then the game object is actually displayed, and the texture is used to convert the image to JPEG.

Either way, the result is a JPEG, and the buffer contents are sent in a message, together with the camera resolution and a copy of the Camera's transform. The resolution we need to calculate the height/width ratio of the picture, and the transform we need to retain because between the picture being taken and the result coming back, the user may have moved. Now you can't just send the Camera's transform itself, because it changes when the user moves. So you have to send a 'copy', which is made by this rather crude method, using a temporary empty game object:

private Transform CopyCameraTransForm()
{
    var g = new GameObject();
    g.transform.position = CameraCache.Main.transform.position;
    g.transform.rotation = CameraCache.Main.transform.rotation;
    g.transform.localScale = CameraCache.Main.transform.localScale;
    return g.transform;
}

Part 2b: ObjectRecognizer sends it to CustomVision.ai and reads results

The ObjectRecognizer is, apart from some song and dance to pick the message apart and start a Coroutine, a fairly simple matter. This part does all the work:

private IEnumerator RecognizeObjectsInternal(IEnumerable<byte> image, 
    Resolution cameraResolution, Transform cameraTransform)
{
    var request = UnityWebRequest.Post(_liveDataUrl, string.Empty);
    request.SetRequestHeader("Prediction-Key", _predictionKey);
    request.SetRequestHeader("Content-Type", "application/octet-stream");
    request.uploadHandler = new UploadHandlerRaw(image.ToArray());
    yield return request.SendWebRequest();
    var text = request.downloadHandler.text;
    var result = JsonConvert.DeserializeObject<CustomVisionResult>(text);
    if (result != null)
    {
        result.Predictions.RemoveAll(p => p.Probability < 0.7);
        Debug.Log("#Predictions = " + result.Predictions.Count);
        Messenger.Instance.Broadcast(
            new ObjectRecognitionResultMessage(result.Predictions, 
            cameraResolution, cameraTransform));
    }
    else
    {
        Debug.Log("Predictions is null");
    }
}

You will need to set the _liveDataUrl and _predictionKey values via the editor, as you could see in the image just below the Part 2a header. This behaviour creates a web request to the prediction URL, adds the prediction key as a header, and sets the right content type. The body content is set to the binary image data using an UploadHandlerRaw. And then the request is sent to CustomVision.ai. The result is then deserialized into a CustomVisionResult object, all the predictions with a probability lower than the 0.7 threshold are removed, and the predictions are put back into a message, to be sent to the ObjectLabeler, together once again with the camera's resolution and transform.

A little note: the CustomVisionResult class, together with all the classes it uses, is in the CustomVisionResult.cs file in the demo project. This code was generated by first executing the SendWebRequest and then copying the raw output of "request.downloadHandler.text" into QuickType. It's an ideal site to quickly make classes for JSON serialization.

Interesting to note here is that Custom Vision returns bounding boxes by giving top, left, width and height - in values that are always between 0 and 1. So (0,0) is all the way at the top/left of the picture, and (1,1) is at the bottom right of the picture - regardless of the height/width ratio of your picture. So if your picture is not square (and most cameras don't create square pictures), you need to know the actual width and height of your picture - that way, you can calculate which pixel coordinates actually correspond to the numbers Custom Vision returns. And that's exactly what the next step does.
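
To make that concrete, here is a small sketch - my own illustration, not code from the project - converting such a normalized bounding box into pixel coordinates, using the Prediction class from the generated serialization code and the camera resolution:

using UnityEngine;

public static class BoundingBoxPixelExtensions
{
    // Converts the normalized (0..1) bounding box of a Custom Vision prediction
    // into pixel coordinates for a photo of the given resolution. The method name
    // and placement are hypothetical; the app itself projects the normalized values
    // onto a virtual plane instead, as shown in the next part.
    public static Rect ToPixelRect(this Prediction p, Resolution cameraResolution)
    {
        return new Rect(
            (float)(p.BoundingBox.Left * cameraResolution.width),
            (float)(p.BoundingBox.Top * cameraResolution.height),
            (float)(p.BoundingBox.Width * cameraResolution.width),
            (float)(p.BoundingBox.Height * cameraResolution.height));
    }
}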

Part 2c: ObjectLabeler shoots for the Spatial Map and places labels

The ObjectLabeler contains pretty little code as well, although the calculations may need a bit of explanation. The central piece of code is this method:

public virtual void LabelObjects(IList<Prediction> predictions, 
    Resolution cameraResolution, Transform cameraTransform)
{
    ClearLabels();
    var heightFactor = cameraResolution.height / cameraResolution.width;
    var topCorner = cameraTransform.position + cameraTransform.forward -
                    cameraTransform.right / 2f +
                    cameraTransform.up * heightFactor / 2f;
    foreach (var prediction in predictions)
    {
        var center = prediction.GetCenter();
        var recognizedPos = topCorner + cameraTransform.right * center.x -
                            cameraTransform.up * center.y * heightFactor;

        var labelPos = DoRaycastOnSpatialMap(cameraTransform, recognizedPos);
        if (labelPos != null)
        {
            _createdObjects.Add(CreateLabel(_labelText, labelPos.Value));
        }
    }

    if (_debugObject != null)
    {
         _debugObject.SetActive(false);
    }

    Destroy(cameraTransform.gameObject);
}

First, we clear any labels that might have been created in a previous run. Then we calculate the height/width ratio of the picture (this is 2048x1152, so heightFactor will always be 0.5625, but why hard-code something that can be calculated). Then comes the first interesting part. Remember that I wrote we are projecting the picture on a plane 1 meter in front of the user. We do this because the picture then looks pretty much life-sized. So we need to go forward 1 meter from the camera position:

cameraTransform.position + cameraTransform.forward.normalized

But then we end up in the center of the plane. We need to get to the top left corner as a starting point. So we go half a meter to the left (actually, -1 * right, which amounts to left), then half the height factor up.

cameraTransform.up * heightFactor / 2f

In an image, like this:

image

Once we are there, we calculate the center of the prediction using a very simple extension method:

public static Vector2 GetCenter(this Prediction p)
{
    return new Vector2((float) (p.BoundingBox.Left + (0.5 * p.BoundingBox.Width)),
        (float) (p.BoundingBox.Top + (0.5 * p.BoundingBox.Height)));
}

To find the actual location on the image, we basically use the same trick again in reverse: first move to the right by the amount that x is from the top corner

var recognizedPos = topCorner + cameraTransform.right * center.x

And then a bit down again (actually, -up) using the y value scaled for height.

-cameraTransform.up * center.y * heightFactor;

Then we simply do a ray cast to the spatial map from the camera position through the location we calculated, basically shooting 'through' the picture for the real object.

private Vector3? DoRaycastOnSpatialMap(Transform cameraTransform, 
                                       Vector3 recognitionCenterPos)
{
    RaycastHit hitInfo;

    if (SpatialMappingManager.Instance != null && 
        Physics.Raycast(cameraTransform.position, 
                       (recognitionCenterPos - cameraTransform.position), 
            out hitInfo, 10, SpatialMappingManager.Instance.LayerMask))
    {
        return hitInfo.point;
    }
    return null;
}

and create the label at the right spot. I copied the code for creating the label from two posts ago, so I will skip repeating that here.

There is a little bit I want to repeat here:

if (_debugObject != null)
{
     _debugObject.SetActive(false);
}

Destroy(cameraTransform.gameObject);

If the debug object is set (that is to say, the plane showing the photo HoloLens takes to upload), it will be turned off here, otherwise it obscures the actual labels. But more important is the last line: I created the copy of the camera's transform using a temporary game object. As the user keeps on shooting pictures, those would add up and clutter the scene. So after the work is done, I clean it up.

And the result...

The annoying thing is, as always, I can't show you a video of the whole process, as any video recording stops as soon as the app takes a picture. So the only thing I can show you is this kind of doctored video - I restarted the video immediately after taking the picture, but I miss the part where the actual picture is floating in front of the user. This is how it looks, though, if you disable the debug pane in the Camera Capture script:

Lessons learned

  • There is a reason why Microsoft says you need at least 50 pictures for somewhat reliable recognition. I took about 35 pictures of about 10 different models of airplanes. I think I should have taken more like 500 pictures (50 of every type of model airplane) and then things would have gone a lot better. Nevertheless, it already works pretty well
  • If the camera you use is pretty so-so (exhibit A: the HoloLens built-in video camera), it does not exactly help if your training pictures are made with a high-end DSLR, which shoots in great detail, handles adverse lighting conditions superbly, and never, ever produces a blurry picture.

Conclusion

Three simple objects to call a remote Custom Vision Object Recognition Machine Learning model and translate its result into a 3D label. Basically a Vuforia-like application, but then using 'artificial intelligence'. I love the way Microsoft is taking the very thing it really excels in - democratizing and commoditizing complex technologies into usable tools - to the Machine Learning space.

The app I made is quite primitive, and it also has a noticeable 'thinking moment' - since the model lives in the cloud and has to be accessed via an HTTP call. This is because the model is not a 'compact' model, therefore it's not downloadable and it can't run on Windows ML. We will see what the future has in store for these kinds of models. But the app shows what's possible with these kinds of technologies, and it makes the prospect of a next version of HoloLens having an AI coprocessor all the more exciting!

Demo project - without the model, unfortunately - can be downloaded here.