24 November 2018

Using Azure Custom Vision Object Recognition and HoloLens to identify and label objects in 3D space


HoloLens is cool, Machine Learning is cool, so what's more fun than combining these two great technologies? Very recently you could read "Back to the future now: Execute your Azure trained Machine Learning models on HoloLens!" on the AppConsult blog, and as early as last May my good friend Matteo Pagani wrote on the same blog about his very first experiments with WindowsML - as the technology to run machine learning models on your Windows ('edge') devices is called. Both blog posts use an Image Classification algorithm, which basically tells you whether or not an object is in the image, and what the confidence level of this recognition is.

And then this happened:

"Object Detection finds the location of content within an image" is the definition that pops up if you hover your mouse over the (i) symbol next to "Project Types". So not only do you get a hit and a confidence level, but also the location in the image where the object is.

Now things are getting interesting. I wondered if I could use this technique to detect objects in the picture and then use HoloLens' depth camera to actually guesstimate where those objects were in 3D space.

The short answer: yes. It works surprisingly well.


The global idea

  • User air taps to initiate the process
  • The HoloLens takes a quick picture and uploads the picture to the Custom Vision API
  • HoloLens gets the recognized areas back
  • Calculates the center of each area with a confidence level of at least 0.7
  • 'Projects' these centers on a plane 1 m wide and 0.56 m high that's 1 meter in front of the Camera (i.e. the user's viewpoint)
  • 'Shoots' rays from the Camera through the projected center points and checks if and where they strike the Spatial Map
  • Places labels on the detected points (if any).

Part 1: creating and training the model

Matteo already wrote about how simple it actually is to create an empty model in CustomVision.ai, so I will skip that part. Inspired by his article I wanted to recognize airplanes as well, but I opted for model airplanes - much easier to test with than actual airplanes. So I dusted off all the plastic airplane models I had built during my late teens - this was a thing shy adolescent geeks like me sometimes did, back in the Jurassic when I grew up ;) - it helped we did not have to spend 4 hours per day on social media ;). But I digress. I took a bunch of pictures of them:


And then, picture by picture, I had to mark and label the areas which contain the desired objects. This is what is different from training a model for 'mere' image classification: you have to mark every occurrence of your desired object.


This is very easy to do. It's a bit boring and repetitive, but learning stuff takes sacrifices, and in the end I had quite an ok model. You train it in just the same way as Matteo already wrote about - by hitting the big green 'Train' button that's kind of hard to miss on the top right.

When you are done, you will need two things:

  • The Prediction URL
  • The Prediction key.

You can get those by clicking the "Performance" tab on top:


Then click the "Prediction URL" tab


And this will make this popup appear with the necessary information


Part 2: Building the HoloLens app to use the model


The app is basically using three main components:

  • CameraCapture
  • ObjectRecognizer
  • ObjectLabeler

They sit in the Managers object and communicate using the Messenger that I wrote about earlier.

Part 2a: CameraCapture gets a picture - when you air tap

It's not exactly clear who originally came up with a saying like "great artists steal", but although I don't claim any greatness, I do steal. CameraCapture is a slightly adapted version of this article in the Unity documentation. There are only a few changes. The original always captures the image in the "BGRA32" format, as this can be used as texture on a plane or quad. Unfortunately that is not a format CustomVision accepts. The app does show the picture it takes before the user's eye if the DebugPane property is set to a game object (in the demo project it is). Should you not desire this, simply clear the "Debug Pane" field in the "Camera Capture" script in the Unity editor.

So what you basically see is that CameraCapture takes a picture in a format based upon whether or not the DebugPane is present:

 pixelFormat = _debugPane != null ? CapturePixelFormat.BGRA32 : CapturePixelFormat.JPEG;

and then either directly copies the captured (JPEG) photo into the photoBuffer, or shows it on the DebugPane as BGRA32 and converts it to JPEG from there:

void OnCapturedPhotoToMemory(PhotoCapture.PhotoCaptureResult result, 
    PhotoCaptureFrame photoCaptureFrame)
{
    var photoBuffer = new List<byte>();
    if (photoCaptureFrame.pixelFormat == CapturePixelFormat.JPEG)
    {
        photoCaptureFrame.CopyRawImageDataIntoBuffer(photoBuffer);
    }
    else
    {
        photoBuffer = ConvertAndShowOnDebugPane(photoCaptureFrame);
    }

    Messenger.Instance.Broadcast(
        new PhotoCaptureMessage(photoBuffer, _cameraResolution, CopyCameraTransForm()));

    // Deactivate our camera
    _photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
}

The display and conversion is done this way:

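private List<byte> ConvertAndShowOnDebugPane(PhotoCaptureFrame photoCaptureFrame)
{
    var targetTexture = new Texture2D(_cameraResolution.width, 
        _cameraResolution.height);
    photoCaptureFrame.UploadImageDataToTexture(targetTexture);
    Destroy(_debugPane.GetComponent<Renderer>().material.mainTexture);
    _debugPane.GetComponent<Renderer>().material.mainTexture = targetTexture;
    _debugPane.SetActive(true); // assumption: the demo project may activate a parent object instead
    return new List<byte>(targetTexture.EncodeToJPG());
}

It creates a texture, uploads the buffer into it, destroys the current texture and sets the new texture. Then the game object is actually displayed, and the texture is used to convert the image to JPEG.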

Either way, the result is a JPEG, and the buffer contents are sent in a message, together with the camera resolution and a copy of the Camera's transform. We need the resolution to calculate the height/width ratio of the picture, and we need to retain the transform because the user may have moved between the picture being taken and the result coming back. You can't just send the Camera's transform itself, as it changes when the user moves. So you have to send a 'copy', which is made by this rather crude method, using a temporary empty game object:

private Transform CopyCameraTransForm()
{
    var g = new GameObject();
    g.transform.position = CameraCache.Main.transform.position;
    g.transform.rotation = CameraCache.Main.transform.rotation;
    g.transform.localScale = CameraCache.Main.transform.localScale;
    return g.transform;
}

Part 2b: ObjectRecognizer sends it to CustomVision.ai and reads results

The ObjectRecognizer is, apart from some song and dance to pick the message apart and start a Coroutine, a fairly simple matter. This part does all the work:

private IEnumerator RecognizeObjectsInternal(IEnumerable<byte> image, 
    Resolution cameraResolution, Transform cameraTransform)
{
    var request = UnityWebRequest.Post(_liveDataUrl, string.Empty);
    request.SetRequestHeader("Prediction-Key", _predictionKey);
    request.SetRequestHeader("Content-Type", "application/octet-stream");
    request.uploadHandler = new UploadHandlerRaw(image.ToArray());
    yield return request.SendWebRequest();
    var text = request.downloadHandler.text;
    var result = JsonConvert.DeserializeObject<CustomVisionResult>(text);
    if (result != null)
    {
        // Keep only the predictions we are reasonably sure about
        result.Predictions.RemoveAll(p => p.Probability < 0.7);
        Debug.Log("#Predictions = " + result.Predictions.Count);
        Messenger.Instance.Broadcast(
            new ObjectRecognitionResultMessage(result.Predictions, 
            cameraResolution, cameraTransform));
    }
    else
    {
        Debug.Log("Predictions is null");
    }
}

You will need to set the _liveDataUrl and _predictionKey values via the editor, as you could see in the image just below the Part 2a header. This behaviour creates a web request to the prediction URL, and adds the prediction key and the right content type as headers. The body content is set to the binary image data using an UploadHandlerRaw. And then the request is sent to CustomVision.ai. The result is then deserialized into a CustomVisionResult object, all the predictions with a probability lower than the 0.7 threshold are removed, and the predictions are put back into a message, to be sent to the ObjectLabeler, together once again with the camera's resolution and transform.

A little note: the CustomVisionResult together with all the classes it uses are in the CustomVisionResult.cs file in the demo project. This code was generated by first executing the SendWebRequest and then copying the raw output of "request.downloadHandler.text" into QuickType. It's an ideal site to quickly make classes for JSON serialization.

It is interesting to note that Custom Vision returns bounding boxes as left, top, width and height - in values that are always between 0 and 1. So (0,0) is the top left of the picture and (1,1) is the bottom right, regardless of the height/width ratio of your picture. So if your picture is not square (and most cameras don't create square pictures) you need to know the actual width and height of your picture - that way, you can calculate what pixel coordinates actually correspond to the numbers Custom Vision returns. And that's exactly what the next step does.
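As a minimal sketch of that conversion (this is not code from the demo project; the NormalizedBox and PixelRect types and the ToPixels helper are hypothetical names introduced only for illustration), the relative values can be scaled by the actual picture size like this:

```csharp
using System;

// Hypothetical stand-in for the bounding box Custom Vision returns:
// all four values are fractions of the picture size, between 0 and 1.
public struct NormalizedBox
{
    public double Left, Top, Width, Height;

    public NormalizedBox(double left, double top, double width, double height)
    {
        Left = left; Top = top; Width = width; Height = height;
    }
}

// Hypothetical rectangle in pixels, origin at the top left of the picture.
public struct PixelRect
{
    public int X, Y, Width, Height;

    public PixelRect(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }
}

public static class BoundingBoxMath
{
    // Scales the relative box by the actual picture size in pixels.
    public static PixelRect ToPixels(NormalizedBox box, int imageWidth, int imageHeight)
    {
        return new PixelRect(
            (int)Math.Round(box.Left * imageWidth),
            (int)Math.Round(box.Top * imageHeight),
            (int)Math.Round(box.Width * imageWidth),
            (int)Math.Round(box.Height * imageHeight));
    }
}
```

For a 2048x1152 picture, a box with left 0.25, top 0.5, width 0.5 and height 0.25 ends up at pixel (512, 576) with a size of 1024x288 pixels.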

Part 2c: ObjectLabeler shoots for the Spatial Map and places labels

The ObjectLabeler contains pretty little code as well, although the calculations may need a bit of explanation. The central piece of code is this method:

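public virtual void LabelObjects(IList<Prediction> predictions, 
    Resolution cameraResolution, Transform cameraTransform)
{
    // Remove labels from a previous run (helper name assumed, not shown here)
    ClearLabels();
    // Cast to float: Resolution.width and height are ints, and integer
    // division would make this 0 instead of 0.5625
    var heightFactor = (float)cameraResolution.height / cameraResolution.width;
    var topCorner = cameraTransform.position + cameraTransform.forward -
                    cameraTransform.right / 2f +
                    cameraTransform.up * heightFactor / 2f;
    foreach (var prediction in predictions)
    {
        var center = prediction.GetCenter();
        var recognizedPos = topCorner + cameraTransform.right * center.x -
                            cameraTransform.up * center.y * heightFactor;

        var labelPos = DoRaycastOnSpatialMap(cameraTransform, recognizedPos);
        if (labelPos != null)
        {
            _createdObjects.Add(CreateLabel(_labelText, labelPos.Value));
        }
    }

    if (_debugObject != null)
    {
        _debugObject.SetActive(false);
    }

    // Clean up the temporary game object that carried the camera transform copy
    Destroy(cameraTransform.gameObject);
}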

First, we clear any labels that might have been created in a previous run. Then we calculate the height/width ratio of the picture (this is 2048x1152, so heightFactor will always be 0.5625, but why hard code something that can be calculated?). Then comes the first interesting part. Remember that I wrote we are projecting the picture on a plane 1 meter in front of the user. We do this because the picture then looks pretty much life-sized. So we need to go forward 1 meter from the camera position:

cameraTransform.position + cameraTransform.forward.normalized

But then we end up in the center of the plane. We need to get to the top left corner as a starting point. So we go half a meter to the left (actually, -1 * right, which amounts to left), then half the height factor up:

cameraTransform.up * heightFactor / 2f

In a picture, it looks like this:


Once we are there, we calculate the center of the prediction using a very simple extension method:

public static Vector2 GetCenter(this Prediction p)
{
    return new Vector2((float) (p.BoundingBox.Left + (0.5 * p.BoundingBox.Width)),
        (float) (p.BoundingBox.Top + (0.5 * p.BoundingBox.Height)));
}

To find the actual location on the image, we basically use the same trick again in reverse: first move to the right the amount the x is from the top corner

var recognizedPos = topCorner + cameraTransform.right * center.x

And then a bit down again (actually, -up) using the y value scaled for height:

-cameraTransform.up * center.y * heightFactor;
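To make the arithmetic concrete, here is a small stand-alone sketch of the same calculation. It is not part of the demo project: it uses System.Numerics.Vector3 instead of Unity's Vector3 so it can run outside Unity, and the PlaneProjection class and its method names are hypothetical. The camera's position and its forward, right and up vectors are passed in explicitly, taking the place of the Transform in the real code.

```csharp
using System;
using System.Numerics;

public static class PlaneProjection
{
    // Top left corner of a plane 1 m wide and heightFactor m high,
    // centered 1 m in front of the camera.
    public static Vector3 TopCorner(Vector3 position, Vector3 forward,
        Vector3 right, Vector3 up, float heightFactor)
    {
        return position + forward - right / 2f + up * heightFactor / 2f;
    }

    // Maps a relative picture coordinate (0..1 in both directions,
    // origin at the top left) to a point on that plane.
    public static Vector3 ToPlanePoint(Vector3 topCorner, Vector3 right,
        Vector3 up, Vector2 center, float heightFactor)
    {
        return topCorner + right * center.X - up * center.Y * heightFactor;
    }
}
```

For a camera at the origin looking along +Z, with +X to the right and +Y up, and a heightFactor of 0.5625, the top left corner lands at (-0.5, 0.28125, 1). A prediction centered at (0.5, 0.5) maps to (0, 0, 1), straight ahead of the camera, and one at (1, 1) maps to the bottom right corner at (0.5, -0.28125, 1).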

Then we simply do a ray cast to the spatial map from the camera position through the location we calculated, basically shooting 'through' the picture for the real object.

private Vector3? DoRaycastOnSpatialMap(Transform cameraTransform, 
                                       Vector3 recognitionCenterPos)
{
    RaycastHit hitInfo;

    // Ray from the camera position through the projected point, max 10 m
    if (SpatialMappingManager.Instance != null && 
        Physics.Raycast(cameraTransform.position, 
                       (recognitionCenterPos - cameraTransform.position), 
            out hitInfo, 10, SpatialMappingManager.Instance.LayerMask))
    {
        return hitInfo.point;
    }
    return null;
}

and create the label at the right spot. I copied the code for creating the label from two posts ago, so I will skip repeating that here.

There is a little bit I want to repeat here:

if (_debugObject != null)
{
    _debugObject.SetActive(false);
}
Destroy(cameraTransform.gameObject);

If the debug object is set (that is to say, the plane showing the photo HoloLens takes to upload) it will be turned off here, otherwise it obscures the actual labels. But more important is the last line: I created the copy of the camera's transform using a temporary game object. As the user keeps on shooting pictures those will add up and clutter the scene. So after the work is done, I clean it up.

And the result...

The annoying thing is, as always, that I can't show you a video of the whole process, as any video recording stops as soon as the app takes a picture. So the only thing I can show you is this kind of doctored video - I restarted video immediately after taking the picture, but I miss the part where the actual picture is floating in front of the user. This is what it looks like, though, if you disable the debug pane from the Camera Capture script:

Lessons learned

  • There is a reason why Microsoft says you need at least 50 pictures for somewhat reliable recognition. I took about 35 pictures of about 10 different models of airplanes. I think I should have taken more like 500 pictures (50 of every type of model airplane) and then things would have gone a lot better. Nevertheless, it already works pretty well.
  • If the camera you use is pretty so-so (exhibit A: the HoloLens built-in video camera) it does not exactly help if your training pictures are made with a high end DSLR, which shoots in great detail, handles adverse lighting conditions superbly, and never, ever has a blurry picture.


Three simple objects to call a remote Custom Vision Object Recognition Machine Learning model and translate its result into a 3D label. Basically a Vuforia-like application, but using 'artificial intelligence'. I love the way Microsoft are taking the very thing they really excel in - democratizing and commoditizing complex technologies into usable tools - to the Machine Learning space.

The app I made is quite primitive, and it also has a noticeable 'thinking moment' - since the model lives in the cloud and has to be accessed via an HTTP call. This is because the model is not a 'compact' model, therefore it's not downloadable and it can't run on WindowsML. We will see what the future has in store for these kinds of models. But the app shows what's possible with these kinds of technologies, and it makes the prospect of a next version of HoloLens having an AI coprocessor all the more exciting!

Demo project - without the model, unfortunately - can be downloaded here. 

05 November 2018

Adjusting and animating HoloLens/Mixed Reality holograms using Unity animations


Of course you can script literally all animations using (something like) LeanTween, but you can also animate things using Unity animations. I have been using it primarily for basic repetitive animations, like the spinning of aircraft propellers and helicopter rotors in AMS HoloATC. Sometimes models come with built-in animation, sometimes they don't. You can add it yourself, with some fiddling around.


First, a model...

I wanted to show the model I used for AMS HoloATC, but I could not find it anymore. The trouble with Asset Stores is that people may add models as they see fit, but can also remove them again. So for this sample I took another helicopter - this free model of an Aerospatiale 342 Gazelle.

..then a project...

This is the usual stuff:

  • import the Mixed Reality Toolkit,
  • configure the scene, project and capability settings
  • then import the model into your project.

... and then we find the rotor components

We drag the helicopter inside the Hierarchy (it will appear at 0,0,0 with rotation 0,0,0) and we rotate the view so that we look on top of it. We want the rotor to animate, so we will need to find out which of the components make up the rotor. If you click a rotor blade once it will select the whole helicopter, but if you click it again, the hierarchy will jump to the actual sub component making up a rotor blade.


So the blade pointing to top/left is "Component#5_001". The other rotor blades are "Component#5_002" (pointing right) and "Component#5_003" (pointing bottom/left). We also identify the top of the rotor, which is component "Component#9_001".


What you now need to do is create an empty game object "Rotor" inside the helicopter game object and drag the four components inside the Rotor game object. Unity will warn you that you are breaking things.


but in this case we don't care.


Done! Now we can rotate the rotor. But the observant reader has already spotted there is a problem, which will become apparent if we set the Y rotation of the new "Rotor" object to, for instance, 150:


Great. The pivot point of the rotor - the point where the red and blue arrows hit the green square - is apparently not the visual center. This seems to happen rather often with imported models. I am not quite sure what causes it, but I know how you can fix it. And I am going to show it, too ;).

Some advanced fiddling to make the visual center the pivot point

First of all, make sure the Tool Handle Position is set to Pivot:


You will find this at the top left of the scene window.

Set Rotor Y rotation back to 0, create an empty game object "InnerRotor" inside "Rotor" and drag all the components inside InnerRotor. Like this:


... and then you select the Rotor object and press CTRL+D, duplicating it.

Then you select the Rotor component again. If you view the Pivot Point - actually sporting three arrows in this view - you will see it's quite a bit from where we want the center of the rotor to be. You will need to move that point manually to where the visual center of the Rotor is. The copy of the Rotor will help you identify that point.

It takes quite some fiddling to get it right. After a few minutes of playing around, I came to these values:


... but now the actual visual rotor is floating high above the helicopter!


This is what the InnerRotor object is for. For its X/Y/Z position values enter the exact negatives of the Rotor values, so:


And boom. The Rotor falls once again on the helicopter. And now if you set the Y rotation for Rotor to 150:


You can check if the rotor stays in place by selecting the Y rotation textbox and click-and-dragging over it. The rotation will then change and you get a view as if the rotor is actually rotating a bit.

If you do this yourself on another hologram and the rotor still does not stay in the center while rotating, set the InnerRotor position values back to 0, and fiddle a bit more till it fits. It also helps to make the total model bigger (so the whole of the helicopter) while doing this. For some reason it's hard to zoom in on small models, but easy on big ones.

Once you are satisfied, you can delete or disable the Rotor (1) copy as we don't need it anymore. After you have done this, it is maybe a good moment to make a new prefab of your adapted helicopter.

And now - finally some animation

It took me quite some fiddling around to find the finer details of the timeline editor, so I am writing a very detailed step-by-step guide. I am sure there are smarter ways to do this, but this is how I start:

  • I select the game object I want to animate
  • Then I click Window/Animation and that brings up this pane:


By default this window appears as a floating window. I just drag it into the bottom pane with the Game and Console windows.

Then I select the Create button. This prompts me to make an animation file, which I make in an Animation folder:


And then we get another button:


If we click "Add Property" we get this popup


Expand the Transform entry:


Then click the + next to "Rotation". This will add the Rotor rotation to the timeline. People who have ever used Blend will now suddenly sit up straight because they see something familiar - I know I did!


Expand Rotor: Rotation


At the 1:00 mark, click the top diamond. All the diamonds at the 1:00 mark will turn blue.


And then hit the delete button on your keyboard. All diamonds at the 1:00 mark will disappear.

Now click at the timeline bar on top, at the 0:10 mark:


The timeline will jump to 0:10. If you look at the Rotor's properties in the inspector, you will notice the properties for Rotation X/Y/Z have turned blue:


Change Y into 120 (it will turn red)


Now, and this is the tricky part: double click in the timeline editor at the place where the white vertical line intersects with an imaginary horizontal line through the "Rotation.y" property text:


X marks the spot ;). This should be the result:


Now click at the top bar again, at the 0:20 mark. Change the value of the Y rotation in the inspector to 240 and double-click at the imaginary intersection point again. Repeat for the 0:30 mark, here use value 360.

Then click the little play button on the Animation pane and the rotor will be spinning. You will notice that the movement stutters a bit, but you can speed it up a little by increasing the Samples value, as displayed in the video below:

Job done. Now finally drag the Helicopter over the already created prefab, and you can create as many animated helicopters as you want. As soon as you hit Unity's "Play" button, all rotors will start spinning.


The animator is not very intuitive at first, hence the step-by-step thing I wrote. It is pretty powerful though, especially for simple repetitive animations. I am sure you can do lots more with it. Be aware all these animations use a bit of performance, so spawning 15000 helicopters with spinning blades in a HoloLens may not be such a great idea. I think that will be true for helicopters without spinning rotors too, but that's not the point.
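As a footnote on the Samples setting: a small worked calculation shows why a higher value makes the rotor spin faster. This is only a sketch under two assumptions: that the timeline marks are read as seconds:frames (so 0:30 is frame 30), and that the clip uses 60 samples per second unless you change it. The AnimationTiming class and its methods are hypothetical helpers, not Unity API.

```csharp
using System;

public static class AnimationTiming
{
    // Converts a key frame position (frame number on the timeline) to
    // seconds for a clip with the given sample rate (frames per second).
    public static float FrameToSeconds(int frame, float samplesPerSecond)
    {
        return frame / samplesPerSecond;
    }

    // Rotation speed in degrees per second when the rotor reaches
    // totalDegrees at the given last key frame.
    public static float DegreesPerSecond(float totalDegrees, int lastFrame,
        float samplesPerSecond)
    {
        return totalDegrees / FrameToSeconds(lastFrame, samplesPerSecond);
    }
}
```

With the key frames from this post (360 degrees reached at frame 30) and 60 samples per second, one full turn takes 0.5 seconds, or 720 degrees per second. Doubling Samples to 120 plays the same 30 frames in 0.25 seconds, so the rotor spins at 1440 degrees per second.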

The demo project (containing 3 helicopters) called GetToTheChoppa ;) can be downloaded here.


02 November 2018

Responding to focus and showing data when tapping a hologram in Mixed Reality/HoloLens apps


One of the things you tend to forget as you progress into a field of expertise is how hard the first steps were. I feel that responding when gaze strikes a hologram or showing some data when a hologram is air tapped are fairly basic - but judging by the number of questions I get for "how to do this", this is apparently not as straightforward as it seems. So I decided to make a little sample that you can get from GitHub here.

Project setup

I created a new project in Unity. Then:

Now this model has a less than ideal size (as in being very big), so I fiddled a bit with the settings to get more or less the view as displayed to the right.


Now this model consists of a lot of sub objects, so this is a nice model to interact with.

Creating interaction

What a hologram needs for interaction is pretty simple:

  • A collider (so the gaze cursor has something to strike against)
  • For an air tap to be intercepted: a behaviour that implements the interface IInputClickHandler
  • For registering focus (that is, the gaze cursor strikes it): a behaviour that implements IFocusable.

But this satellite has like 41 parts, and if we have to manually add a collider and one or two behaviours to each of them, it's a bit bothersome. In this case, we can solve that by using a behaviour that sets that up for us. Mind you, that's not always possible. But for this simple sample we can.

The behaviour looks like this:

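using UnityEngine;

public class InteractionBuilder : MonoBehaviour
{
    [SerializeField] private GameObject _toolTip;
    void Start ()
    {
        foreach (var child in GetComponentsInChildren<MeshFilter>())
        {
            // Assumption: the collider type is a guess, the original line is missing
            child.gameObject.AddComponent<MeshCollider>();
            var displayer = child.gameObject.AddComponent<DataDisplayer>();
            displayer.ToolTip = _toolTip;
        }
    }
}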

It simply finds every MeshFilter child component, and adds a collider and a DataDisplayer to the game object the MeshFilter belongs to. The intention is to drag this on the main CommunicationSatellite object. But not just yet - because for this to work, we first need to create DataDisplayer, which is the behaviour implementing IInputClickHandler and IFocusable.

Show tooltip on click - implementing IInputClickHandler

The first version of DataDisplayer looks like this:

using HoloToolkit.Unity.InputModule;
using HoloToolkit.UX.ToolTips;
using UnityEngine;

public class DataDisplayer : MonoBehaviour, IInputClickHandler
{
    public GameObject ToolTip;

    private GameObject _createdToolTip;

    public void OnInputClicked(InputClickedEventData eventData)
    {
        if (_createdToolTip == null)
        {
            _createdToolTip = Instantiate(ToolTip);
            var toolTip = _createdToolTip.GetComponent<ToolTip>();
            toolTip.ShowOutline = false;
            toolTip.ShowBackground = true;
            toolTip.ToolTipText = gameObject.name;
            toolTip.transform.position = transform.position + Vector3.up * 0.2f;
            toolTip.transform.parent = transform.parent;
            toolTip.AttachPointPosition = transform.position;
            toolTip.ContentParentTransform.localScale = new Vector3(0.05f, 0.05f, 0.05f);
            var connector = toolTip.GetComponent<ToolTipConnector>();
            connector.Target = _createdToolTip;
        }
        else
        {
            // A second tap removes the tooltip again
            Destroy(_createdToolTip);
            _createdToolTip = null;
        }
    }
}

Now this may look a bit complicated, but most of it I just stole from the ToolTipSpawner class. Basically it only does this:

  • When the hologram-part is clicked (and OnInputClicked is called, which is a mandatory method when you implement IInputClickHandler) it checks if a tooltip already exists.
  • If not, it creates one a little above the clicked element
  • If a tooltip already exists, it is deleted again.

This behaviour gets its tooltip prefab handed from the InteractionBuilder. As I said, InteractionBuilder should be dragged on the CommunicationSatellite root hologram, and now that we have built our DataDisplayer, we can actually do so.


The Tooltip field needs to be filled by dragging the Tooltip prefab from HoloToolkit/UX/Prefabs on top of it.


Now, if you tap on an element of the satellite, you will get a tooltip with the name of the element.


Now in itself this is of course pretty much useless, but instead of displaying the name directly you can also use the name or some other attribute of the hologram part to reach out to a web service or a local data file, using that attribute as a key, fetch some data connected to that attribute and show that data. It is a fairly commonly used pattern, but it is up to you - and outside the scope of this blog - to have some data file or web service to connect with.

Highlighting on focus - implementing IFocusable

It's not always easy to see which part of the satellite is hit by the gaze cursor, as they are both quite lightly colored. How about letting the whole part light up in red? We add the following code to DataDisplayer:

public class DataDisplayer : MonoBehaviour, IInputClickHandler, IFocusable
{
    private Dictionary<MeshRenderer, Color[]> _originalColors = 
        new Dictionary<MeshRenderer, Color[]>();

    void Start()
    {
        SaveOriginalColors();
    }

    public void OnFocusEnter()
    {
        SetHighlight(true);
    }

    public void OnFocusExit()
    {
        SetHighlight(false);
    }
}

This looks pretty simple:

  • At start you save a hologram part's original colors
  • If the gaze strikes the hologram, set the highlight colors
  • If the gaze leaves, turn highlight off

It sometimes helps to make code self-explanatory and to write it like this.

So saving the original colors works like this - and it conveniently makes an inventory of the components inside each hologram-part and the materials they use:

private void SaveOriginalColors()
{
    if (!_originalColors.Any())
    {
        foreach (var component in GetComponentsInChildren<MeshRenderer>())
        {
            var colorList = new List<Color>();

            foreach (var t in component.materials)
                colorList.Add(t.color);
            _originalColors.Add(component, colorList.ToArray());
        }
    }
}

Creating the highlight is no longer very complex:

private void SetHighlight(bool status)
{
    var targetColor = Color.red;
    foreach (var component in _originalColors.Keys) 
    {
        for (var i = 0; i < component.materials.Length; i++)
        {
            component.materials[i].color = status ? 
                targetColor : _originalColors[component][i];
        }
    }
}

Basically we set the color of every material inside the component to red if status is true - or to its original color when it is false. And now, if the gaze cursor strikes part of our satellite:



And that's all there is to it. As I wrote, I would suggest showing something other than the name of the hologram, as that is not very interesting. Also, a more elaborate way of showing the data than using the Tooltip might be considered. But the principle - implementing IInputClickHandler and IFocusable and acting on that - stays the same.

The finished demo project can be found here.