A minimal high-performance parallel neural network framework running on iOS

Zhihao Li (zhihaol) and Zhenrui Zhang (zhenruiz)

Checkpoint Report


According to Morgan Stanley Research, as of 2011 half of the computing devices worldwide were mobile devices [1]. Intelligent mobile applications are changing people’s lives. However, after a fairly thorough survey, we found no fully functional deep neural network framework for iOS. Therefore, we want to implement our own.

This framework features a well-designed, easy-to-use API and a high-performance parallel neural network implementation based on Metal.

With such a framework, software engineers can easily train and test networks on their iOS devices. This can potentially lead to many interesting applications: for example, an application that recognizes everyday objects in real time without an internet connection, or a photo collection application that recognizes all your friends based on personalized fine-tuning, without any threat to privacy.

We envision a large future market for such a framework.

The Challenge

The task of training and running neural networks on an iOS device is itself challenging.

Despite these challenges, the project is still promising. Training and testing neural networks is highly parallelizable, because the computations inside a layer are independent of one another (we currently don’t support intra-layer connections). Locality should be good, since weights within the same layer are stored adjacently. And there is typically little divergence in network training and testing: all the weights within a layer are updated at the same time.
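The independence and locality argument can be illustrated with a small sketch. This is not the framework's code: it is plain Python, and the names `neuron_output`, `forward_sequential`, and `forward_parallel` are hypothetical. It assumes a fully connected layer whose weights are kept in one flat, row-major list, so the weights feeding one output neuron are adjacent in memory, and it computes every output neuron as a separate task because no neuron depends on another neuron in the same layer.

```python
from concurrent.futures import ThreadPoolExecutor


def neuron_output(weights, bias, inputs, j):
    """Compute output neuron j of a fully connected layer.

    Assumption: weights is a flat, row-major list, so the weights
    feeding neuron j are the adjacent slice [j * n_in, (j + 1) * n_in).
    """
    n_in = len(inputs)
    row = weights[j * n_in:(j + 1) * n_in]
    return sum(w * x for w, x in zip(row, inputs)) + bias[j]


def forward_sequential(weights, bias, inputs):
    """Reference version: compute the neurons one after another."""
    return [neuron_output(weights, bias, inputs, j) for j in range(len(bias))]


def forward_parallel(weights, bias, inputs, workers=4):
    """Compute each neuron as an independent task.

    No task reads another task's result, so the tasks can run in any
    order; pool.map still returns the results in neuron order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda j: neuron_output(weights, bias, inputs, j),
            range(len(bias))))


# Toy layer: 3 inputs, 2 output neurons.
weights = [1.0, 2.0, 3.0,    # weights feeding neuron 0
           0.5, -1.0, 0.0]   # weights feeding neuron 1
bias = [0.5, 1.0]
inputs = [1.0, 2.0, 3.0]

print(forward_sequential(weights, bias, inputs))  # [14.5, -0.5]
print(forward_parallel(weights, bias, inputs))    # [14.5, -0.5]
```

Python threads are used here only to show that the per-neuron tasks are independent; they are not expected to give a real speedup. On a GPU, each such task would presumably map to one thread of a compute kernel, with neighboring threads reading neighboring weights.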


We will start the project from scratch. The framework will mainly run on iOS devices, with limited support for OS X devices. We will use the high-level architecture of Caffe [7] as our reference.

We need an Apple developer account to test the framework on real devices.

Goals and Deliverables


In this project, we want to develop a Caffe-like deep neural network framework running on iOS/OS X devices, on both the CPU and the GPU, that provides usable primitives to

To achieve this, we will implement

We want our system to be usable on mobile devices. The performance goal is therefore to keep memory use, energy use, computation cost, and response time at a level users find acceptable when training on a reasonably sized dataset and when running a compressed model.


If we are ahead of schedule, we plan to port some other layers and optimizers from Caffe to our framework.


We will demonstrate an application built on our framework. It could be an application that recognizes objects. We will also compare the CPU and GPU implementations in terms of speedup and energy consumption.
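One way the speedup comparison could be computed is sketched below. This is only an illustration under stated assumptions: the helper names `median_runtime` and `speedup` are hypothetical, the two workloads are passed in as ordinary callables, and energy consumption is not covered because it has to be measured with platform tools rather than in code like this.

```python
import time
from statistics import median


def median_runtime(fn, repeats=5):
    """Return the median wall-clock time, in seconds, of calling fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds, optimized_seconds):
    """Speedup of the optimized run over the baseline (> 1 means faster)."""
    return baseline_seconds / optimized_seconds


# Example with made-up timings: a 2.0 s baseline and a 0.5 s optimized run.
print(speedup(2.0, 0.5))  # 4.0
```

Taking the median over several runs is a design choice that reduces the effect of one-off slow runs, such as a first run that includes setup cost.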

Platform Choice

OS X and iOS, based on the Metal framework.


Time | What we plan to do | Status
April 1 ~ April 7 | Revise proposal, study the design and architecture of Caffe, learn Swift language and Metal API, implement a simple App for testing, design interfaces for espresso | DONE
April 8 ~ April 14 | Develop and test the CPU version | Finished development, need more thorough testing
April 15 ~ April 21 | Develop and test the GPU version | Finished development, need testing
April 22 ~ April 28 | Run MNIST network (and test our implementations) |
April 29 ~ May 5 | Run a compressed model trained by Caffe or other common frameworks |
May 6 ~ Parallel Competition Day | Write final report and prepare for presentation |

  1. Huberty, K., Lipacis, C. M., Holt, A., Gelblum, E., Devitt, S., Swinburne, B., … & Chen, G. “Tablet Demand and Disruption.” Morgan Stanley Research (2011).

  2. Kim, Yong-Deok, et al. “Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications.” arXiv preprint arXiv:1511.06530 (2015).

  3. Han, Song, Huizi Mao, and William J. Dally. “A deep neural network compression pipeline: Pruning, quantization, huffman encoding.” arXiv preprint arXiv:1510.00149 (2015).

  4. Chen, Wenlin, et al. “Compressing neural networks with the hashing trick.” arXiv preprint arXiv:1504.04788 (2015).

  5. Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. “Low precision arithmetic for deep learning.” arXiv preprint arXiv:1412.7024 (2014).

  6. Gupta, Suyog, et al. “Deep learning with limited numerical precision.” arXiv preprint arXiv:1502.02551 (2015).

  7. Jia, Yangqing, et al. “Caffe: Convolutional architecture for fast feature embedding.” Proceedings of the ACM International Conference on Multimedia. ACM, 2014.