Link Search Menu Expand Document


Table of contents

  1. Overview
  2. Configuration
  3. Output
  4. Benchmarks
    1. VOXL
    2. VOXL 2
  5. Using Your Own Models
    1. Post-Training Quantization
    2. Implementing your Model in voxl-tflite-server
  6. Running Multiple Models
  7. Demo
  8. Source


voxl-tflite-server is a harware accelerated TensorFlow Lite environment that can run as a a background systemd service. This server relies on the custom voxl-tflite package, which provides access to different accelerators specific to your hardware platform.



voxl-tflite-server includes a helper script, voxl-configure-tflite, which will walk you through the available default configurations depending on your platform. This script will generate a config file at /etc/modalai/voxl-tflite-server.conf that will be used on initialization.

Available Params:

  • skip_n_frames - how many frames to skip between processed frames. For 30Hz input frame rate, we recommend skipping 5 frame resulting in 5hz model output. For 30Hz/maximum output, set to 0.
  • model - which model to use.
  • input_pipe - which camera to use (tracking, hires, or stereo).
  • delegate - optional hardware acceleration. If the selection is invalid for the current model/hardware, will quietly fall back to base cpu delegate.

Voxl 2 Additional Params:

  • allow_multiple - remove process handling and allow multiple instances of voxl-tflite-server to run. Enables the ability to run multiples models simultaneously.
  • output_pipe_prefix - if allow_multiple is set, create output pipes using default names (tflite, tflite_data) with added prefix. ONLY USED IF allow_multiple is set to true.


  • The model parameter can be changed to a custom model- it just expects the full path to a .tflite model file.
  • There are various delegates available depending on your platform, but they each require different quantization types for optimal performance (i.e. when targeting the gpu, we typically use floating point models, but for the npu we use integer models).


The output stream is setup as a normal camera pipe under /run/mpa/tflite. As such, it can be viewed with voxl-portal, converted to ROS with voxl_mpa_to_ros, and logged to disk with voxl-logger.

Each model will provide a different custom overlay, depending on the task it is doing. The overlays will all contain an fps counter + inference timer in the top left.


When running an object detection model, you will also see the pipe /run/mpa/tflite_data. This output pipe will provide some metadata about each object detected by the model. The format of this metadata is as follows:

// struct containing all relevant metadata to a tflite object detection
typedef struct ai_detection_t {
    uint32_t magic_number;
    int64_t timestamp_ns;
    uint32_t class_id;
    int32_t  frame_id;
    char class_name[BUF_LEN];
    char cam[BUF_LEN];
    float class_confidence;
    float detection_confidence;
    float x_min;
    float y_min;
    float x_max;
    float y_max;
} __attribute__((packed)) ai_detection_t;

The tflite_data output is also compatible with voxl_mpa_to_ros, or can be subscribed to with a custom MPA client of your own.


Stats were collected across 5000 inferences, using hires [640x480x3] camera as input.


ModelTaskAvg Cpu Inference(ms)Avg Gpu Inference(ms)Max Frames Per Second(fps)Input DimensionsSource
MobileNetV2-SSDliteObject Detection127.78ms21.82ms37.28560776[1,300,300,3]link
MobileNetV1-SSDObject Detection75.48ms64.40ms14.619883041[1,300,300,3]link


ModelTaskAvg Cpu Inference(ms)Avg Gpu Inference(ms)Avg NNAPI Inference(ms)Max Frames Per Second(fps)Input DimensionsSource
MobileNetV2-SSDliteObject Detection33.89ms24.68ms34.42ms34.86750349[1,300,300,3]link
EfficientNet Lite4Classifier115.30ms24.74ms16.42ms48.97159647[1,300,300,3]link
FastDepthMonocular Depth37.34ms18.00ms37.32ms45.45454546[1,320,320,3]link
Movenet SinglePose LightningPose Estimation24.58ms28.49ms24.61ms34.98950315[1,192,192,3]link
YoloV5Object Detection88.49ms23.37ms83.87ms36.53635367[1,320,320,3]link
MobileNetV1-SSDObject Detection19.56ms21.35ms7.72ms85.324232082[1,300,300,3]link

Using Your Own Models

The compatibility of a model with this server is also dependent upon your hardware platfrom. For Voxl, this server can run any Tensorflow lite model using the v2.2.3 supported opsets(TFLITE_BUILTINS and custom opsets). See the Tensorflow Guide for more information on opsets. Voxl 2 is dependent on the v2.8.0 opset.

Post-Training Quantization

Once you have a model that is ready to convert to tflite format, it is neccesary to do some post-training quantization. This process will enable hardware acceleration on the gpu and ensure it is compatible with the voxl-tflite-server.

A simple python script can be used for this process. Below is an example that will convert the frozen inference graph of a tensorflow object detection model to a tflite model using our standard options (float16 quantization for gpu target):

# i.e., pip install tensorflow==2.2.3
import tensorflow as tf

# if you have a saved model and not a frozen graph, see: 
# tf.compat.v1.lite.TFLiteConverter.from_saved_model()

# please check these by opening up your frozen graph/saved model in a tool like netron
converter =  tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
  graph_def_file = '/path/to/.pb/file/tflite_graph.pb', 
  input_arrays = ['normalized_input_image_tensor'],
  input_shapes={'normalized_input_image_tensor': [1,300,300,3]},
  output_arrays = ['TFLite_Detection_PostProcess', 'TFLite_Detection_PostProcess:1', 'TFLite_Detection_PostProcess:2', 'TFLite_Detection_PostProcess:3'] 

converter.use_experimental_new_converter = True
converter.allow_custom_ops = True
converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
with'model_converted.tflite', 'wb') as f:

If you are using Voxl 2, the conversion process is the same but you can use the current version of the converter api and Tensorflow v2.8.0.

Implementing your Model in voxl-tflite-server

Once you have a model ready, voxl-tflite-server is ready for easy integration via the InferenceHelper class.

This class will set up basic model-specific parameters for you and provides generic inference and preprocessing functions that will apply to any valid model. However, if you are looking to alter the input image (beyond resizing) before using it for inference, you will need to make changes to the generic preprocess_image function.

Otherwise, you will likely just need to write your own postprocessing function. There are a few examples provided, each with a slightly different use case and different output tensor to handle. If you are unsure of the output format for your specific model, tools like Netron can be extremely useful for determing input/output specifications. Another great resource is the TFlite examples page, which describes some of the most common tensorflow lite tasks and how to implement them.

Running Multiple Models

On VOXL 2, we expose some extra parameters to enable running multiple instances of voxl-tflite-server simultaneously. Doing so will require the following configuration file modifications:

  • first set allow_multiple to true
  • then, to prevent overwriting pipe names, set the output_pipe_prefix to something unique (relative to the other instances of the tflite-server you will be running)
  • finally, you can start the server via a custom service file (see the base file for reference here) or via shell as normal
  • before starting another instance, update the output_pipe_prefix to prevent overwriting, and continue as before.

NOTE: The board will heat up much quicker when running multiple models. Adding an external fan if not in flight can help combat the extra heat.



Source code is available on Gitlab