
Model serving overview

Multi-framework serving

Kubeflow provides two supported open source systems for multi-framework model serving: KFServing and Seldon Core. You should choose the system that best supports your model serving requirements. A rough comparison between KFServing and Seldon Core is shown below:

Feature   | Sub-feature                      | KFServing | Seldon
----------|----------------------------------|-----------|----------------
Framework | TensorFlow                       | x         | x
          | XGBoost                          | x         | x
          | scikit-learn                     | x         | x
          | NVIDIA TensorRT Inference Server | x         | x
          | ONNX                             | x         | x
          | PyTorch                          | x         | x
Graph     | Transformers                     | x         | x
          | Combiners                        | Roadmap   | x
          | Routers incl (MAB)               | Roadmap   | x
Analytics | Explanations                     | x         | x
Scaling   | Knative                          | x         |
          | GPU AutoScaling                  | x         |
          | HPA                              | x         | x
Custom    | Container                        | x         | x
          | Language Wrappers                |           | Python, Java, R
          | Multi-Container                  |           | x
Rollout   | Canary                           | x         | x
          | Shadow                           |           | x
Istio     |                                  | x         | x

Notes:

  • Both projects share technology, including explainability (via Seldon Alibi Explain) and payload logging, amongst other areas.
  • A commercial product, Seldon Deploy, is available from Seldon and supports both KFServing and Seldon Core in production.
  • KFServing is part of the Kubeflow project ecosystem. Seldon Core is an external project supported within Kubeflow.
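
As an illustration of how a KFServing deployment is created, the sketch below defines an InferenceService for a TensorFlow model through the generic Kubernetes Python client. The namespace, service name, and storage URI are placeholders, and the resource version (v1alpha2 here) depends on the KFServing release installed in your cluster.

```python
# Minimal sketch: create a KFServing InferenceService for a TensorFlow model
# using the generic Kubernetes Python client. Names, namespace and storage URI
# are placeholders; the API version depends on your KFServing release.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "flowers-sample", "namespace": "kubeflow"},
    "spec": {
        "default": {
            "predictor": {
                "tensorflow": {
                    # Path to an exported TensorFlow SavedModel (placeholder).
                    "storageUri": "gs://your-bucket/models/flowers"
                }
            }
        }
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="serving.kubeflow.org",
    version="v1alpha2",
    namespace="kubeflow",
    plural="inferenceservices",
    body=inference_service,
)
```

Seldon Core deployments can be created in the same way, using the SeldonDeployment custom resource from the machinelearning.seldon.io API group instead.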

For further information, see the KFServing and Seldon Core documentation.

TensorFlow Serving

For TensorFlow models you can use TensorFlow Serving for both real-time and batch prediction. Documentation is also provided on using TensorFlow Serving via Istio. However, if you plan to use multiple frameworks, we suggest you use KFServing or Seldon Core as described above.
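
As a minimal sketch of a real-time request, assuming a model already deployed under the name my_model and TensorFlow Serving's default REST port 8501, a prediction call looks like the following; the host, model name, and input shape are placeholders for your own deployment.

```python
# Minimal sketch: call a running TensorFlow Serving instance over its REST API.
# Host, port, model name and input shape are placeholders; TensorFlow Serving
# exposes its REST endpoint on port 8501 by default.
import requests

url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 5.0]]}  # one row matching the model's input signature

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json()["predictions"])
```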

NVIDIA TensorRT Inference Server

NVIDIA TensorRT Inference Server is a REST and gRPC service for deep-learning inference of TensorRT, TensorFlow, and Caffe2 models. The server is optimized to deploy machine learning and deep learning models on both GPUs and CPUs at scale.

You can use NVIDIA TensorRT Inference Server standalone, but we also recommend that you look at KFServing, which includes support for NVIDIA TensorRT Inference Server.
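
When running the server standalone, the sketch below checks that it is ready before sending inference requests. The host, default HTTP port 8000, and endpoint paths are assumptions based on the server's v1 HTTP API and may differ between releases, so treat this as an outline rather than a definitive client.

```python
# Minimal sketch: check a standalone NVIDIA TensorRT Inference Server before
# sending requests. Host, port (8000 assumed as the default HTTP port) and
# endpoint paths are assumptions based on the server's v1 HTTP API; consult
# the documentation for your server version.
import requests

base = "http://localhost:8000"

# Readiness probe: 200 means the server and its models are ready to serve.
ready = requests.get(f"{base}/api/health/ready")
print("server ready:", ready.status_code == 200)

# Server status: loaded models, versions, and per-model readiness.
status = requests.get(f"{base}/api/status", params={"format": "json"})
print(status.json())
```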