Model serving using TRT Inference Server
NVIDIA TensorRT Inference Server is a REST and GRPC service for deep-learning inferencing of TensorRT, TensorFlow and Caffe2 models. The server is optimized to deploy machine learning and deep learning algorithms on both GPUs and CPUs at scale.
These instructions detail how to set up a GKE cluster suitable for
running the NVIDIA TensorRT Inference Server and how to use the
io.ksonnet.pkg.nvidia-inference-server prototype to generate
Kubernetes YAML and deploy to that cluster.
Please refer to the Google Kubernetes Engine Cluster for Kubeflow guide for setup instructions.
The Docker image for the NVIDIA TensorRT Inference Server is available on the NVIDIA GPU Cloud. Below you will add a Kubernetes secret that allows you to pull this image. First, register at NVIDIA GPU Cloud and follow the directions to obtain your API key. You can confirm that the key is correct by logging in to the registry and checking that you can pull the inference server image. See Pull an Image from a Private Registry for more information about using a private registry.
$ docker login nvcr.io
Username: $oauthtoken
Password: <your-api-key>
Now use the NVIDIA GPU Cloud API key from above to create a Kubernetes
secret called ngc. This secret allows Kubernetes to pull the
inference server image from the NVIDIA GPU Cloud registry. Replace
<api-key> with your API key and <ngc-email> with your NVIDIA GPU Cloud
email. Make sure that for docker-username you specify the value
exactly as shown, including the backslash.
$ kubectl create secret docker-registry ngc --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<api-key> --docker-email=<ngc-email>
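The backslash matters because the shell would otherwise treat $oauthtoken as a variable reference and expand it before kubectl ever sees it. The following is a small sketch of that behavior, assuming a POSIX shell in which no variable named oauthtoken is set; the names escaped and unescaped are illustrative only.

```shell
# Make sure no variable named oauthtoken exists in this shell.
unset oauthtoken

# With the backslash, the shell passes the literal text $oauthtoken through.
escaped=--docker-username=\$oauthtoken

# Without it, the shell expands the unset variable to an empty string.
unescaped=--docker-username=$oauthtoken

echo "$escaped"
echo "$unescaped"
```

The first line printed keeps the literal $oauthtoken that the registry expects as the username, while the second ends at the equals sign.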
The inference server needs a repository of models that it will make available for inferencing. You can find an example repository in the open-source repo and instructions on how to create your own model repository in the NVIDIA Inference Server User Guide.
For this example you will place the model repository in a Google Cloud Storage bucket.
$ gsutil mb gs://inference-server-model-store
Following these instructions, download the example model repository to your system and copy it into the GCS bucket.
$ gsutil cp -r model_store gs://inference-server-model-store
Next use ksonnet to generate Kubernetes configuration for the NVIDIA TensorRT Inference Server deployment and service. The --image option points to the NVIDIA Inference Server container in the NVIDIA GPU Cloud Registry. For the current implementation you must use the 18.08.1 container. The --modelRepositoryPath option points to the GCS bucket containing the model repository that you set up earlier.
$ ks init my-inference-server
$ cd my-inference-server
$ ks registry add kubeflow https://github.com/kubeflow/kubeflow/tree/master/kubeflow
$ ks pkg install kubeflow/nvidia-inference-server
$ ks generate nvidia-inference-server iscomp --name=inference-server --image=nvcr.io/nvidia/inferenceserver:18.08.1-py2 --modelRepositoryPath=gs://inference-server-model-store/tf_model_store
Next deploy the service.
$ ks apply default -c iscomp
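Before sending requests, it can help to confirm that the pods created by the deployment have started. The helper below is a minimal sketch, not part of the ksonnet package: the function name all_pods_running is hypothetical, and it assumes the default `kubectl get pods` column layout, in which STATUS is the third column, and that the inference server pods live in the current namespace.

```shell
# Succeeds only if at least one pod is listed and every listed pod
# reports STATUS "Running" (third column of `kubectl get pods`).
all_pods_running() {
  kubectl get pods --no-headers |
    awk '$3 != "Running" { bad = 1 } END { exit (bad || NR == 0) }'
}
```

A possible use is `all_pods_running && echo ready`, repeated until the image pull and model load have finished.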
Now that the inference server is running you can send HTTP or GRPC requests to it to perform inferencing. By default the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference service. In this case it is 184.108.40.206.
$ kubectl get services
NAME           TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                                        AGE
inference-se   LoadBalancer   10.7.241.36   184.108.40.206   8000:31220/TCP,8001:32107/TCP,8002:31682/TCP   1m
kubernetes     ClusterIP      10.7.240.1    <none>           443/TCP                                        1h
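If you want the external IP in a script rather than reading it off the table, one option is to filter the listing by service name. This is a sketch that assumes the default column order shown above, with EXTERNAL-IP as the fourth column; the function name external_ip is hypothetical.

```shell
# Read `kubectl get services` output on stdin and print the EXTERNAL-IP
# column (fourth field) for the service whose NAME matches $1.
external_ip() {
  awk -v svc="$1" '$1 == svc { print $4 }'
}
```

For example, `kubectl get services | external_ip inference-se` would print the address from the listing above. While the load balancer is still being provisioned, the column holds a placeholder instead of an address, so a script should check the result before using it.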
The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the status of the inference server from the HTTP endpoint.
$ curl 184.108.40.206:8000/api/status
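The server may need some time after deployment before it answers on the status endpoint. The following sketch polls /api/status on port 8000 until curl gets a successful response. The function name wait_for_status, the default of 30 attempts, and the 2 second delay are all assumptions chosen for illustration, not values taken from the inference server documentation.

```shell
# Poll the HTTP status endpoint until it answers or the attempts run out.
# $1 = host or IP, $2 = max attempts (default 30),
# $3 = seconds to wait between attempts (default 2)
wait_for_status() {
  host=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s silences progress output, -f makes HTTP errors return non-zero.
    if curl -sf -o /dev/null --max-time 5 "http://${host}:8000/api/status"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Calling `wait_for_status <external-ip>` returns 0 as soon as the endpoint responds and 1 if it never does within the allotted attempts.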
Follow the instructions to build the inference server example image and performance clients. You can then use these examples to send requests to the server. For example, for an image classification model use the image_client example to perform classification of an image.
$ image_client -u 184.108.40.206:8000 -m resnet50_netdef -c3 mug.jpg
Output probabilities:
batch 0: 504 (COFFEE MUG) = 0.777365267277
batch 0: 968 (CUP) = 0.213909029961
batch 0: 967 (ESPRESSO) = 0.00294389552437
When done, use ks to remove the deployment.
$ ks delete default -c iscomp
If you created a cluster for this example, make sure to delete it as well.
$ gcloud container clusters delete myinferenceserver