PyTorch Training (PyTorchJob)

Using PyTorchJob to train a model with PyTorch

This page describes PyTorchJob for training a machine learning model with PyTorch.

PyTorchJob is a Kubernetes custom resource to run PyTorch training jobs on Kubernetes. The Kubeflow implementation of PyTorchJob is in training-operator.

Note: PyTorchJob doesn’t work in a user namespace by default because of Istio automatic sidecar injection. To get it running, add the annotation sidecar.istio.io/inject: "false" to either the PyTorchJob pods or the namespace to disable sidecar injection. To view an example of how to add this annotation to your YAML file, see the TFJob documentation.
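
For reference, the annotation goes in the pod template metadata of each replica spec. Below is a minimal sketch; the job name, namespace, and container image are placeholders, and the full set of fields is shown in the monitoring example later on this page.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple            # placeholder job name
  namespace: your-user-namespace  # placeholder user namespace
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"  # disable Istio sidecar injection for this pod
        spec:
          containers:
          - name: pytorch
            image: your-training-image        # placeholder training image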

Installing PyTorch Operator

If you haven’t already done so, please follow the Getting Started Guide to deploy Kubeflow.

By default, the PyTorch operator is deployed as a controller within the training operator.

If you want to install a standalone version of the training operator without Kubeflow, see the kubeflow/training-operator’s README.
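
For reference only, the standalone installation is a single kustomize apply. This is a sketch; the exact overlay path and version pin may differ, so follow the README for the current command.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"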

Verify that PyTorchJob support is included in your Kubeflow deployment

Check that the PyTorchJob custom resource definition (CRD) is installed:

kubectl get crd

The output should include pytorchjobs.kubeflow.org like the following:

NAME                                             CREATED AT
...
pytorchjobs.kubeflow.org                         2021-09-06T18:33:58Z
...

Check that the training operator is running:

kubectl get pods -n kubeflow

The output should include a training-operator pod, similar to the following:

NAME                                READY   STATUS    RESTARTS   AGE
training-operator-d466b46bc-xbqvs   1/1     Running   0          4m37s

Creating a PyTorch training job

You can create a training job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
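
If you want to customize the manifest before deploying it, you can download and edit a local copy first. The following is a sketch using the same example manifest referenced below:

# Download the example manifest so you can edit replicas, image, resources, and so on
curl -LO https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

# After editing simple.yaml, apply your local copy instead of the remote one
kubectl apply -f simple.yaml -n kubeflow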

Deploy the PyTorchJob resource to start training:

kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

You should now be able to see the created pods matching the specified number of replicas.

kubectl get pods -l job-name=pytorch-simple -n kubeflow
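
The output should look similar to the following; the exact pod names and the number of worker pods depend on the replicas defined in the manifest:

NAME                      READY   STATUS    RESTARTS   AGE
pytorch-simple-master-0   1/1     Running   0          2m
pytorch-simple-worker-0   1/1     Running   0          2m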

Training takes 5-10 minutes on a CPU cluster. You can inspect the logs to see the training progress.

PODNAME=$(kubectl get pods -l job-name=pytorch-simple,replica-type=master,replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow

Monitoring a PyTorchJob

kubectl get -o yaml pytorchjobs pytorch-simple -n kubeflow

Check the status section to monitor the job status. Here is sample output when the job has completed successfully.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
    Worker: {}
  startTime: 2018-12-16T21:40:45Z
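
Because PyTorchJob reports standard conditions under status.conditions, you can also check for completion from the command line. The following is a minimal sketch, assuming the pytorch-simple job from the example above:

# Block until the Succeeded condition becomes true (or the timeout expires)
kubectl wait --for=condition=Succeeded pytorchjob/pytorch-simple -n kubeflow --timeout=600s

# Print the condition types recorded so far (Created, Running, Succeeded, ...)
kubectl get pytorchjob pytorch-simple -n kubeflow -o jsonpath='{.status.conditions[*].type}'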
