As mentioned in Inference Metrics, inference performance can be impacted by dynamic workload factors such as the number of streams and the inference model, so it is impractical to allocate a fixed amount of compute resources. You can use a horizontal scaling approach to improve this:
You can scale the inference engine deployment either with the kubectl command or via the Kubernetes dashboard. To scale via command:
kubectl scale --replicas=4 deploy/ei-infer-car-fp32-app
Or do it via the Kubernetes dashboard as follows:

As the result above shows, after scaling the replicas up from 1 to 4, the inference FPS increased about 4x and the drop FPS was reduced significantly.
Please refer to the official Kubernetes documentation for details on the HPA (Horizontal Pod Autoscaler).
To enable it, you need to install metrics-server.
(Note: Please install it via the script install-metric-server.sh or refer to the steps below.)
git clone https://github.com/kubernetes-sigs/metrics-server
kubectl apply -f metrics-server/deploy/kubernetes/
After 1~5 minutes, you should be able to get resource usage via the following command:
kubectl top pod
If you encounter an error on some Kubernetes versions such as v1.16.0, please apply the changes shown in the following diff:
diff --git a/deploy/kubernetes/metrics-server-deployment.yaml b/deploy/kubernetes/metrics-server-deployment.yaml
index e4bfeaf..87ca94f 100644
--- a/deploy/kubernetes/metrics-server-deployment.yaml
+++ b/deploy/kubernetes/metrics-server-deployment.yaml
@@ -33,6 +33,8 @@ spec:
         args:
         - --cert-dir=/tmp
         - --secure-port=4443
+        - --kubelet-insecure-tls=true
+        - --kubelet-preferred-address-types=InternalIP
         ports:
         - name: main-port
           containerPort: 4443
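After applying the diff, the args section of the metrics-server container in metrics-server-deployment.yaml should look like this (surrounding fields elided):

```yaml
args:
- --cert-dir=/tmp
- --secure-port=4443
- --kubelet-insecure-tls=true
- --kubelet-preferred-address-types=InternalIP
```

Here --kubelet-insecure-tls=true skips kubelet certificate verification and --kubelet-preferred-address-types=InternalIP makes metrics-server reach kubelets by node IP; both work around certificate and DNS issues common in test clusters.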
Autoscaling needs to be based on a specific metric, such as CPU usage, as shown below:
kubectl autoscale deployment ei-infer-car-fp32-app --cpu-percent=80 --min=1 --max=4
or
cd cloud-native-demo/elastic_inference/kubernetes/monitoring
kubectl apply -f hpa-infer-car-fp32-on-metric-cpu.yaml
After a few minutes (4~10 minutes), the replicas of ei-infer-car-fp32-app will be scaled up, improving the inference FPS and reducing the drop FPS.
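For reference, an autoscaling/v1 HPA manifest equivalent to the kubectl autoscale command above might look like the following sketch. The field values are taken from that command; this is an assumption about the contents of hpa-infer-car-fp32-on-metric-cpu.yaml, not a verbatim copy:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ei-infer-car-fp32-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ei-infer-car-fp32-app
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
```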

(Note: This is an example; please change the deployment name from ei-infer-car-fp32-app to another one if needed.)
(Note: The autoscaling API version v1 in hpa-infer-car-fp32-on-metric-cpu.yaml was tested on Kubernetes v1.16.0. Please use kubectl api-versions to check your cluster's autoscaling API version and change it accordingly.)
The CPU metric might not reflect actual inference performance, so please refer to HPA based on custom inference metrics for the advanced HPA feature.
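As a preview, an HPA driven by a custom per-pod metric might look like the following autoscaling/v2beta1 sketch. The metric name infer_fps and the target value are hypothetical, and exposing such a metric requires a custom metrics adapter (e.g. prometheus-adapter); see the custom-metrics section for the actual setup:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: ei-infer-car-fp32-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ei-infer-car-fp32-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metricName: infer_fps        # hypothetical custom metric
      targetAverageValue: "25"     # hypothetical per-pod target FPS
```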