1. Setting Up!

1.1 Install KFServing

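KFServing is installed together with Kubeflow, so strictly speaking there is nothing more to do.
But, as usual, Kubeflow is slow to upgrade its bundled version, so for people like me who want to use the latest version.. <omitted>

Refer to the version list and install the KFServing version you want.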

$ wget https://raw.githubusercontent.com/kubeflow/kfserving/master/install/v0.4.1/kfserving.yaml
$ kubectl apply -f kfserving.yaml
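
If you switch versions often, the download URL can be built from the version string. This is only a sketch: kfserving_manifest_url is a hypothetical helper, and it assumes other releases follow the same install/<version>/ layout as the v0.4.1 URL above, which I have not checked for every release.

```python
# Sketch: build the raw manifest URL for a given KFServing release.
# Assumption: every release lives under install/<version>/kfserving.yaml,
# as the v0.4.1 URL in this post does.
BASE = "https://raw.githubusercontent.com/kubeflow/kfserving/master/install"


def kfserving_manifest_url(version: str) -> str:
    """Return the kfserving.yaml URL for a version such as 'v0.4.1' or '0.4.1'."""
    if not version.startswith("v"):
        version = "v" + version  # accept both '0.4.1' and 'v0.4.1'
    return f"{BASE}/{version}/kfserving.yaml"


if __name__ == "__main__":
    print(kfserving_manifest_url("0.4.1"))
```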

1.2 Install Local Gateway

Knative 0.11.2 or later must already be installed.
cluster-local-gateway is required; in particular, this gateway is needed whenever you use a transformer or an explainer.
In any case, just install it.
For details, see Installing Istio for Knative.

The YAML is below.

cat << EOF > ./local-cluster-gateway.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: enabled
      useMCP: false

  addonComponents:
    pilot:
      enabled: true
    prometheus:
      enabled: false

  components:
    ingressGateways:
      - name: cluster-local-gateway
        enabled: true
        label:
          istio: cluster-local-gateway
          app: cluster-local-gateway
        k8s:
          service:
            type: ClusterIP
            ports:
              - name: http
                protocol: TCP
                port: 80
                targetPort: 8080
              - name: https
                protocol: TCP
                port: 443
                targetPort: 8443
              - name: status-port
                port: 15021
                targetPort: 15021
              - name: tls
                port: 15443
                targetPort: 15443
EOF
$ istioctl manifest generate -f local-cluster-gateway.yaml > manifest.yaml
$ kubectl apply -f manifest.yaml
$ kubectl get svc cluster-local-gateway -n istio-system 
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                              AGE
cluster-local-gateway   ClusterIP   10.100.243.162   <none>        80/TCP,443/TCP,15021/TCP,15443/TCP   69s
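
To double-check that the gateway exposes every port declared in the YAML, the PORT(S) column can be compared against the expected set. A minimal sketch; parse_ports and missing_ports are names made up for this example, and the input is simply the text printed by kubectl get svc.

```python
# Ports declared under components.ingressGateways in local-cluster-gateway.yaml.
EXPECTED = {80, 443, 15021, 15443}


def parse_ports(ports_column: str) -> set:
    """Parse a PORT(S) value such as '80/TCP,443/TCP' into a set of port numbers."""
    ports = set()
    for entry in ports_column.split(","):
        number = entry.strip().split("/")[0]  # drop the protocol suffix (/TCP)
        number = number.split(":")[0]         # drop a NodePort mapping like 80:32051
        ports.add(int(number))
    return ports


def missing_ports(ports_column: str) -> set:
    """Return the expected ports that do not appear in the PORT(S) column."""
    return EXPECTED - parse_ports(ports_column)


if __name__ == "__main__":
    print(missing_ports("80/TCP,443/TCP,15021/TCP,15443/TCP"))
```

An empty result means all four ports from the YAML are present on the service.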

1.3 Set Up the Serving Namespace

Kubeflow already ships with KFServing installed. Create a namespace for serving and label it as follows.

$ kubectl create namespace kfserving
$ kubectl label namespace kfserving istio-injection=enabled
$ kubectl label namespace kfserving serving.kubeflow.org/inferenceservice=enabled
$ kubectl get ns kfserving -o json | jq .metadata.labels
{
  "istio-injection": "enabled",
  "serving.kubeflow.org/inferenceservice": "enabled"
}
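
The same label check can be scripted. This sketch takes the JSON printed by the jq command above and reports any label that is missing or has a different value; missing_labels is a hypothetical helper name.

```python
import json

# The two labels applied to the kfserving namespace above.
REQUIRED_LABELS = {
    "istio-injection": "enabled",
    "serving.kubeflow.org/inferenceservice": "enabled",
}


def missing_labels(labels_json: str) -> dict:
    """Return the required labels that are absent or set to a different value."""
    labels = json.loads(labels_json)
    return {k: v for k, v in REQUIRED_LABELS.items() if labels.get(k) != v}


if __name__ == "__main__":
    sample = '{"istio-injection": "enabled"}'
    print(missing_labels(sample))
```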

Check that the KFServing controller is installed.

# When installed with Kubeflow
$ kubectl get po -n kubeflow | grep kfserving-controller-manager
kfserving-controller-manager-0        2/2     Running   0          3h55m

# When KFServing is installed standalone
$ kubectl get po -n kfserving-system | grep kfserving-controller-manager
kfserving-controller-manager-0        2/2     Running   0          56s

2. Getting Started

2.1 Iris Tutorial

2.1.1 Model Deployment

cat <<EOF > sklearn.yaml 
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  default:
    predictor:
      sklearn:
        storageUri: "gs://kfserving-samples/models/sklearn/iris"
EOF

Create the input data

cat <<EOF > iris-input.json
{
  "instances": [
    [5.0,  3.4,  1.5,  0.2],
    [6.7,  3.1,  4.4,  1.4],
    [6.1,  3.0,  4.9,  1.8]
  ]
}
EOF
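
If the samples come from code rather than a hand-written file, the same body can be generated. A minimal sketch: build_payload is a hypothetical helper, and the check for 4 values per row reflects the iris rows shown above.

```python
import json


def build_payload(rows):
    """Wrap feature rows in the {"instances": [...]} body used by iris-input.json."""
    for row in rows:
        if len(row) != 4:  # each iris sample above has four feature values
            raise ValueError(f"expected 4 features, got {len(row)}: {row}")
    return json.dumps({"instances": rows})


if __name__ == "__main__":
    rows = [[5.0, 3.4, 1.5, 0.2], [6.7, 3.1, 4.4, 1.4], [6.1, 3.0, 4.9, 1.8]]
    with open("iris-input.json", "w") as f:
        f.write(build_payload(rows))
```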

Deploy

# Deploy the sklearn InferenceService.
$ kubectl apply -f sklearn.yaml -n kfserving

# Check the service URL. (It takes about 20-30 seconds for the URL to appear.)
$ kubectl get inferenceservices sklearn-iris -n kfserving
NAME           URL                                                                READY   DEFAULT TRAFFIC
sklearn-iris   http://sklearn-iris.kfserving.example.com/v1/models/sklearn-iris   True    100            

On AWS EKS the ingress is not exposed externally by default, so you have to set up a LoadBalancer separately. Really inconvenient.
(On GCP, configuring the ingress provisions a load balancer automatically.)

$ kubectl get svc istio-ingressgateway -n istio-system
NAME                   TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        
istio-ingressgateway   NodePort   10.100.200.170   <none>        15020:30234/TCP <생략>

All you need to do is change NodePort to LoadBalancer.

$ kubectl edit svc istio-ingressgateway  -n istio-system

Change type: NodePort to type: LoadBalancer.
Check the service again afterward and EXTERNAL-IP will be populated.

$ kubectl --namespace istio-system get service istio-ingressgateway
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                        PORT(S)                    
istio-ingressgateway   LoadBalancer   10.100.200.170   ****.us-east-2.elb.amazonaws.com   80:32051/TCP,443:31101/TCP
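
The interactive edit can also be done non-interactively with kubectl patch. The sketch below only builds the argument list; loadbalancer_patch_command is a name made up for this example, and running the command is left to you (for example with subprocess.run).

```python
import json


def loadbalancer_patch_command(service="istio-ingressgateway", namespace="istio-system"):
    """Build a kubectl patch command that sets the Service type to LoadBalancer."""
    patch = json.dumps({"spec": {"type": "LoadBalancer"}})
    return ["kubectl", "patch", "svc", service, "-n", namespace, "-p", patch]


if __name__ == "__main__":
    # Print the command instead of executing it, so nothing changes by accident.
    print(" ".join(loadbalancer_patch_command()))
```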

2.1.2 Inference from External Source

  • INGRESS_HOST: Load Balancer Hostname (ex. a0e3184ae3-1490218815.us-east-2.elb.amazonaws.com)
  • INGRESS_PORT: Load Balancer Port (ex. 80)
  • SERVICE_HOSTNAME: the address where the model is being served (ex. sklearn-iris.kfserving.example.com)

Set environment variables

$ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
$ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
$ SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kfserving -o jsonpath='{.status.url}' | cut -d "/" -f 3)

Inference

$ curl -v -H  "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict -d @./iris-input.json
{"predictions": [0, 1, 2]}
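
The key point of the curl call is that the request goes to the load balancer while the Host header carries the service hostname. The sketch below builds the same request in Python without sending it; service_hostname and build_predict_request are hypothetical helper names, and lb.example in the usage is a placeholder, not a real address.

```python
import urllib.request
from urllib.parse import urlsplit


def service_hostname(status_url: str) -> str:
    """Host part of the status URL, the same value `cut -d "/" -f 3` extracts."""
    return urlsplit(status_url).netloc


def build_predict_request(ingress_host, ingress_port, status_url, model, body: bytes):
    """Build (but do not send) a POST to the ingress with the Host header overridden."""
    url = f"http://{ingress_host}:{ingress_port}/v1/models/{model}:predict"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Host": service_hostname(status_url)},
    )


if __name__ == "__main__":
    req = build_predict_request(
        "lb.example",  # placeholder for INGRESS_HOST
        80,
        "http://sklearn-iris.kfserving.example.com/v1/models/sklearn-iris",
        "sklearn-iris",
        b'{"instances": [[5.0, 3.4, 1.5, 0.2]]}',
    )
    print(req.full_url, req.get_header("Host"))
```

Passing the request to urllib.request.urlopen would send it; that step is left out here because it needs a reachable cluster.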

Postman

  • URL: {INGRESS_HOST}:{INGRESS_PORT}/v1/models/sklearn-iris:predict
  • Add Headers
    • Host: {SERVICE_HOSTNAME}
  • Add the data to the Body in JSON format

2.1.3 Inference from Local Cluster Gateway

Traffic inside the cluster does not need to go through the external load balancer as above.
Using cluster-internal communication lets you exchange data faster.

First, find the URL to use from inside the cluster.

$ kubectl get inferenceService -n kfserving sklearn-iris -o jsonpath='{.status.address.url}' 
http://sklearn-iris.kfserving.svc.cluster.local/v1/models/sklearn-iris:predict

Now open a shell in a container.
The following is an example; the "sklearn-iris-predictor-default-***" part is the pod name.

$ kubectl exec -it sklearn-iris-predictor-default-*** -n kfserving -c kfserving-container -- /bin/bash
$ curl -i http://sklearn-iris.kfserving.svc.cluster.local/v1/models/sklearn-iris:predict -d @./iris-input.json
{"predictions": [0, 1, 2]}
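
Judging from the URL printed above, the in-cluster address follows a name.namespace.svc.cluster.local pattern. The sketch below reproduces that pattern; it is inferred from this one output rather than taken from the KFServing documentation, and cluster_local_predict_url is a hypothetical helper.

```python
def cluster_local_predict_url(name: str, namespace: str) -> str:
    """Compose the in-cluster predict URL, following the pattern seen in status.address.url."""
    return f"http://{name}.{namespace}.svc.cluster.local/v1/models/{name}:predict"


if __name__ == "__main__":
    print(cluster_local_predict_url("sklearn-iris", "kfserving"))
```

When in doubt, prefer the value reported by kubectl over a composed one.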

2.1.4 Performance Test

Put the following into perf.yaml (for example with vi perf.yaml).
You need to edit the URL on the POST line so it points at your own service.

cat <<EOF > perf.yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: load-test
spec:
  backoffLimit: 6
  parallelism: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: OnFailure
      containers:
      - args:
        - vegeta -cpus=5 attack -duration=1m -rate=500/1s -targets=/var/vegeta/cfg
          | vegeta report -type=text
        command:
        - sh 
        - -c
        image: peterevans/vegeta:latest
        imagePullPolicy: Always
        name: vegeta
        volumeMounts:
        - mountPath: /var/vegeta
          name: vegeta-cfg
      volumes:
      - configMap:
          defaultMode: 420
          name: vegeta-cfg
        name: vegeta-cfg
---
apiVersion: v1
data:
  cfg: |
    POST http://sklearn-iris.kfserving.ai.platform/v1/models/sklearn-iris:predict
    @/var/vegeta/payload
  payload: |
    {
      "instances": [
        [5.0,  3.4,  1.5,  0.2],
        [6.7,  3.1,  4.4,  1.4],
        [6.1,  3.0,  4.9,  1.8]
      ]
    }
kind: ConfigMap
metadata:
  annotations:
  name: vegeta-cfg
EOF
$ kubectl create -f perf.yaml -n kfserving 
job.batch/load-testpk9r2 created
configmap/vegeta-cfg created
$ kubectl logs load-testpk9r2-wmknb -n kfserving 
Requests      [total, rate, throughput]         30000, 500.02, 0.00
Duration      [total, attack, wait]             1m0s, 59.998s, 5.806ms
Latencies     [min, mean, 50, 90, 95, 99, max]  49.359µs, 3.64ms, 3.202ms, 5.691ms, 7.267ms, 12.156ms, 102.735ms
Bytes In      [total, mean]                     0, 0.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           0.00%
Status Codes  [code:count]                      0:30000  
Error Set:
Post "http://sklearn-iris.kfserving.ai.platform/v1/models/sklearn-iris:predict": dial tcp: lookup sklearn-iris.kfserving.ai.platform on 10.100.0.10:53: no such host
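
Note that in the run above every request ended with status code 0 and a "no such host" error, so the load never reached the model. A small parser makes this kind of failure easy to spot in a script. This is a sketch that only handles the text report layout shown above; parse_vegeta_report is a hypothetical helper.

```python
import re


def parse_vegeta_report(report: str) -> dict:
    """Extract the success ratio and status code counts from a vegeta text report."""
    success = re.search(r"^Success\s+\[ratio\]\s+([\d.]+)%", report, re.MULTILINE)
    codes = re.search(r"^Status Codes\s+\[code:count\]\s+(.*)$", report, re.MULTILINE)
    status = {}
    if codes:
        for pair in codes.group(1).split():  # entries look like 200:1234
            code, count = pair.split(":")
            status[int(code)] = int(count)
    return {
        "success_ratio": float(success.group(1)) if success else None,
        "status_codes": status,
    }


if __name__ == "__main__":
    sample = (
        "Success       [ratio]                           0.00%\n"
        "Status Codes  [code:count]                      0:30000  \n"
    )
    print(parse_vegeta_report(sample))
```

A success ratio of 0 together with only code 0 entries suggests checking the URL on the POST line before reading anything into the latency numbers.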

2.2 InferenceService with Custom Image

Flask App

cat <<EOF > app.py
import os
from flask import Flask

app = Flask(__name__)

@app.route('/v1/models/custom-image:predict')
def hello_world():
    greeting_target = os.environ.get('GREETING_TARGET', 'World')
    return 'Hello {}!\n'.format(greeting_target)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
EOF

requirements.txt

cat <<EOF > requirements.txt
Flask==1.1.1
gunicorn==20.0.4
EOF

Dockerfile

cat <<EOF > Dockerfile
FROM python:3.7-slim

ENV APP_HOME=/app
WORKDIR \$APP_HOME
COPY app.py requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
CMD exec gunicorn --bind :\$PORT --workers 1 --threads 8 app:app
EOF

Docker Build

  • Replace andersonjo below with your own Docker Hub ID.
$ docker login
$ docker build -t andersonjo/custom-image .
$ docker run -d --name custom-test -p 8080:8080 -e PORT=8080 andersonjo/custom-image
$ docker push andersonjo/custom-image

Custom YAML

cat <<EOF > custom.yaml 
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: custom-image
spec:
  default:
    predictor:
      custom:
        container:
          name: custom
          image: andersonjo/custom-image
          env:
            - name: GREETING_TARGET
              value: "Python KFServing Sample"
EOF

Deployment

$ kubectl apply -f custom.yaml -n kfserving
$ kubectl get inferenceservices -n kfserving
NAME           URL                                                                READY
custom-image   http://custom-image.kfserving.example.com/v1/models/custom-image   True 

Inference

$ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
$ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
$ SERVICE_HOSTNAME=$(kubectl get inferenceservice custom-image -n kfserving -o jsonpath='{.status.url}' | cut -d "/" -f 3)
$ curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/custom-image:predict
Hello Python KFServing Sample!

2.3 Autoscale InferenceService

2.3.1 Create Inference Service

Create the autoscale.yaml inference service.

cat <<EOF > autoscale.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers" 
EOF
cat <<EOF > input.json
{  
    "instances":[  
       {  
          "image_bytes":{  
             "b64":""
          },
          "key":"   1"
       }
    ]
}
EOF
$ kubectl apply -f autoscale.yaml -n kfserving
$ kubectl get inferenceservices flowers-sample -n kfserving
NAME             URL                                                                    READY   DEFAULT
flowers-sample   http://flowers-sample.kfserving.ai.platform/v1/models/flowers-sample   True    100    

2.3.2 Stress Test

You can run a light stress test with the hey command.

  • -z: duration; 10s means 10 seconds, 3m means 3 minutes
  • -c: number of concurrent requests, not the total number of requests
  • -m: HTTP method. POST, GET, PUT, DELETE, HEAD, OPTIONS ..
$ MODEL_NAME=flowers-sample
$ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
$ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
$ SERVICE_HOSTNAME=$(kubectl get inferenceservice flowers-sample -n kfserving -o jsonpath='{.status.url}' | cut -d "/" -f 3)
$ hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D ./input.json http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict

Results

Summary:
  Total:	30.1505 secs
  Slowest:	0.4009 secs
  Fastest:	0.1899 secs
  Average:	0.1944 secs
  Requests/sec:	25.6712
  
  Total data:	133128 bytes
  Size/request:	172 bytes

Response time histogram:
  0.190 [1]	|
  0.211 [768]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.232 [0]	|
  0.253 [0]	|
  0.274 [0]	|
  0.295 [0]	|
  0.316 [0]	|
  0.338 [0]	|
  0.359 [0]	|
  0.380 [0]	|
  0.401 [5]	|


Latency distribution:
  10% in 0.1911 secs
  25% in 0.1916 secs
  50% in 0.1927 secs
  75% in 0.1944 secs
  90% in 0.1955 secs
  95% in 0.1963 secs
  99% in 0.2001 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0013 secs, 0.1899 secs, 0.4009 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0070 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:	0.1929 secs, 0.1897 secs, 0.2022 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0002 secs

Status code distribution:
  [404]	774 responses
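
All 774 responses in this run were 404, so the numbers above describe rejected requests, not successful predictions. The summary figures can be cross-checked from the raw counts; the sketch below does that arithmetic, with summarize being a hypothetical helper and the inputs copied from the output above.

```python
def summarize(total_secs, responses_by_status, total_bytes):
    """Recompute throughput, bytes per request and success ratio from raw counts."""
    total = sum(responses_by_status.values())
    ok = sum(n for code, n in responses_by_status.items() if 200 <= code < 300)
    return {
        "requests_per_sec": total / total_secs,
        "bytes_per_request": total_bytes / total,
        "success_ratio": ok / total,
    }


if __name__ == "__main__":
    # Values copied from the hey output: 30.1505 secs, 774 responses, 133128 bytes.
    print(summarize(30.1505, {404: 774}, 133128))
```

The recomputed values agree with the reported 25.6712 requests/sec and 172 bytes per request, and the success ratio comes out as 0.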