MLServer

MLServer aims to provide an easy way to start serving your machine learning models through REST and gRPC interfaces, fully compliant with KFServing's V2 Dataplane spec. The list of cool features includes:

  • Adaptive batching, to group inference requests together on the fly.
  • Parallel Inference Serving, for vertical scaling across multiple models through a pool of inference workers.
  • Multi-model serving, to run multiple models within the same process.
  • Support for the standard V2 Inference Protocol on both the gRPC and REST flavours.
  • Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe, where MLServer is the core Python inference server used to serve machine learning models.

Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. MLServer also provides out-of-the-box inference runtimes for many frameworks, such as the following (a sample runtime configuration is sketched after the list):

  1. Scikit-Learn
  2. XGBoost
  3. Spark MLlib
  4. LightGBM
  5. Tempo
  6. MLflow
  7. Writing custom runtimes
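
As a quick illustration of how a runtime gets wired in: for an out-of-the-box runtime such as scikit-learn, you usually just point a model-settings.json at the runtime's implementation class. The model name and artifact path below are placeholder assumptions.

{
    "name": "my-sklearn-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {
        "uri": "./model.joblib"
    }
}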

In this exercise, we will deploy a sentiment analysis Hugging Face transformer model. Since MLServer does not provide out-of-the-box support for PyTorch or Transformers models, we will write a custom inference runtime to deploy this model.

pip install mlserver
# to install out-of-the-box framework runtimes
pip install mlserver-sklearn # or any of the frameworks supported above

Custom Inference Runtime

It's very easy to extend MLServer to any framework beyond the supported ones by writing a custom inference runtime. To add support for our framework, we extend the mlserver.MLModel abstract class and override two main methods:

  • load(self) -> bool: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
  • predict(self, payload: InferenceRequest) -> InferenceResponse: Responsible for using a model to perform inference on an incoming data point.
import json
from collections import defaultdict

import numpy as np
import torch
import torch.nn.functional as F
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

from mlserver import MLModel, types
from mlserver.codecs import StringCodec
from mlserver.utils import get_model_uri


class SentimentModel(MLModel):
    """
    Implementation of the MLModel interface to load and serve custom Hugging Face transformer models.
    """

    # load the model and tokenizer from the configured model URI
    async def load(self) -> bool:

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        model_uri = await get_model_uri(self._settings)

        self.model_name = model_uri
        self.model = DistilBertForSequenceClassification.from_pretrained(
            self.model_name
        )
        self.model.eval()
        self.model.to(self.device)
        self.tokenizer = DistilBertTokenizer.from_pretrained(self.model_name)

        self.ready = True
        return self.ready

    # decode the request, run inference and wrap the probabilities in a V2 response
    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        input_id, attention_mask = self._preprocess_inputs(payload)
        prediction = self._model_predict(input_id, attention_mask)

        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                types.ResponseOutput(
                    name="predictions",
                    shape=prediction.shape,
                    datatype="FP32",
                    data=np.asarray(prediction).tolist(),
                )
            ],
        )

    # decode the string payload and tokenize it for the model
    def _preprocess_inputs(self, payload: types.InferenceRequest):
        inp_text = defaultdict()
        for inp in payload.inputs:
            inp_text[inp.name] = json.loads(
                "".join(self.decode(inp, default_codec=StringCodec))
            )
        inputs = self.tokenizer(inp_text["text"], return_tensors="pt")
        input_id = inputs["input_ids"].to(self.device)
        attention_mask = inputs["attention_mask"].to(self.device)
        return input_id, attention_mask

    # run inference and return class probabilities
    def _model_predict(self, input_id, attention_mask):
        with torch.no_grad():
            outputs = self.model(input_id, attention_mask)
            probs = F.softmax(outputs.logits, dim=1).cpu().numpy()[0]
        return probs

Settings files

The next step is to create two configuration files (both are sketched below):

  • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
  • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
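
For illustration, the two files might look roughly like the sketch below. The ports match the docker run command later in this README; the model name, the serve.SentimentModel module path and the model URI are assumptions that depend on your file layout and on where get_models.sh places the model artifacts.

settings.json:

{
    "debug": true,
    "http_port": 8080,
    "grpc_port": 8081
}

model-settings.json:

{
    "name": "sentiment-model",
    "implementation": "serve.SentimentModel",
    "parameters": {
        "uri": "./model"
    }
}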

Run

Locally

Test the sentiment classifier model

docker build -t sentiment -f sentiment/Dockerfile.sentiment sentiment/
docker run --rm -it sentiment

Test MLServer locally

# download trained models
bash get_models.sh
# create a docker image
mlserver build . -t 'sentiment-app:1.0.0'
docker run -it --rm -p 8080:8080 -p 8081:8081 sentiment-app:1.0.0

In a separate terminal,

# test inference request (REST)
python3 test_local_http_endpoint.py
# test inference request (gRPC)
python3 test_local_grpc_endpoint.py
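
For reference, a REST request against the V2 inference endpoint might look roughly like the sketch below. The model name (sentiment-model) is an assumption that must match model-settings.json, and the input is sent as a JSON-encoded string because SentimentModel._preprocess_inputs calls json.loads on the decoded payload.

import json

import requests

# hypothetical V2 inference request; the input name "text" is what
# SentimentModel._preprocess_inputs looks up after decoding
inference_request = {
    "inputs": [
        {
            "name": "text",
            "shape": [1],
            "datatype": "BYTES",
            "data": [json.dumps("MLServer makes serving models straightforward!")],
        }
    ]
}

# assumed model name "sentiment-model"; adjust to match model-settings.json
response = requests.post(
    "http://localhost:8080/v2/models/sentiment-model/infer",
    json=inference_request,
)
print(response.json())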

Additional Exercise
