MLServer

MLServer aims to provide an easy way to start serving your machine learning models through REST and gRPC interfaces, fully compliant with KFServing's V2 Dataplane spec. The list of cool features includes:

  • Adaptive batching, to group inference requests together on the fly.
  • Parallel Inference Serving, for vertical scaling across multiple models through a pool of inference workers.
  • Multi-model serving, to run multiple models within the same process.
  • Support for the standard V2 Inference Protocol on both the gRPC and REST flavours.
  • Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe, where MLServer is the core Python inference server used to serve machine learning models.

Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. MLServer also provides out-of-the-box inference runtimes for many frameworks, such as the following (a sample runtime configuration is sketched after the list):

  1. Scikit-Learn
  2. XGBoost
  3. Spark MLlib
  4. LightGBM
  5. Tempo
  6. MLflow
  7. Writing custom runtimes
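
As a quick illustration of how a runtime gets wired in: for an out-of-the-box runtime such as scikit-learn, you usually just point a model-settings.json at the runtime's implementation class. The model name and artifact path below are placeholder assumptions.

{
    "name": "my-sklearn-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {
        "uri": "./model.joblib"
    }
}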

In this exercise, we will deploy a sentiment analysis Hugging Face transformer model. Since MLServer does not provide out-of-the-box support for PyTorch or Transformers models, we will write a custom inference runtime to deploy this model.

pip install mlserver
# to install out-of-the-box framework runtimes
pip install mlserver-sklearn # or any of the frameworks supported above

Custom Inference Runtime

It's very easy to extend MLServer to any framework beyond the supported ones by writing a custom inference runtime. To add support for our framework, we extend the mlserver.MLModel abstract class and override two main methods:

  • load(self) -> bool: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
  • predict(self, payload: InferenceRequest) -> InferenceResponse: Responsible for using a model to perform inference on an incoming data point.
import json
from collections import defaultdict

import numpy as np
import torch
import torch.nn.functional as F
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

from mlserver import MLModel, types
from mlserver.codecs import StringCodec
from mlserver.utils import get_model_uri


class SentimentModel(MLModel):
    """
    Implementation of the MLModel interface to load and serve custom Hugging Face transformer models.
    """

    # load the model and tokenizer from the configured model URI
    async def load(self) -> bool:

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        model_uri = await get_model_uri(self._settings)

        self.model_name = model_uri
        self.model = DistilBertForSequenceClassification.from_pretrained(
            self.model_name
        )
        self.model.eval()
        self.model.to(self.device)
        self.tokenizer = DistilBertTokenizer.from_pretrained(self.model_name)

        self.ready = True
        return self.ready

    # decode the request, run inference and wrap the probabilities in a V2 response
    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        input_id, attention_mask = self._preprocess_inputs(payload)
        prediction = self._model_predict(input_id, attention_mask)

        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                types.ResponseOutput(
                    name="predictions",
                    shape=prediction.shape,
                    datatype="FP32",
                    data=np.asarray(prediction).tolist(),
                )
            ],
        )

    # decode the string payload and tokenize it for the model
    def _preprocess_inputs(self, payload: types.InferenceRequest):
        inp_text = defaultdict()
        for inp in payload.inputs:
            inp_text[inp.name] = json.loads(
                "".join(self.decode(inp, default_codec=StringCodec))
            )
        inputs = self.tokenizer(inp_text["text"], return_tensors="pt")
        input_id = inputs["input_ids"].to(self.device)
        attention_mask = inputs["attention_mask"].to(self.device)
        return input_id, attention_mask

    # run inference and return class probabilities
    def _model_predict(self, input_id, attention_mask):
        with torch.no_grad():
            outputs = self.model(input_id, attention_mask)
            probs = F.softmax(outputs.logits, dim=1).cpu().numpy()[0]
        return probs

Settings files

The next step is to create two configuration files (both are sketched below):

  • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
  • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
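
For illustration, the two files might look roughly like the sketch below. The ports match the docker run command later in this README; the model name, the serve.SentimentModel module path and the model URI are assumptions that depend on your file layout and on where get_models.sh places the model artifacts.

settings.json:

{
    "debug": true,
    "http_port": 8080,
    "grpc_port": 8081
}

model-settings.json:

{
    "name": "sentiment-model",
    "implementation": "serve.SentimentModel",
    "parameters": {
        "uri": "./model"
    }
}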

Run

Locally

Test the sentiment classifier model

docker build -t sentiment -f sentiment/Dockerfile.sentiment sentiment/
docker run --rm -it sentiment

Test MLServer locally

# download trained models
bash get_models.sh
# create a docker image
mlserver build . -t 'sentiment-app:1.0.0'
docker run -it --rm -p 8080:8080 -p 8081:8081 sentiment-app:1.0.0

In a separate terminal,

# test inference request (REST)
python3 test_local_http_endpoint.py
# test inference request (gRPC)
python3 test_local_grpc_endpoint.py
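
For reference, a REST request against the V2 inference endpoint might look roughly like the sketch below. The model name (sentiment-model) is an assumption that must match model-settings.json, and the input is sent as a JSON-encoded string because SentimentModel._preprocess_inputs calls json.loads on the decoded payload.

import json

import requests

# hypothetical V2 inference request; the input name "text" is what
# SentimentModel._preprocess_inputs looks up after decoding
inference_request = {
    "inputs": [
        {
            "name": "text",
            "shape": [1],
            "datatype": "BYTES",
            "data": [json.dumps("MLServer makes serving models straightforward!")],
        }
    ]
}

# assumed model name "sentiment-model"; adjust to match model-settings.json
response = requests.post(
    "http://localhost:8080/v2/models/sentiment-model/infer",
    json=inference_request,
)
print(response.json())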

Additional Exercise
