You shouldn’t do it like this.
Model prediction is about as heavy as backend work gets. If you run it synchronously, you delegate request queueing to the operating system or to the front HTTP server (e.g. nginx), and under load most of that queue will simply be dropped.
To make this production-ready you need to:
- Run a background worker that listens on a request queue.
- On each incoming request, the front server puts the parameters and data onto the request queue.
- The worker picks up the task, runs the prediction, and saves the result to a result queue (or result store).
- The server either waits until the client polls for the result ("ping" style), or pushes the result to the client immediately over a realtime connection such as a WebSocket.
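The steps above can be sketched with Python's standard-library `queue` and a worker thread. This is only a minimal illustration of the pattern: in production the queue would be an external broker (Redis, RabbitMQ, Celery, etc.), the worker a separate process, and `predict` your real model; all names here are hypothetical:

```python
import queue
import threading
import uuid

task_queue = queue.Queue()  # stand-in for an external broker (Redis, RabbitMQ, ...)
results = {}                # stand-in for a result store

def predict(data):
    # placeholder for the actual model inference
    return sum(data)

def worker():
    # background worker: pull a task, run the prediction, store the result
    while True:
        task_id, data = task_queue.get()
        results[task_id] = predict(data)
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(data):
    # what the front server does on a request: enqueue and return a task id
    task_id = str(uuid.uuid4())
    task_queue.put((task_id, data))
    return task_id

def poll(task_id):
    # "ping" style: the client keeps asking until the result is ready
    return results.get(task_id)  # None means "not ready yet"
```

The client gets the `task_id` back immediately and polls `poll(task_id)` (or, with WebSockets, the server pushes the result as soon as the worker stores it).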
This way you can observe the load on your queue, add more workers when it grows, and every request gets served instead of dropped.
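Scaling on queue depth can be sketched the same way: watch the backlog and start another worker when it passes a threshold. Again a hypothetical in-process sketch; with a real broker you would read the queue length from it and launch worker processes or containers instead of threads:

```python
import queue
import threading
import time

task_queue = queue.Queue()
MAX_WORKERS = 4
workers = []

def worker():
    # drain the queue forever; the sleep stands in for model inference
    while True:
        task_queue.get()
        time.sleep(0.01)
        task_queue.task_done()

def scale_if_needed(threshold=5):
    # qsize() is only approximate, but it is good enough as a scaling signal
    if task_queue.qsize() > threshold and len(workers) < MAX_WORKERS:
        t = threading.Thread(target=worker, daemon=True)
        t.start()
        workers.append(t)
```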