Virtual Appliance Scaling
Transcription: Batch, Real-Time. Deployments: Virtual Appliance
Multi Threaded Workers
Batch Mode
Transcription: Batch. Deployments: Virtual Appliance
A job worker can be assigned a single thread or several, depending on the `scaling_mode` setting in the API. `scaling_mode` takes one of two values: `simple`, where each transcription job runs in a single thread, or `adaptive`, where the number of threads depends on the length of the audio being transcribed.
Depending on the scaling mode and the transcription features a job requests, workers reserve a specific amount of CPU and memory. On job creation, Kubernetes schedules the job if enough resources are available; otherwise the job is marked as pending until resources are freed.
curl -L -u admin:$PWD -X 'POST' \
"http://${APPLIANCE_HOST}/v2/management/host/scaling" \
-d '{"scaling_mode": "simple"}'
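Because `scaling_mode` accepts only two values, a small wrapper can reject anything else before the request is sent. The following is a minimal sketch, not part of the appliance: the function names `scaling_payload` and `set_scaling_mode` are hypothetical, and the request itself is the same one shown in the command above.

```shell
# Hypothetical helper: prints the JSON body for a scaling request, or fails
# if the mode is not one of the two documented values.
scaling_payload() {
  case "$1" in
    simple|adaptive) printf '{"scaling_mode": "%s"}\n' "$1" ;;
    *) echo "scaling_mode must be 'simple' or 'adaptive'" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper around the documented request. APPLIANCE_HOST and the
# password variable are read from the environment, as in the command above.
set_scaling_mode() {
  body="$(scaling_payload "$1")" || return 1
  curl -L -u "admin:$PWD" -X 'POST' \
    "http://${APPLIANCE_HOST}/v2/management/host/scaling" \
    -d "$body"
}

# Usage sketch (not executed here):
#   set_scaling_mode adaptive
```

Validating on the client side only catches typos early; whether the appliance accepts the mode (for example, adaptive on a node with too few cores) is still decided by the server.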
In adaptive mode, each job is processed with a number of threads that depends on the length of its audio, up to a maximum of 4 threads. For this reason, adaptive mode is only available if the node has at least 4 cores.
Length in seconds (s) | Threads |
---|---|
0 < s <= 300 | 1 |
300 < s <= 600 | 2 |
600 < s <= 900 | 3 |
900 < s <= max | 4 |
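The thresholds in the table can be sketched as a small shell helper, for example to estimate how many threads a batch of files would need. The function name `threads_for_length` is hypothetical and not part of the appliance; it only mirrors the table and expects the length as a whole number of seconds.

```shell
# Hypothetical helper (not part of the appliance): prints the number of
# threads an adaptive job would use for an audio length in whole seconds.
threads_for_length() {
  seconds="$1"
  # Reject empty or non-numeric input.
  case "$seconds" in
    ''|*[!0-9]*) echo "length must be a positive integer" >&2; return 1 ;;
  esac
  if [ "$seconds" -le 0 ]; then
    # The table starts strictly above 0 seconds.
    echo "length must be a positive integer" >&2
    return 1
  elif [ "$seconds" -le 300 ]; then
    echo 1
  elif [ "$seconds" -le 600 ]; then
    echo 2
  elif [ "$seconds" -le 900 ]; then
    echo 3
  else
    echo 4
  fi
}

# Usage sketch:
#   threads_for_length 450
```

Each boundary value belongs to the lower band, matching the `<=` comparisons in the table.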
Since adaptive jobs use multiple threads, they also apply a greater load to the GPU Inference Server (if enabled). As a result, the `max_jobs` configuration setting has been introduced to protect the Inference Server from being overwhelmed; see GPU Configuration.
Realtime Mode
Transcription: Real-Time. Deployments: Virtual Appliance
Realtime mode supports multi-threaded workers by default. Each worker has a configurable number of streams (threads) it can process at a given time; see Realtime GPU configuration for more details. If the maximum number of threads is exceeded, the appliance rejects any new sessions.
The current number of active sessions can be retrieved via the sessions endpoint of the management API.
curl -L -u admin:$PWD -X 'GET' \
"http://${APPLIANCE_HOST}/v2/management/host/realtime/sessions"
This returns a JSON object with one field:
`request_ids` - a list of strings representing the request IDs of active sessions
{
"request_ids": ["session_1_id", "session_2_id"]
}
Before starting a new realtime session, it is best to check the readiness of the realtime service, which reports whether there is remaining capacity for a new session.
curl -L -u admin:$PWD -X 'GET' \
"http://${APPLIANCE_HOST}/v2/management/host/realtime/ready"
This returns a JSON object with one field:
`ready` - true/false. If true, the service has capacity for a new session; otherwise `max_streams` has been reached.
{
"ready": true
}
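The readiness check can be turned into a gate that a client runs before opening a session. This is a minimal sketch under stated assumptions: `is_ready` is a hypothetical helper, and it relies on a simple pattern match rather than a JSON parser, which is only reasonable because the response is a flat object with a single boolean field.

```shell
# Hypothetical helper: reads the readiness JSON on stdin and succeeds
# (exit status 0) only if it contains "ready": true.
# A pattern match is used to avoid extra dependencies; prefer a real JSON
# parser if the response ever becomes more complex.
is_ready() {
  grep -Eq '"ready"[[:space:]]*:[[:space:]]*true'
}

# Usage sketch (not executed here), reusing the documented endpoint:
#   if curl -sL -u "admin:$PWD" -X 'GET' \
#        "http://${APPLIANCE_HOST}/v2/management/host/realtime/ready" | is_ready
#   then
#     echo "capacity available, starting session"
#   else
#     echo "max_streams reached, try again later"
#   fi
```

A readiness result is only a snapshot: another client may take the last stream between the check and the session start, so the client should still handle a rejected session.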