Using dedicated deployments

Fireworks allows you to create dedicated deployments that are reserved for your own use. This has several advantages over the shared deployment Fireworks uses for its serverless models:

  • Predictable performance unaffected by load caused by other users
  • No hard rate limits - but subject to the maximum load capacity of the deployment
  • Cheaper under high utilization
  • Access to a larger selection of models, including ones not available via our serverless models
  • Custom base models (coming soon)

Need extra performance or want your dedicated deployment to be personally configured? Feel free to directly schedule time with our PM here (https://calendly.com/raythai)

Creating a dedicated deployment

To create a dedicated deployment, first import a base model into your account. This creates a new model name that can be used to route traffic to the deployment. This step only needs to be done once for each base model.

firectl import model <MODEL_ID>

Make a note of the name of the cloned model; it should look like accounts/<ACCOUNT_ID>/models/<MODEL_ID>. You will need this string when querying the deployment. See the "all models" list on our models page for a full list of available models and model IDs for dedicated deployments. Only text models are available for dedicated deployments.

Next, create a new deployment:

firectl create deployment <MODEL_ID> --wait

This command will complete when the deployment is READY. To let it run asynchronously, remove the --wait flag.

NOTE: The deployment ID is the last part of accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>.
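If you have the full resource name in hand, the deployment ID can be pulled out with plain shell parameter expansion. This is a minimal sketch: the function name deployment_id_from_name and the example IDs are hypothetical placeholders, not part of firectl.

```shell
# Hypothetical helper: return the last path segment of a deployment resource name.
deployment_id_from_name() {
  # ${1##*/} strips everything up to and including the final "/".
  printf '%s\n' "${1##*/}"
}

# Example with placeholder values (not real IDs):
deployment_id_from_name "accounts/my-account/deployments/abc123"
```

The result can then be passed wherever <DEPLOYMENT_ID> is expected.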

You can verify the deployment is complete by running:

firectl get deployment <DEPLOYMENT_ID>
# OR
firectl get model <MODEL_ID>

The state field should show READY for the deployment. For the model, it should show DEPLOYED, with the deployment ID set.

By default, the deployment will automatically delete itself if unused (i.e. no inference requests) for 1 hour. To disable auto-deletion, pass the --unused-auto-delete-duration=0 flag to firectl create deployment or firectl update deployment.

Querying a model

Querying a model deployed to a dedicated deployment is the same as querying any other model. The model name will be the name of the cloned model or the PEFT addon you deploy. See Querying text models for details.

curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>",
    "prompt": "Say this is a test"
}' \
  --url https://api.fireworks.ai/inference/v1/completions
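The request above can also be wrapped in small shell helpers so that the API key comes from the environment rather than being pasted inline. This is a sketch under stated assumptions: the function names model_resource_name and query_model are hypothetical, FIREWORKS_API_KEY is assumed to be exported as an environment variable, and the prompt is assumed to contain no characters that need JSON escaping.

```shell
# Hypothetical helper: build the model resource name used in the "model" field.
model_resource_name() {
  printf 'accounts/%s/models/%s\n' "$1" "$2"
}

# Hypothetical helper: send a completion request to a model by resource name.
# Assumes FIREWORKS_API_KEY is exported in the environment and that the
# prompt ($2) contains no quotes or backslashes that would need JSON escaping.
query_model() {
  curl \
    --header "Authorization: Bearer ${FIREWORKS_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data "{\"model\": \"$1\", \"prompt\": \"$2\"}" \
    --url https://api.fireworks.ai/inference/v1/completions
}

# Usage with placeholder IDs (uncomment once the deployment is READY):
# query_model "$(model_resource_name my-account my-model)" "Say this is a test"
```

For prompts with arbitrary content, building the JSON body with a dedicated JSON tool is safer than string interpolation.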

Deleting a deployment

To delete a deployment, run:

firectl delete deployment <DEPLOYMENT_ID>

Fireworks also supports auto-deletion of unused deployments.

Deployment options

Auto-deletion for unused deployments

By default, the deployment will delete itself if it is unused (i.e. receives no inference requests) for one hour, or if you reach the spend limit for your account. To configure the auto-deletion duration, pass the --unused-auto-delete-duration flag to firectl create deployment or firectl update deployment. For example:

firectl create deployment <MODEL_ID> --unused-auto-delete-duration 1h
firectl update deployment <DEPLOYMENT_ID> --unused-auto-delete-duration 1h

Refer to time.ParseDuration for valid syntax for the duration string.

To disable auto-deletion, pass 0 for the duration.
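Go's time.ParseDuration accepts strings made of number and unit pairs, such as 90m or 1h30m, with the units ns, us, ms, s, m and h; a bare 0 is also valid. The sketch below is a rough local sanity check before passing a value to the flag. The function name is_duration_string is hypothetical, and the pattern deliberately covers only a common non-negative subset of the full syntax.

```shell
# Hypothetical helper: succeed if $1 looks like a duration that Go's
# time.ParseDuration accepts. Covers a common subset only: "0", or one or
# more non-negative <number><unit> pairs with units ns, us, ms, s, m, h.
is_duration_string() {
  printf '%s\n' "$1" | grep -Eq '^(0|([0-9]+(\.[0-9]+)?(ns|us|ms|s|m|h))+)$'
}

# Try a few candidate values; a bare number or spelled-out unit is rejected.
for value in 1h 90m 1h30m 0 90 "1 hour"; do
  if is_duration_string "$value"; then
    echo "ok:       $value"
  else
    echo "rejected: $value"
  fi
done
```

A value that passes this check may still be rejected by firectl for other reasons; the check only catches obvious syntax mistakes such as a missing unit.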

World size / sharding (vertical scaling)

The number of GPUs used per replica is specified by passing the --world-size flag. Increasing the world size will improve the generation speed, time-to-first-token, and maximum QPS for your deployment; however, the scaling is sub-linear. The default value for most models is 1 but may be higher for larger models that require sharding.

firectl create deployment <MODEL_ID> --world-size 2
firectl update deployment <DEPLOYMENT_ID> --world-size 2

Replica count (horizontal scaling)

The number of replicas (horizontal scaling) is specified by passing the --min-replica-count and --max-replica-count flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. Setting --max-replica-count to be higher than --min-replica-count will enable automatic scaling between the two replica counts based on load (batch occupancy). The default value for --min-replica-count is 1. The default value for --max-replica-count is the value of --min-replica-count. For example:

firectl create deployment <MODEL_ID> \
  --min-replica-count 2 \
  --max-replica-count 3
firectl update deployment <DEPLOYMENT_ID> \
  --min-replica-count 2 \
  --max-replica-count 3

Setting --min-replica-count=0 will scale the deployment down to 0 replicas after 1 hour of no traffic. While the deployment has 0 replicas, any new request will scale it back up to 1 replica. Requests made while the deployment is scaling from 0 to 1 replica may see 1 to 2 minutes of added latency.

Deploying PEFT addons

See Deploying fine-tuned models for instructions on how to upload PEFT addons. To deploy a PEFT addon to a dedicated deployment, pass the --deployment-id flag to firectl deploy. For example:

firectl deploy <MODEL_ID> --deployment-id <DEPLOYMENT_ID>

The base model of the deployment must match the base model of the addon.

Available base models

The list of available base models can be found on our models page. You can find the model ID by clicking on the "Deploy" button.

Hardware

We currently only offer NVIDIA A100 80 GB GPUs to Developer and Business accounts. NVIDIA H100 80 GB GPUs are available to Enterprise accounts.

Pricing

Dedicated deployments are billed by GPU-second. Consult our pricing page for details.
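As a rough way to reason about usage, GPU-seconds can be estimated from the world size, the replica count, and how long the deployment runs. This is a sketch, not the billing formula: the function name gpu_seconds is hypothetical, and it assumes that every GPU of every replica counts toward usage and that the replica count stays constant over the period (with autoscaling it will vary).

```shell
# Hypothetical helper: estimated GPU-seconds consumed by a deployment.
# Arguments: world size (GPUs per replica), replica count, duration in seconds.
# Assumes a constant replica count for the whole duration.
gpu_seconds() {
  echo $(( $1 * $2 * $3 ))
}

# Two GPUs per replica, three replicas, one hour:
gpu_seconds 2 3 3600   # prints 21600
```

Multiply the result by the per-GPU-second rate listed on the pricing page to get an approximate cost.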