Querying text models

Fireworks.ai offers an OpenAI-compatible REST API for querying text models. There are several ways to interact with it:

Using the web console

All Fireworks models can be accessed through the web console at <https://fireworks.ai/>. Clicking on a model will take you to the playground where you can enter a prompt along with additional request parameters.

Non-chat models will use the completions API which passes your input directly into the model.

Models with a conversation config are considered chat models (also known as instruct models). By default, chat models will use the chat completions API which will automatically format your input with the conversation style of the model. Advanced users can revert to the completions API by unchecking the "Use chat template" option.

Using the API

Chat Completions API

Models with a conversation config have the chat completions API enabled. These models are typically tuned with a specific conversation style for which they perform best. For example, Llama chat models use the following template:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message_1} [/INST]

Some templates, like llama-chat, also support multiple chat messages. In general, we recommend using the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors, like misplaced whitespace, may result in poor model performance.

Here are some examples of calling the chat completions API:

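# Using the Fireworks Python client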
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")
response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
)
print(response.choices[0].message.content)
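
# Using the OpenAI Python SDK (v1+) with the Fireworks base URL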
import openai

client = openai.OpenAI(
    base_url = "https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)
response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
)
print(response.choices[0].message.content)
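
# Using the legacy OpenAI Python SDK (pre-v1)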
import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "<FIREWORKS_API_KEY>"

response = openai.ChatCompletion.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
)
print(response.choices[0].message.content)
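
# Using cURL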
curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/llama-v2-7b-chat",
    "messages": [{
      "role": "user",
      "content": "Say this is a test"
    }]
}' \
  --url https://api.fireworks.ai/inference/v1/chat/completions

Completions API

Text models generate text based on the provided input prompt. All text models support this basic completions API. Using this API, the model will successively generate new tokens until either the maximum number of output tokens has been reached or the model generates its special end-of-sequence (EOS) token.

NOTE: Llama-family models will automatically prepend the beginning-of-sequence (BOS) token (<s>) to your prompt input. This is to be consistent with the original implementation.

Here are some examples of calling the completions API:

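# Using the Fireworks Python client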
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")
response = client.completion.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
)

print(response.choices[0].text)
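
# Using the OpenAI Python SDK (v1+) with the Fireworks base URL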
import openai

client = openai.OpenAI(
    base_url = "https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)
response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
)
print(response.choices[0].text)
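
# Using the legacy OpenAI Python SDK (pre-v1)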
import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "<FIREWORKS_API_KEY>"

response = openai.Completion.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
)
print(response.choices[0].text)
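
# Using cURL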
curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/llama-v2-7b",
    "prompt": "Say this is a test"
}' \
  --url https://api.fireworks.ai/inference/v1/completions

Overriding the system prompt

A conversation style may include a default system prompt. For example, the llama-chat style uses the default Llama prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

For styles that support a system prompt, you may override this prompt by setting the first message with role system. For example:

[
  {
    "role": "system",
    "content": "You are a pirate."
  },
  {
    "role": "user",
    "content": "Hello, what is your name?"
  }
]

To completely omit the system prompt, you can set content to the empty string.
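
For example, using the chat completions API shown earlier, the override is passed in the messages list. A minimal sketch with the Fireworks Python client:

from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")
response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[
    # The first message, with role "system", overrides the default system prompt
    {"role": "system", "content": "You are a pirate."},
    {"role": "user", "content": "Hello, what is your name?"},
  ],
)
print(response.choices[0].message.content)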

The process of generating a conversation-formatted prompt will depend on the conversation style used. To verify the exact prompt used, turn on echo.

Getting usage info

The returned object will contain a usage field with the following counts (see the example after this list):

  • The number of prompt tokens ingested
  • The number of completion tokens (i.e. the number of tokens generated)
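
For example, with the Fireworks Python client from the earlier examples, the counts can be read directly off the response object (a minimal sketch; the field names follow the OpenAI-compatible response schema):

response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.usage.prompt_tokens)      # number of prompt tokens ingested
print(response.usage.completion_tokens)  # number of tokens generated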

Advanced options

See the API reference for the completions and chat completions APIs for a detailed description of these options.

Streaming

By default, results are returned to the client once the generation is finished. Another option is to stream the results back, which is useful for chat use cases where the client can incrementally see results as each token is generated.

Here is an example with the completions API:

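# Using the Fireworks Python client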
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")
response_generator = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  stream=True,
)

for chunk in response_generator:
    print(chunk.choices[0].text)
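
# Using the OpenAI Python SDK (v1+) with the Fireworks base URL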
import openai

client = openai.OpenAI(
    base_url = "https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)
response_generator = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  stream=True,
)

for chunk in response_generator:
    print(chunk.choices[0].text)
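
# Using the legacy OpenAI Python SDK (pre-v1)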
import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "<FIREWORKS_API_KEY>"

response_generator = openai.Completion.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  stream=True,
)

for chunk in response_generator:
    print(chunk.choices[0].text, end="")
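
# Using cURL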
curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/llama-v2-7b",
    "prompt": "Say this is a test",
    "stream": true
}' \
  --url https://api.fireworks.ai/inference/v1/completions

and one with the chat completions API:

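# Using the Fireworks Python client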
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")
response_generator = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
  stream=True,
)
for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
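
# Using the OpenAI Python SDK (v1+) with the Fireworks base URL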
import openai

client = openai.OpenAI(
    base_url = "https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)
response_generator = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
  stream=True,
)

for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

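# Using the legacy OpenAI Python SDK (pre-v1)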
import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "<FIREWORKS_API_KEY>"

response_generator = openai.ChatCompletion.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
  stream=True,
)

for chunk in response_generator:
    if "content" in chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end="")

Async mode

The Python client library also supports asynchronous mode for both completion and chat completion.

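# Using the Fireworks Python client (AsyncFireworks)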
import asyncio
from fireworks.client import AsyncFireworks

client = AsyncFireworks(api_key="<FIREWORKS_API_KEY>")

async def main():
    stream = await client.completions.create(
        model="accounts/fireworks/models/llama-v2-7b",
        prompt="Say this is a test",
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].text, end="")

asyncio.run(main())
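
# Using the OpenAI Python SDK (v1+, AsyncOpenAI)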
import asyncio
import openai

client = openai.AsyncOpenAI(
    base_url = "https://api.fireworks.ai/inference/v1",
    api_key="sTn70521Te5NWDpDJGoyKHpU5jJxix2PtvRwH8bfjzfiKSUW",
)

async def main():
    stream = await client.completions.create(
        model="accounts/fireworks/models/llama-v2-7b",
        prompt="Say this is a test",
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].text, end="")

asyncio.run(main())

Sampling options

The API generates text auto-regressively: each new token is chosen by sampling from the model's probability distribution over its token vocabulary. The options below control how that sampling is performed.

Multiple choices

By default the API will return a single generation choice per request. You can create multiple generations by setting the n parameter to the number of desired choices. The returned choices array will contain the result of each generation.
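
For example, reusing the client object from the earlier examples (a sketch; n=2 is an illustrative value):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  n=2,  # request two independent generations
)
for choice in response.choices:
    print(choice.index, choice.text)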

Temperature

Temperature allows you to configure how much randomness you want in the generated text. A higher temperature leads to more "creative" results. On the other hand, setting a temperature of 0 allows you to generate deterministic results, which is useful for testing and debugging.
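
For example (a sketch reusing the client from above; the value is illustrative):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  temperature=0.0,  # deterministic output; higher values (e.g. 1.0) give more varied text
)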

Top-p

Top-p (also called nucleus sampling) is an alternative to sampling with temperature, where the model considers only the tokens comprising the top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% of probability mass are considered.
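
For example (a sketch; the value is illustrative):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  top_p=0.9,  # sample only from the smallest set of tokens covering 90% of the probability mass
)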

Top-k

Top-k is another sampling method in which only the k most probable tokens are kept and the probability mass is redistributed among them.
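
For example (a sketch; the parameter name top_k and the value 40 are assumptions for illustration):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  top_k=40,  # sample only from the 40 most probable tokens at each step
)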

Repetition penalty

LLMs are sometimes prone to repeating a single character or sentence. Using a frequency and presence penalty can reduce the likelihood of sampling repetitive sequences of tokens (see the example after the definitions below). They work by directly modifying the model's logits (un-normalized log-probabilities) with an additive contribution:

logits[j] -= c[j] * frequency_penalty + (c[j] > 0 ? 1 : 0) * presence_penalty

where

  • logits[j] is the logit of the j-th token
  • c[j] is how often that token was sampled before the current position
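
For example (a sketch reusing the client from above; the penalty values are illustrative):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  frequency_penalty=0.5,  # penalty grows with how often a token has already been sampled
  presence_penalty=0.5,   # flat penalty once a token has appeared at all
)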

Logit Bias

Parameter that modifies the likelihood of specified tokens appearing. Pass in a Dict[int, float] that maps a token_id to a logit bias value between -200.0 and 200.0. For example:

client.completions.create(
  model="...",
  prompt="...",
  logit_bias={0: 10.0, 2: -50.0}
)

Debugging options

Logprobs

Setting the logprobs parameter will return the log probabilities of the logprobs+1 most likely tokens (the chosen token plus logprobs additional tokens).

The log probabilities will be returned in a LogProbs object for each choice.

  • tokens contains each token of the chosen result.
  • token_ids contains the integer IDs of each token of the chosen result.
  • token_logprobs contains the logprobs of each chosen token.
  • top_logprobs will be a list whose length is the number of tokens of the output. Each element is a dictionary of size logprobs, from the most likely tokens at the given position to their respective log probabilities.

When used in conjunction with echo, this option can be set to see how the model tokenized your input.
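
For example, with the completions API (a sketch; logprobs=3 is an illustrative value):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  logprobs=3,  # return the chosen token plus the 3 next most likely alternatives
  echo=True,   # include the prompt so its tokenization is visible too
)
logprobs = response.choices[0].logprobs
print(logprobs.tokens)          # tokens of the echoed prompt and the completion
print(logprobs.token_logprobs)  # log probability of each of those tokens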

top_logprobs

Setting the top_logprobs parameter to an integer value in conjunction with logprobs=True will also return the above information, but in an OpenAI client-compatible format.
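
For example, with the chat completions API (a sketch; the value 3 is illustrative):

response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v2-7b-chat",
  messages=[{"role": "user", "content": "Say this is a test"}],
  logprobs=True,
  top_logprobs=3,  # also return the 3 most likely alternatives for each output token
)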

Echo

Setting the echo option to true will cause the API to return the prompt along with the generated text. This can be used in conjunction with the chat completions API to verify the prompt template used. It can also be used in conjunction with logprobs to see how the model tokenized your input.
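
For example, with the completions API (a sketch; the printed text should contain the prompt followed by the completion):

response = client.completions.create(
  model="accounts/fireworks/models/llama-v2-7b",
  prompt="Say this is a test",
  echo=True,
)
print(response.choices[0].text)  # prompt + generated text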

Appendix

Tokenization

Language models read and write text in chunks called tokens. In English, a token can be as short as one character or as long as one word (e.g., a or apple), and in some languages tokens can be shorter than one character or longer than one word.

Different model families use different tokenizers, so the same text may be translated into a different number of tokens depending on the model. This means that generation cost may vary between models even if the model sizes are the same. For the Llama model family, you can use this tool to estimate token counts. The actual number of tokens used in the prompt and generation is returned in the usage field of the API response.