Querying vision-language models

Using the web console

Please refer to https://readme.fireworks.ai/docs/querying-text-models for accessing the web console.

Using the API

Both the completions API and the chat completions API are supported. However, we recommend that users stick to the chat completions API for simplicity.

Chat Completions API

All vision-language models have a conversation config and have the chat completions API enabled. These models are typically tuned with a specific conversation style for which they perform best. For example, FireLLaVA models use the following template:

SYSTEM: {system message}

USER:<image>
{user message}

ASSISTANT:

The <image> substring is a special token that we insert into the prompt to tell the model where the image appears.
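As a sketch of what the rendered prompt looks like, the following hypothetical helper (the function name is ours, not part of any SDK) fills in the template above for a single-image request:

```python
# Hypothetical helper illustrating the FireLLaVA template above.
# The "<image>" token marks where the image is injected; the actual
# server-side rendering may differ in minor details.
def render_firellava_prompt(system_message: str, user_message: str) -> str:
    return f"SYSTEM: {system_message}\n\nUSER:<image>\n{user_message}\n\nASSISTANT:"

prompt = render_firellava_prompt("Hello", "tell me about the image")
```

This is the same prompt string used in the completions API example later on this page.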

In general, we recommend using the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors, like misplaced whitespace, may result in poor model performance. When using the model as a text-only language model, you can also call the API without the nested message content field (see https://readme.fireworks.ai/docs/querying-text-models for details).

Here are some examples of calling the chat completions API:

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.ChatCompletion.create(
  model = "accounts/fireworks/models/firellava-13b",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

import openai

client = openai.OpenAI(
  base_url = "https://api.fireworks.ai/inference/v1",
  api_key = "<FIREWORKS_API_KEY>",
)
response = client.chat.completions.create(
  model = "accounts/fireworks/models/firellava-13b",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "<FIREWORKS_API_KEY>"

response = openai.ChatCompletion.create(
  model = "accounts/fireworks/models/firellava-13b",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/firellava-13b",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you describe this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
                    }
                }
            ]
        }
    ]
}' \
  --url https://api.fireworks.ai/inference/v1/chat/completions

In the examples above, images are provided via URL. Alternatively, you can provide the string representation of the base64 encoding of the images, prefixed with the MIME type. For example:

import fireworks.client
import base64

# Helper function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# The path to your image
image_path = "your_image.jpg"

# The base64 string of the image
image_base64 = encode_image(image_path)

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.ChatCompletion.create(
  model = "accounts/fireworks/models/firellava-13b",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": f"data:image/jpeg;base64,{image_base64}"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

import openai
import base64

# Helper function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# The path to your image
image_path = "your_image.jpg"

# The base64 string of the image
image_base64 = encode_image(image_path)

client = openai.OpenAI(
  base_url = "https://api.fireworks.ai/inference/v1",
  api_key = "<FIREWORKS_API_KEY>",
)
response = client.chat.completions.create(
  model = "accounts/fireworks/models/firellava-13b",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": f"data:image/jpeg;base64,{image_base64}"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

import openai
import base64

# Helper function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# The path to your image
image_path = "your_image.jpg"

# The base64 string of the image
image_base64 = encode_image(image_path)

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "<FIREWORKS_API_KEY>"

response = openai.ChatCompletion.create(
  model = "accounts/fireworks/models/firellava-13b",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": f"data:image/jpeg;base64,{image_base64}"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/firellava-13b",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you describe this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/jpeg;base64,<image_base64....>"
                    }
                }
            ]
        }
    ]
}' \
  --url https://api.fireworks.ai/inference/v1/chat/completions

Overriding the system prompt

A conversation style may include a default system prompt. For example, the llava-chat style uses the default Llava prompt:

A chat between a curious user and an artificial intelligence. The assistant gives helpful, detailed, and polite answers to the user's questions.

For styles that support a system prompt, you may override this prompt by setting the first message with role system. For example:

[
  {
    "role": "system",
    "content": "You are a pirate."
  },
  {
    "role": "user",
    "content": "Hello, what is your name?"
  }
]

To completely omit the system prompt, you can set content to the empty string.
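As a minimal sketch (the helper name is ours, not part of any SDK), constructing such a message list in Python might look like:

```python
# Hypothetical helper: prepend a system message to a list of chat
# messages to override the default system prompt.
# Passing an empty string omits the system prompt entirely.
def with_system_prompt(messages, system_prompt):
    return [{"role": "system", "content": system_prompt}] + list(messages)

messages = with_system_prompt(
    [{"role": "user", "content": "Hello, what is your name?"}],
    "You are a pirate.",
)
```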

For more details, please refer to the Overriding the system prompt section of https://readme.fireworks.ai/docs/querying-text-models

Completions API

Advanced users can also query the completions API directly. You will need to manually insert the image token <image> where appropriate and supply the images as an ordered list (this is true for the FireLLaVA model, but may change for future vision-language models). For example:

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.Completion.create(
  model = "accounts/fireworks/models/firellava-13b",
  prompt = "SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
  images = ["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)
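For multiple images, one <image> token is needed per image, matching the order of the images list. The helper below is our own illustration of that bookkeeping, not an official utility; whether to place the tokens together or interleave them with text is a formatting choice we make here for simplicity.

```python
# Hypothetical helper: build a FireLLaVA completions prompt with one
# "<image>" token per image. Token order corresponds positionally to
# the order of the images list supplied alongside the prompt.
def build_multi_image_prompt(system_message, user_message, image_urls):
    image_tokens = "<image>" * len(image_urls)
    prompt = (
        f"SYSTEM: {system_message}\n\n"
        f"USER:{image_tokens}\n{user_message}\n\n"
        f"ASSISTANT:"
    )
    return prompt, list(image_urls)

# Placeholder URLs for illustration only.
prompt, images = build_multi_image_prompt(
    "Hello",
    "Compare these two images.",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
```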

API Limitations

Right now, we impose the following limits on the completions API and chat completions API:

  1. A single API request may include at most 30 images, regardless of whether they are provided as base64 strings or URLs
  2. Each image must be smaller than 5MB in size; if downloading the images takes longer than 1.5 seconds, the request is rejected and you will receive an error
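A client-side pre-check mirroring these limits can fail fast before a request is sent. This sketch is our own; the limits themselves are enforced server-side:

```python
# Hypothetical pre-flight check mirroring the documented limits:
# at most 30 images per request, each under 5MB.
MAX_IMAGES = 30
MAX_IMAGE_BYTES = 5 * 1024 * 1024

def validate_images(image_blobs):
    if len(image_blobs) > MAX_IMAGES:
        raise ValueError(f"too many images: {len(image_blobs)} > {MAX_IMAGES}")
    for i, blob in enumerate(image_blobs):
        if len(blob) >= MAX_IMAGE_BYTES:
            raise ValueError(f"image {i} too large: {len(blob)} bytes")

validate_images([b"\xff\xd8\xff" * 1000])  # a small payload passes
```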

Model Limitations

At the moment, FireLLaVA is the only VLM available. Owing to the nature of the model and how it was trained, performance may degrade when a conversation contains multiple images.

Advanced options

Please refer to https://readme.fireworks.ai/docs/querying-text-models for more advanced options and generation parameters.

Managing images

The chat completions API is not stateful, so you have to manage the messages (including images) you pass to the model yourself. However, we cache downloaded images where possible to reduce latency.

For long-running conversations, we suggest passing images via URLs instead of base64. You can also improve latency by downsizing your images ahead of time so they are no larger than they need to be.

Calculating cost

For FireLLaVA-13B, each image is treated as 576 prompt tokens. The pricing is otherwise identical to that of a 13B text model. For more information, please refer to our pricing page.

We currently do not charge differently for high-resolution and low-resolution images; every image costs 576 tokens.
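The token accounting above is simple enough to sketch (the helper name is ours):

```python
# Each image counts as a flat 576 prompt tokens for FireLLaVA-13B,
# regardless of resolution; text tokens are counted as usual.
IMAGE_TOKENS = 576

def firellava_prompt_tokens(text_tokens: int, num_images: int) -> int:
    return text_tokens + IMAGE_TOKENS * num_images

total = firellava_prompt_tokens(100, 2)  # a 100-token prompt with two images
print(total)  # 1252
```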

FAQ

Can I fine-tune the image capabilities of FireLLaVA?

Not right now, but we are working on integrating FireLLaVA with fine-tuning. If you are interested, please reach out to us via Discord.

Can FireLLaVA generate images?

No. However, we have a number of Stable Diffusion-based models deployed on our platform.

Please give these models a try and let us know how it goes!

What type of files can I upload?

We currently support images in .png, .jpg/.jpeg, .gif, .bmp, .tiff, and .ppm formats.

Is there a limit to the size of the image I can upload?

Currently, our API restricts the whole request to 10MB, so an image sent through the request must be smaller than 10MB after base64 encoding. If you are using URLs, each image must be smaller than 5MB.
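Base64 encoding expands data by a factor of 4/3 (every 3 raw bytes become 4 encoded characters), which is worth accounting for when checking against the 10MB request limit:

```python
import base64

# 3 MB of raw bytes become 4 MB of base64 text (a 4/3 expansion),
# so roughly 7.5 MB of raw image data already fills a 10 MB request.
raw = b"\x00" * (3 * 1024 * 1024)
encoded = base64.b64encode(raw)
print(len(encoded))  # 4194304 bytes, i.e. 4 MB
```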

What is the retention policy for the images I upload?

We do not persist images beyond the server lifetime; they are deleted automatically.

How do rate limits work with FireLLaVA?

FireLLaVA is rate limited like all of our other LLM models; the limit depends on your rate-limiting tier. For more information, please check out https://readme.fireworks.ai/page/pricing

Can FireLLaVA understand image metadata?

No. If you have image metadata that you want the model to understand, please provide it through the prompt.