Are Llama Vision Models 'Blind,' or Do I Just Not Know How to Use Them?

Hey everyone,

I’m working on an app where users can make a sketch, then the app sends the image to a model to generate a basic HTML page from it. This setup works fine with models like ‘4o’ and others, but with the Llama Vision models, it feels like they don’t even recognize the image I’m sending. I’ve tried various prompts, but nothing seems to help—they can’t even accurately describe what they’re seeing.

Is this behavior normal for Llama Vision, or is there a specific way to use these models and guide them to produce the output I want?

Here is one of my tests:

1 Like

Hi @malawad, can you please confirm whether you are using the following request format for the vision models?
[
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "base64 encoded string of image" } }
]

A sample request body is given here: Sambanova_cloud_api_reference
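For reference, here is a small sketch (not from the official docs) of how that content array could be assembled in Python. It assumes the `data:` URL form for the embedded image, and `build_vision_content` is a hypothetical helper name:

```python
import base64

def encode_image(path):
    """Read an image file and return its base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_vision_content(prompt, base64_image, mime="image/jpeg"):
    """Assemble the OpenAI-style multimodal content array:
    one text part plus one image_url part with a data: URL."""
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{base64_image}"},
        },
    ]
```

The resulting list is what goes into the `content` field of a single user message.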

Thanks & Regards

3 Likes

Hi @omkar.gangan, that's not the format used in the cloud console API reference, which shows this:

[
    {
        "type": "text",
        "text": "What do you see in this image"
    },
    {
        "type": "image_url",
        "text": "<image_in_base_64>"
    }
]

I can't test any more at the moment because I've exceeded the rate limit.

1 Like

Hi @malawad,
We are currently updating the console documentation to improve clarity and usability. Regarding your testing, please note that the rate limit is applied per minute, so you should be able to try again after a short wait.
Let me know if you need any further assistance!

Thanks & Regards

3 Likes

Hi @shivani.moze,

Thanks so much for your help! I tried the format suggested by @omkar.gangan, but I keep running into issues like “Rate limit exceeded” or “unexpected_error.” I’m not quite sure how to troubleshoot this. If you have a code example I can try locally to see how the vision model works, that would be incredibly helpful.

Thank you!

1 Like

Hi @malawad ,
Can you please try this code example:

import base64
import openai

# Encode a local image file as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

client = openai.OpenAI(
    api_key="SAMBANOVA_API_KEY",  # replace with your actual API key
    base_url="https://api.sambanova.ai/v1",
)

# Path to your image
image_path = "/Users/omkarg/Downloads/IMG_2117.jpg"  # your image path

# Get the base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)

“Llama-3.2-11B-Vision-Instruct” has a rate limit of 10 requests per minute.
“Llama-3.2-90B-Vision-Instruct” has a rate limit of 1 request per minute (temporarily limited due to high demand).

More information about rate limits can be found at rate_limits, and you can also check api_error_codes.
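Since the per-minute limit is easy to hit during testing, one common approach is to wrap the API call in a simple retry loop with exponential backoff. This is just a sketch, not an official client feature; `with_retries` is a hypothetical helper, and it assumes the rate-limit error message contains the words "rate limit":

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=2.0):
    """Invoke call() and retry with exponential backoff plus jitter
    whenever it raises an error that looks like a rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            is_rate_limit = "rate limit" in str(exc).lower()
            if not is_rate_limit or attempt == max_attempts - 1:
                raise  # not a rate-limit error, or out of attempts
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

You would then call, for example, `with_retries(lambda: client.chat.completions.create(...))`.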

Thanks & Regards

2 Likes

Hi @omkar.gangan,

Thanks so much for your help! Please correct me if I'm wrong, but it turns out the model doesn't accept a system prompt, which is why I kept getting the “unexpected_error” message. Once I removed the system prompt, it started working. Yet now, whenever I ask it to create HTML from an image, it just replies:

"choices": [
    {
        "finish_reason": "stop",
        "index": 0,
        "logprobs": null,
        "message": {
            "content": "I can't help with that.",
            "role": "assistant"
        }
    }
]

It’s a bit frustrating haha! On the plus side, it does describe the image to some degree, so that’s a start. Now, I just need to figure out how to communicate with it effectively. I really appreciate your assistance!

2 Likes

We will review whether the vision models accept a system prompt or not and get back to you as soon as we can with any further information.

Kind Regards

1 Like

@malawad I was able to get something working when I changed my text prompt to this:

"text": "Please generate the HTML to build the item in this image"

And it gave me both the HTML and CSS for the image of a web table that I provided. Have you tried passing what would normally have been in your system prompt into the text?
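A sketch of that idea: if the vision models reject a separate system role, fold the system instructions into the user text before building the message. `fold_system_prompt` is a hypothetical helper name, and the message shape assumes the `data:` URL format from the earlier example:

```python
def fold_system_prompt(system_prompt, user_prompt, base64_image):
    """Merge system-style instructions into the user text, since the
    vision models reportedly reject a separate system message."""
    combined = f"{system_prompt.strip()}\n\n{user_prompt.strip()}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": combined},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ]
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create(...)`.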

Coby

2 Likes

Thank you so much @omkar.gangan and @coby.adams for your help. I've put the app on Showcase; I'd really appreciate feedback there.

3 Likes