Peng Qian

Originally published at dataleadsfuture.com

DeepSeek-V4 Can't Read Images? I Made It Read

Don't wait for a multimodal model. You can get image reading now.

Introduction

Have you ever had this frustrating moment? You are coding with deepseek-v4 in OpenCode, your code throws an error, you want to screenshot it and send it to DeepSeek, and then you remember that DeepSeek cannot read images.

I have to say deepseek-v4 is cheap, easy to use, and has a long context. It has already become my main coding model. But as of mid-June, DeepSeek still hasn't released a multimodal version. That means it cannot handle anything involving images, such as reading error screenshots, interpreting charts, or recreating pages from visual designs.

I am not the only one frustrated. My friends are all waiting eagerly too.

But I found a way: I developed a small OpenCode plugin called observer that lets deepseek-v4 call a multimodal agent, so it can read images indirectly.
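To make the idea concrete, here is a minimal sketch of the delegation pattern in Python. It is not the actual observer plugin and does not use any real OpenCode or DeepSeek API: `ObserverTool`, `fake_vision_model`, and `build_text_only_prompt` are hypothetical names invented for illustration. It also simplifies one thing: the tool is called directly instead of letting the primary model decide when to call it. The point is only that the image goes to a multimodal model and plain text comes back to the text-only model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signature: (image bytes, question) -> text description.
# A real setup would call a multimodal model here.
VisionFn = Callable[[bytes, str], str]


@dataclass
class ObserverTool:
    """Hypothetical tool that lets a text-only model 'see' an image indirectly."""

    describe_image: VisionFn

    def run(self, image: bytes, question: str) -> str:
        # Forward the image and the question to the multimodal agent,
        # then return plain text that a text-only model can consume.
        description = self.describe_image(image, question)
        return f"[observer] {description}"


def fake_vision_model(image: bytes, question: str) -> str:
    # Placeholder for a real multimodal model call; returns canned text.
    return f"image of {len(image)} bytes; asked: {question}"


def build_text_only_prompt(user_message: str, image: bytes, tool: ObserverTool) -> str:
    # Swap the image for the observer's description so the primary model
    # only ever receives text.
    observation = tool.run(image, user_message)
    return f"{user_message}\n\n{observation}"


tool = ObserverTool(describe_image=fake_vision_model)
prompt = build_text_only_prompt("Why does this fail?", b"\x89PNG", tool)
print(prompt)
```

Swapping `fake_vision_model` for a function that calls a real multimodal model is the only change this sketch would need to return actual image descriptions.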

After more than a month of polishing, this plugin now handles all image-related coding tasks in my daily work. Today, I will share how I built this plugin, hoping it can help you too.

The plugin code and agent definitions mentioned in this article are at the end. Feel free to grab them.


Demo of Real-World Results

Before diving into the long tutorial, you probably care most about how well this plugin works and whether it is worth your time to try. So let me show you some screenshots of the plugin in action.

1. Interpreting error stack traces

We start with the simplest task: have deepseek-v4 interpret a screenshot of an error stack trace and find key information. I randomly picked a screenshot of an error I encountered at work:

A screenshot of a common error stack. Image by Author

Then in OpenCode Desktop, I sent this image to the plan agent using deepseek-v4-pro and asked it to provide a solution:

The deepseek-v4-pro agent quickly picked up the error message. Image by Author

As you can see, the plan agent gave an answer based on the screenshot information.

2. Interpreting charts

Another multimodal use case is interpreting charts from documents. For this example, I took a screenshot of a company's annual revenue chart and tested it. I still used the plan agent with deepseek-v4-pro. For an extra challenge, I asked the agent to give some key insights on the numbers in the chart:

A screenshot of a listed company's financial report. Image by AlphaStreet

The agent read the numbers from the chart and provided some key insights:

The agent accurately spotted the data in the chart and offered key insights. Image by Author

3. Developing HTML pages from designs

In frontend development, the biggest demand for multimodal capability is recreating visual designs. Here I found a design with complex page elements to see if the build agent using deepseek-v4-flash could recreate the page:

A screenshot of a web design draft. Image by dribbble.com

Here is the recreated page:

The page that deepseek-v4-flash recreated. Image by Author

One thing is sure: the deepseek-v4-flash model generated the frontend code, and it only took one prompt to get this result. It did not get a 100% match, but with a few more rounds of conversation, you can tweak it until it is perfect. Keep in mind deepseek-v4-flash is dirt cheap.

It costs several times less, and in some cases around ten times less, than multimodal models like kimi k2.6 or qwen3.7 plus. On price, they are not in the same league.

Of course, you can also crop a section of the page, mark the areas that need attention, and ask DeepSeek to adjust them, like this:

You can take a screenshot of the webpage and have deepseek-v4-flash make adjustments. Image by Author

The multimodal agent picks up the marked area and gives the primary agent an adjustment plan based on your request.

4. Generating HTML pages from hand-drawn sketches

Maybe you are like me and have zero design skills. No problem. We can hand-draw rough sketches. The agent can understand them. For example, in a recent project, I hand-drew a few web page design sketches:

This is a hand-drawn sketch of the webpage. Image by Author

Then deepseek-v4-flash helped me recreate the page:

The agent recreated the page from my hand-drawn sketch. Image by Author

Impressive, right?


Detailed Implementation Walkthrough

Next, you'll find out:

  1. How I designed this plugin and agent.
  2. Why I designed it that way.

Click on my full article to keep reading.
