DeepSeek-V4 Can't Read Images? I Made It Read Them

#vibecoding #ai #programming #coding

Don't wait for a multimodal model. You can give DeepSeek image support now.

Introduction

Have you ever had this frustrating moment? You are coding with deepseek-v4 in OpenCode, your code throws an error, and you want to send a screenshot of it to DeepSeek. Then you remember that DeepSeek cannot read images.

I have to say deepseek-v4 is cheap, easy to use, and has a long context. It has already become my main coding model. But as of mid-June, DeepSeek still hasn't released a multimodal version. That means it cannot handle anything involving images, such as reading error screenshots, interpreting charts, or recreating pages from visual designs.

I am not the only one frustrated. My friends are all waiting eagerly too.

But I found a way: I developed a small OpenCode plugin called observer. It lets deepseek-v4 call a multimodal agent, which gives it the ability to read images indirectly.
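To make the idea concrete, here is a minimal sketch of the delegation pattern. It is not the actual observer plugin code, and it does not use the real OpenCode plugin API. Every name in it (ImageDescriber, TextModel, buildPrompt, answerWithImage) is hypothetical. It only assumes two things: some vision-capable agent can turn an image into text, and the text-only model then answers based on that text.

```typescript
// Hypothetical types for illustration only; they are not part of OpenCode or observer.
// ImageDescriber stands in for a call to a vision-capable agent.
type ImageDescriber = (imagePath: string, question: string) => Promise<string>;
// TextModel stands in for a call to a text-only model such as deepseek-v4.
type TextModel = (prompt: string) => Promise<string>;

// Build the text-only prompt that the primary model receives in place of the image.
function buildPrompt(question: string, imagePath: string, description: string): string {
  return [
    `User question: ${question}`,
    `Attached image: ${imagePath}`,
    "Description from the vision agent:",
    description,
  ].join("\n");
}

// Delegate the image to the vision agent, then pass its text output to the text-only model.
async function answerWithImage(
  question: string,
  imagePath: string,
  describe: ImageDescriber,
  textModel: TextModel,
): Promise<string> {
  const description = await describe(imagePath, question);
  return textModel(buildPrompt(question, imagePath, description));
}
```

The point of the sketch is that the text-only model never sees the image itself, only the description, so the quality of the final answer depends on how well the vision agent describes what matters for the question.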

After more than a month of polishing, this plugin now handles all image-related coding tasks in my daily work. Today, I will share how I built this plugin, hoping it can help you too.

The plugin code and agent definitions mentioned in this article are at the end of the article. Feel free to grab them.

Demo of Real-World Results

Before diving into the long tutorial, you probably care most about how well this plugin works and whether it is worth your time to try. So let me show you some screenshots of the plugin in action.

1. Interpreting error stack traces

We start with the simplest task: have deepseek-v4 interpret a screenshot of an error stack trace and find key information. I randomly picked a screenshot of an error I encountered at work:

Then, in OpenCode Desktop, I sent this image to the plan agent, which was running deepseek-v4-pro, and asked it to provide a solution:

As you can see, the plan agent gave an answer based on the screenshot information.

2. Interpreting charts

Another multimodal use case is interpreting charts from documents. For this example, I took a screenshot of a company's annual revenue chart and tested the plugin on it. Again I used the plan agent with deepseek-v4-pro. For an extra challenge, I asked the agent to give some key insights on the numbers in the chart:

The agent read the numbers from the chart and provided some key insights:

3. Developing HTML pages from designs

In frontend development, the biggest demand for multimodal capability is recreating visual designs. Here I found a design with complex page elements to see if the build agent using deepseek-v4-flash could recreate the page:

Here is the recreated page:

One thing is certain: the deepseek-v4-flash model generated the frontend code, and it took only one prompt to get this result. The match is not 100%, but with a few more rounds of conversation you can tweak it until it is right. Keep in mind that deepseek-v4-flash is dirt cheap.

deepseek-v4-flash costs a fraction, sometimes as little as a tenth, of what multimodal models like kimi k2.6 or qwen3.7 plus cost. On price, they are not in the same league.

Of course, you can also crop a section of the page, mark the areas that need attention, and ask DeepSeek to adjust them, like this:

The multimodal agent picks out the marked area and gives the primary agent an adjustment plan based on your request.

4. Generating HTML pages from hand-drawn sketches

Maybe you are like me and have zero design skills. No problem: you can hand-draw rough sketches, and the agent can understand them. For example, in a recent project, I hand-drew a few web page design sketches: