(VLMs) are powerful models capable of taking both images and text as input and responding with text. This allows us to perform visual information extraction on documents and images. In this article, I will discuss the newly released Qwen 3 VL and the powerful capabilities VLMs possess.
Qwen 3 VL was released a few weeks ago, initially with the 235B-A22B model, which is quite a large model. They then released the 30B-A3B, and just now released the dense 4B and 8B versions. My goal for this article is to highlight the capabilities of vision language models and describe them at a high level. I will use Qwen 3 VL as a specific example, though there are many other high-quality VLMs available. I am not affiliated with Qwen in any way when writing this article.

Why do we need vision language models
Vision language models are important because the alternative is to instead rely on OCR and feed the OCR-ed text into an LLM. This has several issues:
- OCR isn’t perfect, and the LLM has to deal with imperfect text extraction
- You lose the information contained in the visual position of the text
Traditional OCR engines like Tesseract have long been central to document processing. OCR has allowed us to take images as input and extract the text from them, enabling further processing of the document’s contents. However, traditional OCR is far from perfect, and it can struggle with issues like small text, skewed images, vertical text, and so on. If you have poor OCR output, you will struggle with all downstream tasks, whether you are using regex or an LLM. Feeding images directly to VLMs, instead of OCR-ed text to LLMs, is so far more effective at utilizing the information in the document.
The visual position of text is often critical to understanding its meaning. Consider the example in the image below, where checkboxes highlight which text is relevant: some checkboxes are ticked and some are not. You then have some text corresponding to each checkbox, where only the text beside the ticked checkbox is relevant. Extracting this information using OCR + LLMs is difficult, because you cannot know which text a ticked checkbox belongs to. Solving this task with vision language models, however, is trivial.

I fed the image above to Qwen 3 VL, and it replied with the response shown below:
Based on the image provided, the documents that are checked off are:
- **Document 1** (marked with an "X")
- **Document 3** (marked with an "X")
**Document 2** is not checked (it is blank).
As you can see, Qwen 3 VL easily solved the problem correctly.
Another reason we need VLMs is that we also get video understanding. Truly understanding video clips would be immensely challenging with OCR, as a lot of the information in videos is not displayed as text, but rather shown directly in the frames. OCR is thus not effective. However, the new generation of VLMs allows you to input hundreds of images, for example representing a video, letting you perform video understanding tasks.
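To make this concrete, below is a minimal sketch of how a clip could be turned into a list of frame images that a VLM can take as input. The video file name and sampling interval are hypothetical, and it assumes OpenCV (opencv-python) is installed:

```python
import cv2  # assumes opencv-python is installed

def sample_frames(video_path: str, every_n_seconds: float = 2.0) -> list[str]:
    """Save one frame every N seconds and return the saved image paths."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_n_seconds), 1)
    frame_paths = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            path = f"frame_{index:06d}.jpg"
            cv2.imwrite(path, frame)
            frame_paths.append(path)
        index += 1
    cap.release()
    return frame_paths

# Hypothetical usage: each saved frame becomes one image entry in the VLM prompt
frame_paths = sample_frames("example-clip.mp4")
```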
Vision language model tasks
There are many tasks you can apply vision language models to. I will discuss a few of the most relevant ones.
- OCR
- Information extraction
The data
I will use the image below as the example image for my testing.

I use this image because it is an example of a real document, very relevant to apply Qwen 3 VL to. Additionally, I have cropped the image to its current shape so that I can feed it at a high resolution into Qwen 3 VL on my local computer. Maintaining a high resolution is critical if you want to perform OCR on the image. I extracted the JPG from a PDF at 600 DPI. Usually, 300 DPI is enough for OCR, but I kept a higher DPI just to be safe, which works for this small image.
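As a rough sketch of that preprocessing step, the conversion from PDF to a high-resolution JPG could look like the snippet below. The file names and crop box are hypothetical, and it assumes the pdf2image package (with its Poppler dependency) is installed:

```python
from pdf2image import convert_from_path  # assumes pdf2image + Poppler are installed

# Render each PDF page to a PIL image at 600 DPI
pages = convert_from_path("example-doc-site-plan.pdf", dpi=600)

# Save the first page, then crop it to the region of interest
# (the crop box coordinates below are placeholders)
page = pages[0]
page.save("example-doc-site-plan.jpg", "JPEG")
cropped = page.crop((0, 0, page.width, page.height // 2))
cropped.save("example-doc-site-plan-cropped.jpg", "JPEG")
```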
Prepare Qwen 3 VL
I need the following packages to run Qwen 3 VL:
torch
accelerate
pillow
torchvision
git+https://github.com/huggingface/transformers
You need to install Transformers from source (GitHub), as Qwen 3 VL is not yet available in the latest Transformers release.
The following code loads the imports, model, and processor, and creates an inference function:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import os
import time

# default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
def _resize_image_if_needed(image_path: str, max_size: int = 1024) -> str:
    """Resize the image if needed to a maximum dimension of max_size, preserving the aspect ratio."""
    img = Image.open(image_path)
    width, height = img.size
    if width <= max_size and height <= max_size:
        return image_path
    ratio = min(max_size / width, max_size / height)
    new_width = int(width * ratio)
    new_height = int(height * ratio)
    img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
    base_name = os.path.splitext(image_path)[0]
    ext = os.path.splitext(image_path)[1]
    resized_path = f"{base_name}_resized{ext}"
    img_resized.save(resized_path)
    return resized_path
def _build_messages(system_prompt: str, user_prompt: str, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
    ]
    user_content = []
    if image_paths:
        # Optionally downscale large images before handing them to the processor
        if max_image_size is not None:
            processed_paths = [_resize_image_if_needed(path, max_image_size) for path in image_paths]
        else:
            processed_paths = image_paths
        user_content.extend([
            {"type": "image", "min_pixels": 512*32*32, "max_pixels": 2048*32*32, "image": image_path}
            for image_path in processed_paths
        ])
    user_content.append({"type": "text", "text": user_prompt})
    messages.append({
        "role": "user",
        "content": user_content,
    })
    return messages
def inference(system_prompt: str, user_prompt: str, max_new_tokens: int = 1024, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = _build_messages(system_prompt, user_prompt, image_paths, max_image_size)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    start_time = time.time()
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the newly generated tokens are decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")
    return output_text[0]
OCR
OCR is a task that most VLMs are trained for. You can, for example, read the technical reports of the Qwen VL models, where they mention how OCR data is part of the training set. To train VLMs to perform OCR, the model is given a series of images and the text contained in those images. The model then learns to extract the text from the images.
I will apply OCR to the image with the prompt below, which is the same prompt the Qwen team uses to perform OCR according to the Qwen 3 VL cookbook.
user_prompt = "Read all the text in the image."
Now I will run the model. I have called the test image we are working on example-doc-site-plan-cropped.jpg.
system_prompt = """
You are a helpful assistant that can answer questions and help with tasks.
"""
user_prompt = "Read all the text in the image."
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
Plan- og
bygningsetaten
Dato: 23.01.2014
Bruker: HKN
Målestokk 1:500
Ekvidistanse 1m
Høydegrunnlag: Oslo lokal
Koordinatsystem: EUREF89 - UTM sone 32
© Plan- og bygningsetaten,
Oslo kommune
Originalformat A3
Adresse:
Camilla Colletts vei 15
Gnr/Bnr:
.
Kartet er sammenstilt for:
.
PlotID: / Best.nr.:
27661 /
Deres ref: Camilla Colletts vei 15
Kommentar:
Gjeldende kommunedelplaner:
KDP-BB, KDP-13, KDP-5
Kartutsnittet gjelder vertikalinvå 2.
I tillegg finnes det regulering i
følgende vertikalinvå:
(Hvis blank: Ingen øvrige.)
Det er ikke registrert
naturn mangfold innenfor
Se tegnforklaring på eget ark.
Beskrivelse:
NR:
Dato:
Revidert dato:
This output is, from my testing, completely correct: it covers all the text in the image and extracts every character correctly.
Information extraction
You can also perform information extraction using vision language models. This can, for example, be used to extract important metadata from images. You typically also want to extract this metadata into a JSON format, so it is easily parsable and can be used for downstream tasks. In this example, I will extract:
- Date – 23.01.2014 in this example
- Address – Camilla Colletts vei 15 in this example
- Gnr (street number) – which in the test image is a blank field
- Målestokk (scale) – 1:500
I run the following code:
user_prompt = """
Extract the following information from the image, and answer in JSON format:
{
    "date": "The date of the document. In format YYYY-MM-DD.",
    "address": "The address mentioned in the document.",
    "gnr": "The street number (Gnr) mentioned in the document.",
    "scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, answer with None. The return object must be a valid JSON object. Answer only with the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
{
    "date": "2014-01-23",
    "address": "Camilla Colletts vei 15",
    "gnr": "15",
    "scale": "1:500"
}
The JSON object is in a valid format, and Qwen has successfully extracted the date, address, and scale fields. Qwen has, however, also returned a gnr. When I first saw this result, I thought it was a hallucination, since the Gnr field in the test image is blank. However, Qwen has made a natural assumption that the Gnr is available in the address, which is correct in this instance.
To be sure of its ability to answer None if it cannot find anything, I asked Qwen to extract the Bnr (building number), which is not available in this example. Running the code below:
user_prompt = """
Extract the following information from the image, and answer in JSON format:
{
    "date": "The date of the document. In format YYYY-MM-DD.",
    "address": "The address mentioned in the document.",
    "Bnr": "The building number (Bnr) mentioned in the document.",
    "scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, answer with None. The return object must be a valid JSON object. Answer only with the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
I get:
{
    "date": "2014-01-23",
    "address": "Camilla Colletts vei 15",
    "Bnr": None,
    "scale": "1:500"
}
So as you can see, Qwen does manage to inform us when information is not present in the document.
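Since the goal of extracting JSON is downstream use, it can also help to parse and lightly validate the model's reply. Below is a minimal sketch that assumes the output looks like the objects above, possibly wrapped in a Markdown code fence and possibly containing a bare None:

```python
import json

def parse_extraction(output: str) -> dict:
    """Parse the model's JSON reply, tolerating code fences and bare None values."""
    text = output.strip().strip("`").strip()
    # Drop a leading "json" language tag left over from a Markdown code fence
    if text.startswith("json"):
        text = text[len("json"):].strip()
    # The prompt asks for None on missing fields, which is not valid JSON
    text = text.replace(": None", ": null")
    return json.loads(text)

# Hypothetical usage with the output from the extraction call above
fields = parse_extraction(output)
print(fields.get("date"), fields.get("scale"))
```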
Vision language models’ downsides
I would also like to note that vision language models have some issues as well. The image I tested OCR and information extraction on is a relatively simple one. To truly test the capabilities of Qwen 3 VL, I would need to expose it to harder tasks, for example extracting more text from a longer document or making it extract more metadata fields.
The main current downsides of VLMs, from what I have seen, are:
- Sometimes missing text with OCR
- Inference is slow
VLMs missing text when performing OCR is something I have observed a few times. When it happens, the VLM typically just misses a section of the document and completely ignores that text. This is naturally very problematic, since it might miss text that is critical for downstream tasks like keyword searches. Why this happens is a complicated topic that is out of scope for this article, but it is a problem you should be aware of if you are performing OCR with VLMs.
Additionally, VLMs require a lot of processing power. I am running locally on my PC, though I am also running a very small model. I started experiencing memory issues when I simply wanted to process an image with dimensions of 2048×2048, which is problematic if I want to perform text extraction from larger documents. You can thus imagine how resource-intensive it is to apply VLMs to any of the following (a simple page-by-page mitigation is sketched after this list):
- More images at once (for example, processing a 10-page document)
- Processing documents at higher resolutions
- Using a larger VLM, with more parameters
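One simple mitigation is to process a document page by page, reusing the inference function from earlier with a capped max_image_size, so each call only handles a single, bounded image. The page file names below are hypothetical:

```python
# Hypothetical list of page images exported from a multi-page document
page_paths = ["doc-page-1.jpg", "doc-page-2.jpg", "doc-page-3.jpg"]

page_texts = []
for path in page_paths:
    # One image per call keeps the visual token count (and memory use) bounded
    page_text = inference(
        system_prompt,
        "Read all the text in the image.",
        max_new_tokens=1024,
        image_paths=[path],
        max_image_size=1536,
    )
    page_texts.append(page_text)

full_text = "\n\n".join(page_texts)
```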
Conclusion
In this article, I have discussed VLMs. I started off by discussing why we need them, highlighting how some tasks require both the text and its visual position. Furthermore, I highlighted some tasks you can perform with VLMs and how Qwen 3 VL was able to handle them. I think the vision modality will become more and more important in the coming years. Up until a year ago, almost all focus was on pure text models. However, to achieve even more powerful models, we need to utilize the vision modality, which is where I believe VLMs will be incredibly important.