A Workflow for Creating Videos with Veo 3: Long Videos, Consistent Characters, Camera-Angle Control

How do you turn an idea in your head into a video that matches it using Google's Veo 3? How do you make a video longer than 8 seconds without the character getting “a different face in every scene”? How do you control cinematic camera angles in Flow? Why does entering Vietnamese dialogue in Veo 3 often trigger an error? These are questions many people run into when they start making AI videos with Veo 3.

In this section, Trùm Tài Khoản shares a practical Veo 3 video workflow, covering:

Writing a detailed English prompt in the simplest way possible
Creating a cinematic video that matches your intent
Making a long, multi-scene video
Keeping the character stable
Controlling camera angles, audio, and dialogue

Step 1: Finalize the input prompt for Veo 3

As with other AI tools, the prompt is the single biggest factor in video quality (roughly 80% of it, as a rule of thumb). With Veo 3, the more detailed the prompt:

The closer the video matches your intent
The more stable the character
The easier the camera angles are to control

Below is a cinematic-style prompt template that suits most videos of real people, everyday conversations, or short ads.

Cinematic Prompt Template for Veo 3

SPECS: Camera specs, lenses etc

NO CAPTIONS OR TEXT.

[CHARACTER NAME] ([Age], [2–3 distinctive physical features]) [primary action]

[LOCATION DESCRIPTION]. [1–2 sentences establishing visual environment and lighting].

WIDE SHOT: [Description of what’s visible in wide frame].

MEDIUM SHOT: [Focus on mid-range detail or character position].

CLOSE-UP: [Specific facial feature or important object detail].

EXTREME CLOSE-UP: [Micro detail that communicates emotion or importance].

DIALOGUE:

[CHARACTER NAME] (voice quality description, translate to Vietnamese and say): “First line…”

[CHARACTER NAME] (voice quality description, translate to Vietnamese and say): “Second line…”

[Brief description of environmental movement or lighting change].

AUDIO: [Background sounds], [ambient noises], [voice characteristics], [acoustic qualities].

KEY ELEMENTS: [4–5 essential thematic or visual elements].

Quick explanation of the prompt structure

1. SPECS – Camera specifications

Use it to describe:

Professional camera / phone
Wide angle, focal length (24mm, 35mm…)
Shooting style: handheld, cinematic, static…

2. CHARACTER NAME – Character

Keep the description as fixed as possible, including:

Age
2–3 distinctive physical features
→ This part is extremely important if you want the character to look the same across scenes.

3. LOCATION DESCRIPTION – Setting

Describe:

The space
Time of day (morning / night)
Lighting (warm light, cinematic lighting…)

4. Controlling camera angles

You can use:

WIDE SHOT – the full scene
MEDIUM SHOT – a mid-range view
CLOSE-UP – close on the face
EXTREME CLOSE-UP – a very tight detail

👉 You can drop some shots or reorder them depending on the video.

5. DIALOGUE – Character lines

Important tip:
If entering Vietnamese dialogue directly makes Veo 3 report an error:

Write the dialogue in English
Add the phrase: “translate to Vietnamese and say”

👉 Because Veo 3 currently supports only English, this is a very effective workaround.
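The dialogue tip can be sketched as a small Python helper. This is only an illustration of the formatting: the function name `dialogue_line` and its parameters are hypothetical, and the only thing taken from the article is the phrase “translate to Vietnamese and say” and the shape of a DIALOGUE entry in the template.

```python
def dialogue_line(name: str, voice: str, english_line: str) -> str:
    """Format one DIALOGUE entry in the template's shape.

    The spoken line stays in English; the parenthetical carries the
    "translate to Vietnamese and say" instruction from the tip above.
    """
    return f'{name} ({voice}, translate to Vietnamese and say): "{english_line}"'


line = dialogue_line(
    "TEACHER",
    "authoritative, clear Southern Vietnamese accent",
    "First lesson: when a ghost scares you, absolutely do not scream.",
)
print(line)
```

Keeping the instruction inside the parenthetical, next to the voice description, mirrors where the template places it.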

6. AUDIO – Sound

Describe:

Ambient sound
Voice
Background effects

7. KEY ELEMENTS – Core elements

Helps the AI keep a consistent style: cinematic, dark tone, comedy, horror, vlog…
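To make the section order concrete, here is a minimal Python sketch that assembles a prompt in the same order as the template. The class `ScenePrompt` and its field names are assumptions made for this illustration; they are not part of Veo 3 or Flow, which only receive the final text.

```python
from dataclasses import dataclass, field


@dataclass
class ScenePrompt:
    """Hypothetical container for the template sections described above."""
    specs: str
    character: str
    location: str
    shots: dict[str, str]  # shot label -> description, in the order to render
    dialogue: list[str] = field(default_factory=list)
    movement: str = ""
    audio: str = ""
    key_elements: str = ""

    def render(self) -> str:
        # Fixed header sections first, exactly as in the template.
        parts = [
            f"SPECS: {self.specs}",
            "NO CAPTIONS OR TEXT.",
            self.character,
            self.location,
        ]
        # Camera shots can be dropped or reordered by editing the dict.
        parts += [f"{shot}: {text}" for shot, text in self.shots.items()]
        if self.dialogue:
            parts.append("DIALOGUE:")
            parts += self.dialogue
        if self.movement:
            parts.append(self.movement)
        if self.audio:
            parts.append(f"AUDIO: {self.audio}")
        if self.key_elements:
            parts.append(f"KEY ELEMENTS: {self.key_elements}")
        return "\n\n".join(parts)


prompt_text = ScenePrompt(
    specs="Camera: ARRI Alexa Mini, Lens: Cooke S4/i 32mm.",
    character="TEACHER (40s, dark suit, black sunglasses) stands behind a lectern.",
    location="An old classroom. A single fluorescent tube flickers overhead.",
    shots={"WIDE SHOT": "The whole classroom.", "CLOSE-UP": "The teacher's smirk."},
    audio="Low electric hum, teacher's voice.",
    key_elements="Mysterious teacher, eerie classroom.",
).render()
print(prompt_text)
```

Optional sections are skipped when left empty, which matches the note that shots can be removed or rearranged per video.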

Tip for creating prompts quickly without strong English

You should:

Save this template as a .txt or .md file
Upload the file to Gemini or ChatGPT
Briefly describe your idea in Vietnamese
Ask the AI to generate a prompt that follows the template

This approach:

Saves time
Is easy to edit
Works very well for beginners
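If you prefer to paste everything as one message instead of uploading a file, the request can be put together like this. It is a sketch only: the function `build_request` and the exact instruction wording are my assumptions, and the resulting text is meant to be pasted into Gemini or ChatGPT by hand, so no API is called.

```python
def build_request(template_text: str, idea: str) -> str:
    """Combine the saved template and a short idea into one chat message."""
    return (
        "Fill in the prompt template below for a Veo 3 video.\n"
        "Keep every section heading and write the result in English.\n"
        'In each dialogue line, keep the instruction "translate to Vietnamese and say".\n\n'
        f"TEMPLATE:\n{template_text}\n\n"
        f"IDEA (in Vietnamese):\n{idea}\n"
    )


# A shortened stand-in for the saved template file.
template_text = "SPECS: ...\n\nNO CAPTIONS OR TEXT.\n\nDIALOGUE:\n\nAUDIO: ..."
idea = "Một thầy giáo đeo kính đen dạy lớp học sinh tồn khi gặp ma."

request = build_request(template_text, idea)
print(request)
```

The idea stays in Vietnamese, as the tip suggests; only the generated prompt needs to come back in English.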

Step 2: Create the video with Veo 3 in Google Flow

Once the prompt is complete, open Google Flow to start creating the video.

Important settings in Flow

Select Text to Video
Outputs per prompt: 1
Model: Veo 3 (Quality) to get audio
Paste prompt → Generate

After about 1–3 minutes, an 8-second video is generated.

How do you create a video longer than 8 seconds with Veo 3?

Veo 3 does not currently support reference images (only Veo 2 does, but without audio). The most effective approach is therefore:

Split the script into multiple scenes

Each scene = 1 prompt = one 8-second video
Keep the character + environment description unchanged
Change only the action, dialogue, and camera angles
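The rule "fixed description, variable action" can be sketched in Python as follows. The constant and function names are hypothetical, and the character and setting text is an abbreviated stand-in for a full description; the point is only that the shared blocks are written once and reused verbatim in every scene prompt.

```python
# Shared blocks: written once, pasted unchanged into every scene prompt.
CHARACTER = (
    "TEACHER: (40s, Vietnamese) sharp charcoal grey suit, crisp white shirt, "
    "black opaque sunglasses that hide his eyes."
)
SETTING = (
    "CLASSROOM: an old, decaying room lit by a single flickering "
    "fluorescent tube, walls covered in faded ghost diagrams."
)

# Per-scene parts: only action, camera shot, and dialogue change.
scenes = [
    {
        "action": "TEACHER stands behind the worn wooden lectern.",
        "shot": "WIDE SHOT: the whole classroom, students scattered at old desks.",
        "dialogue": "Welcome to the class. First lesson: do not scream.",
    },
    {
        "action": "TEACHER steps out in front of the desks and mimes tickling the air.",
        "shot": "CLOSE-UP: the TEACHER's face from the nose down.",
        "dialogue": "Very simple! Tickle it right back!",
    },
]


def scene_prompt(scene: dict[str, str]) -> str:
    """Combine the fixed blocks with one scene's variable parts."""
    return "\n\n".join([
        "NO CAPTIONS OR TEXT.",
        CHARACTER,
        SETTING,
        scene["action"],
        scene["shot"],
        f'DIALOGUE: TEACHER (translate to Vietnamese and say): "{scene["dialogue"]}"',
    ])


prompts = [scene_prompt(s) for s in scenes]
for number, text in enumerate(prompts, start=1):
    print(f"--- Scene {number} ---\n{text}\n")
```

Because the shared text is never retyped, small wording drift between scenes (a common cause of the character changing) is avoided.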

Then:

Generate each video separately
Stitch them together in Flow or in video-editing software

This approach gives you:

A longer video
A more stable character
Content that flows like a short film
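For the stitching step, if you would rather not open an editor, one alternative is ffmpeg's concat demuxer. This is a substitute I am suggesting, not something Flow provides. The sketch below only builds the list file content and the command; the file names are hypothetical, and `-c copy` assumes all clips share the same codec and resolution, which you should verify for your own downloads.

```python
import shlex
from pathlib import Path


def concat_list(clips: list[str]) -> str:
    """Return the text of an ffmpeg concat list: one `file '...'` line per clip."""
    return "".join(f"file '{Path(clip).as_posix()}'\n" for clip in clips)


def ffmpeg_command(list_file: str, output: str) -> list[str]:
    """Build the ffmpeg call that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output]


# Hypothetical file names for six downloaded 8-second clips.
clips = [f"scene_{i}.mp4" for i in range(1, 7)]
listing = concat_list(clips)
cmd = ffmpeg_command("scenes.txt", "final_video.mp4")

print(listing)
print(shlex.join(cmd))
```

To use it, save the printed list as `scenes.txt` next to the clips and run the printed command in a terminal where ffmpeg is installed.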

Real example: a 48-second video with 6 scenes made with Veo 3

Below is a video I made with Veo 3 from 6 different scenes, each about 8 seconds long, for a total of roughly 48 seconds, following the workflow described above.

👉 Each scene has its own prompt, but all of them share the same character and setting description to keep visuals and content as consistent as possible.

The full prompts are included right below for anyone who wants to test them or study them more closely.

Character & Setting Details (Recap):

TEACHER: (40s, Vietnamese) Impeccably dressed in a sharp, dark, well-fitted suit (charcoal grey) with a crisp white shirt. Always wears black, opaque sunglasses that completely hide his eyes. Neat hair. Carries an air of serious, academic authority mixed with a subtle, knowing smirk. Precise movements. Clear, commanding Southern Vietnamese accent.

CLASSROOM LOCATION DESCRIPTION: An old, somewhat decaying room in a Vietnamese building. Stuffy air smelling of old paper, dust, faint incense, and a hint of something chemical.

Lighting: Primarily lit by a single, bare, flickering fluorescent tube light (cool white) overhead, casting harsh, shifting shadows. Grimy, barred windows let in minimal, fading daylight.

Walls & Decor: Faded walls covered with pseudo-scientific diagrams of ghosts, faded occult symbols, crudely drawn illustrations of human-ghost encounters, and some out-of-place old human anatomical charts.

Props: Worn wooden lectern. Mannequin with fake ectoplasm. Dusty shelves with jars of “essences.” Stained chalkboard/whiteboard with bizarre formulas. Mismatched old wooden student desks.

Scene 1 Prompt

SPECS: Camera: ARRI Alexa Mini, Lens: Cooke S4/i 32mm.

NO CAPTIONS OR TEXT.

TEACHER (as described above) stands with imposing calm behind the worn wooden lectern, his posture erect as he surveys the sparsely filled room. CLASSROOM (as described above). The single fluorescent tube flickers erratically, casting long, uneasy shadows from the bizarre diagrams and the few students.

WIDE SHOT: Establishes the entire strange classroom: TEACHER at the lectern, the eerie wall decor harshly illuminated, and a few apprehensive STUDENTS scattered at old wooden desks. The sheet-draped mannequin is a silent, unsettling figure in a darker corner. MEDIUM CLOSE-UP: On the TEACHER. His face is partly in shadow due to the overhead light, his sunglasses reflecting the flickering room. A tiny, almost imperceptible smirk is present.

DIALOGUE: TEACHER (authoritative, clear Southern Vietnamese accent, translate to Vietnamese and say): “Welcome to ‘Survival When Encountering Supernatural Entities.’ First lesson: When a ghost scares you, absolutely do not scream.”

A loose ceiling tile visibly trembles, dislodging a small shower of dust. AUDIO: Persistent low electric hum and occasional sharp CRACKLE from the fluorescent light, the faint sound of falling dust, Teacher’s distinct voice. KEY ELEMENTS: Mysterious teacher, detailed eerie classroom, supernatural rules, unsettling atmosphere, strong opening.

Scene 2 Prompt

SPECS: Camera: RED Komodo, Lens: Zeiss Supreme Prime 29mm.

NO CAPTIONS OR TEXT.

TEACHER (as described above) pauses, letting his first rule sink in, then offers his peculiar reasoning. STUDENT A (20s, wearing a simple, slightly worn university jacket, looking genuinely anxious and pale) slowly raises a trembling hand. CLASSROOM (as described above). The flickering light seems to pulse, making the shadows writhe. One of the jars on a high shelf appears to subtly rattle.

MEDIUM SHOT: TEACHER leaning slightly over the lectern, his gloved hands (if wearing them, otherwise bare) pressing down on its surface as he explains. He then turns his head with deliberate slowness towards STUDENT A. CLOSE-UP: STUDENT A’s face, eyes wide with a mixture of fear and morbid curiosity. They gulp audibly before asking their question.

DIALOGUE: TEACHER (grave, Southern Vietnamese accent, translate to Vietnamese and say): “Why? Because it will make you… hoarse! Very bad for the vocal cords!” STUDENT A (timid, voice slightly shaky, translate to Vietnamese and say): “Teacher, if a ghost grabs my leg, what should I do?”

The fluorescent light emits a prolonged, louder BUZZ, then briefly dims before returning to its erratic flickering. AUDIO: Teacher’s emphatic voice, Student A’s hesitant voice, the distinct, prolonged BUZZ and dimming of the light fixture. KEY ELEMENTS: Dark humor, absurd logic, student interaction, building tension, consistent eerie environment, sensory details.

Scene 3 Prompt

SPECS: Camera: Sony Venice, Lens: Panavision Primo 40mm.

NO CAPTIONS OR TEXT.

TEACHER (as described above) straightens up from the lectern, a subtle shift in his posture suggesting he relishes this question. He steps out to the small open space before the desks. CLASSROOM (as described above). The shadows cast by the TEACHER elongate and distort dramatically as he moves. The mannequin in the corner seems, for a split second, to have its head tilted.

MEDIUM LONG SHOT: TEACHER, now center stage in the small clearing, addresses the class. He then begins his demonstration with surprisingly fluid and precise hand gestures, miming the act of tickling empty air with intense focus. CLOSE-UP: On the TEACHER’S face (from the nose down, sunglasses still prominent) showing the serious, almost scientific concentration he applies to the tickling mime.

DIALOGUE: TEACHER (assured, a hint of theatricality, Southern Vietnamese accent, translate to Vietnamese and say): “Very simple! Immediately… tickle it right back! Ghosts are ticklish just like anyone else! I guarantee it will let go at once and laugh its head off!” (Accompanies this with the vigorous, precise tickling mime).

A faint, dry, rustling sound, like laughter made of dead leaves, is heard from a dark corner of the room. AUDIO: Teacher’s confident and slightly playful voice, the swish of his suit fabric, the unsettling, dry rustling laughter. KEY ELEMENTS: Physical comedy, absurd solution, Teacher’s unwavering bizarre confidence, unsettling subtle sound, focused character action.

Scene 4 Prompt

SPECS: Camera: Canon C300 Mark III, Lens: Canon CN-E 35mm T1.5.

NO CAPTIONS OR TEXT.

STUDENT B (20s, dressed in a dark, faded band t-shirt, arms crossed initially, now leaning forward on their desk with a challenging glint in their eye) interjects. TEACHER (as described above) turns smoothly to face Student B, listening with polite, unwavering attention. CLASSROOM (as described above). The minimal light from the grimy windows is now almost non-existent. The room is increasingly dependent on the single, failing fluorescent bulb.

MEDIUM SHOT: STUDENT B delivering their question with a clear, skeptical tone. The TEACHER stands patiently, his silhouette framed against a particularly grotesque diagram on the wall. MEDIUM CLOSE-UP: TEACHER, offering a slow, deliberate nod to Student B. His sunglasses reflect the student’s challenging face. A slight, almost condescending smile touches his lips before he speaks.

DIALOGUE: STUDENT B (challenging, firm voice, translate to Vietnamese and say): “And what if the ghost appears with its face covered in blood, Teacher?” TEACHER (smooth, unperturbed, a tone of explaining something obvious to a child, Southern Vietnamese accent, translate to Vietnamese and say): “Ah, this case calls for tact. Gently ask: ‘Excuse me, which filter app are you using that looks so ‘real’? Please show me!’”

A distant, mournful howl (dog or something more ambiguous) echoes from outside the building. AUDIO: Student B’s challenging voice, Teacher’s smooth, condescendingly patient tone, the distant, mournful HOWL. KEY ELEMENTS: Student challenge, more satirical advice, escalating absurdity, Teacher’s unshakable composure, ominous external sounds.

Scene 5 Prompt

SPECS: Camera: Panasonic Lumix S1H, Lens: Leica SL 50mm f/1.4.

NO CAPTIONS OR TEXT.

TEACHER (as described above) pushes off lightly from the lectern he had momentarily leaned against, beginning a slow, deliberate pace across the front of the classroom, addressing all students. CLASSROOM (as described above). The room is now very dim. The flickering fluorescent light casts stark, moving shadows, making the eerie diagrams seem to writhe on the walls. The air feels colder.

MEDIUM SHOT: TEACHER pacing, his dark suit making him almost blend into the deeper shadows at the edge of the light’s reach, then re-emerging. He makes a sharp, decisive zigzag motion with his hand as he speaks. CLOSE-UP: On a student’s notebook, where they have shakily scrawled “CHẠY ZÍC ZẮC???” next to a crude drawing of a ghost.

DIALOGUE: TEACHER (voice now brisk and commanding, a shift in energy, Southern Vietnamese accent, translate to Vietnamese and say): “And remember, when a ghost chases you, do not run in a straight line! Run in a ‘zigzag’. The ghost gets dizzy and gives up right away!”

The building groans, a deep, structural sound, as if settling or under strain. AUDIO: Teacher’s firm, instructive voice, the sound of his footsteps on the old floorboards, the deep GROAN of the building. KEY ELEMENTS: Further absurd advice, building atmosphere, dynamic movement of Teacher, tangible student reaction (notebook), sense of environmental instability.

Scene 6 Prompt

SPECS: Camera: Blackmagic Pocket Cinema Camera 6K Pro, Lens: Sigma Cine 35mm T1.5.

NO CAPTIONS OR TEXT.

TEACHER (as described above) stops his pacing directly under the weakest point of the flickering fluorescent light. He clasps his hands behind his back, assuming a formal, almost final stance. CLASSROOM (as described above). The room is steeped in gloom. The faces of the students are pale and wide-eyed in the unsteady light. The mannequin seems to have its head turned directly towards the Teacher.

CLOSE-UP: On the TEACHER’S face, specifically his mouth and the lower rim of his sunglasses. His expression is serious, almost grave, as he delivers the homework. The failing light flickers intensely across his features. EXTREME CLOSE-UP: The filament inside the fluorescent tube sputtering violently, glowing erratically.

DIALOGUE: TEACHER (tone becoming slightly more conspiratorial, yet firm, Southern Vietnamese accent, translate to Vietnamese and say): “Homework: tonight, each of you turn off the lights and stay alone for 15 minutes. If anything ‘fun’ happens, come share the experience tomorrow.”

The fluorescent light emits a final, loud POP and ZAP, then DIES COMPLETELY, plunging the room into absolute darkness. A collective, sharp GASP from the students. AUDIO: Teacher’s distinct voice delivering the ominous homework, the loud POP and ZAP of the light, the collective student GASP, followed by sudden, heavy silence and perhaps a single, terrified whimper. KEY ELEMENTS: Ominous homework assignment, dramatic lighting failure, cliffhanger ending, heightened sensory impact (sound and sudden darkness), peak suspense.

Quality assessment and current limitations of Veo 3

As the example shows, even with very detailed prompts that push the model to produce different clips with the same character, Veo 3 still has some limitations:

The character's voice is not fully consistent across scenes
The face and expressions shift slightly with each render
Consistency is acceptable but not yet “perfect” for a long, seamless film

👉 This is a general limitation of pure text-to-video at present.

Comparison with using a reference frame

Currently, if you use the feature that takes a frame from one video as the reference for the videos generated next (reference image / frame):

✅ Character and style are far more consistent
❌ But it is supported only in Veo 2, with no audio
❌ Veo 3 did not support reference frames at the time of writing

However, this is still a beta, so Google will very likely update it soon to allow:

Using a reference frame with Veo 3
Keeping both character and voice consistent
Creating long, multi-scene videos with dialogue more easily

👉 At that point, making short films, TVCs, and storytelling videos with AI will be much less work.

Summary: the most effective workflow today for long videos with Veo 3

In short, to turn an idea into a video that matches your intent with Veo 3, use the following workflow:

Build a standard prompt template (cinematic, with camera angles, dialogue, and audio)
Use Gemini or ChatGPT to generate the prompt automatically from a short idea
Split the script into multiple scenes of about 8 seconds each
Reuse the same character + setting description in every prompt
Render each scene in Flow with Veo 3
Stitch the clips together in CapCut / Premiere / DaVinci Resolve

👉 With this approach, you can:

Make videos 30–60 seconds long
Control camera angles and storytelling pace
Create storytelling, educational, advertising, and viral TikTok content
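As a rough planning aid, the arithmetic behind these durations can be sketched in Python. It uses the two figures given in this article (8 seconds per clip, about 1–3 minutes per render) and assumes clips are rendered one after another with no retries, so treat the render estimate as a lower bound in practice.

```python
import math

CLIP_SECONDS = 8         # clip length stated in this article
RENDER_MINUTES = (1, 3)  # per-clip render time range stated in this article


def plan(target_seconds: int) -> dict:
    """Estimate scene count, actual length, and render time for a target length."""
    scenes = math.ceil(target_seconds / CLIP_SECONDS)
    return {
        "scenes": scenes,
        "total_seconds": scenes * CLIP_SECONDS,
        "render_minutes": (scenes * RENDER_MINUTES[0], scenes * RENDER_MINUTES[1]),
    }


for target in (30, 48, 60):
    print(target, plan(target))
```

For example, a 30-second target needs 4 scenes (32 seconds of footage), the 48-second example needs 6, and a 60-second target needs 8.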
