# google-gemini-media
Multimodal Media (Image/Video/Speech) Skill — powered by SkillBoss API Hub
Convention: All API calls go through SkillBoss API Hub /v1/pilot, which automatically routes to the optimal underlying model. Authentication uses a single SKILLBOSS_API_KEY.
# No extra install needed for Node.js 18+
# For older environments you can use: npm install node-fetch
3.1 Authentication and environment variables
Put your API key in SKILLBOSS_API_KEY
All requests use Authorization: Bearer $SKILLBOSS_API_KEY
3.2 Shared helper function
All examples below use this shared pilot() helper:
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } 3.3 Unified handling of binary media outputs
Images: returned as result.image_url (URL) or result.images[0].url in the response.
Speech (TTS): returned as result.audio_url.
Video: returned as result.video_url; long-running tasks may require polling.
${API_BASE}/pilot, {method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const result = await pilot({
type: "image",
inputs: {
prompt: "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
},
prefer: "quality",
}); const imageUrl = result["result"]["image_url"]; console.log("Image URL:", imageUrl); // Download and save the image const imgResponse = await fetch(imageUrl); const buffer = Buffer.from(await imgResponse.arrayBuffer()); fs.writeFileSync("out.png", buffer); REST (curl) minimal template
curl -s -X POST "https://api.heybossai.com/v1/pilot" \
-H "Authorization: Bearer $SKILLBOSS_API_KEY"
-H "Content-Type: application/json"
-d '{
"type": "image",
"inputs": {
"prompt": "Create a picture of a nano banana dish in a fancy restaurant",
"aspect_ratio": "16:9"
},
"prefer": "quality"
}'
# Image URL is at: .result.image_url
5.2 Text-and-Image-to-Image (editing)
Use case: given an image, add/remove/modify elements, change style, color grading, etc.
Node.js minimal template
import * as fs from "node:fs";
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const imageBase64 = fs.readFileSync("input.png").toString("base64"); const result = await pilot({
type: "image",
inputs: {
prompt: "Add a nano banana on the table, keep lighting consistent, cinematic tone.",
image_data: imageBase64,
image_mime_type: "image/png",
},
prefer: "quality",
}); const imageUrl = result["result"]["image_url"]; const imgResponse = await fetch(imageUrl); const buffer = Buffer.from(await imgResponse.arrayBuffer()); fs.writeFileSync("edited.png", buffer); 5.3 Multi-turn image iteration Best practice: use multiple sequential calls with the previous output fed back as image_data for continuous iteration (e.g., generate first, then "only edit a specific region/element", then "make variants in the same style"). 5.4 Image generation controls Pass these in the inputs object:
aspect_ratio: e.g. "16:9", "1:1"
size: e.g. "1024x1024", "1024x576" (16:9)
${API_BASE}/pilot, {method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const imageBase64 = fs.readFileSync("image.jpg").toString("base64"); const result = await pilot({
type: "chat",
inputs: {
messages: [
{
role: "user",
content: [
{
type: "image_url",
image_url: { url: `data:image/jpeg;base64,${imageBase64}` },
}, {
type: "text",
text: "Caption this image, and list any visible brands.",
}, ], }, ], },
prefer: "balanced",
});
const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);
6.3 Image URL reference minimal template
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const result = await pilot({
type: "chat",
inputs: {
messages: [
{
role: "user",
content: [
{
type: "image_url",
image_url: { url: "https://example.com/image.jpg" },
}, { type: "text", text: "Caption this image." }, ], }, ], },
prefer: "balanced",
});
const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);
6.4 Multi-image prompts
Append multiple images as multiple entries in the content array; you can mix URLs and inline Base64 bytes.
7. Video generation
7.1 Core features (must know)
Generates high-fidelity short video (default ~8 seconds), supporting native audio generation (dialogue, ambience, SFX).
Supports aspect ratio (16:9 / 9:16), resolution control, and first/last frame guidance via inputs.
7.2 Node.js minimal template
import * as fs from "node:fs";
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const result = await pilot({
type: "video",
inputs: {
prompt: "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.",
duration: 8,
aspect_ratio: "16:9",
resolution: "1080p",
},
prefer: "quality",
}); const videoUrl = result["result"]["video_url"]; console.log("Video URL:", videoUrl); // Download and save const videoResponse = await fetch(videoUrl); const buffer = Buffer.from(await videoResponse.arrayBuffer()); fs.writeFileSync("out.mp4", buffer); 7.3 Common controls Pass these in the inputs object:
aspect_ratio: "16:9" or "9:16"
resolution: "720p" | "1080p" | "4k"
duration: duration in seconds (default 8)
When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language. 7.4 Important limits (engineering fallback needed) Latency can vary from seconds to minutes; implement timeouts and retries. Download the video promptly after generation. Retry with timeout pseudocode const deadline = Date.now() + 300_000; // 5 min let result = null; while (Date.now() < deadline) { try { result = await pilot({
type: "video",
inputs: { prompt: "...", duration: 8 },
prefer: "quality",
});
if (result["result"]["video_url"]) break;
} catch (e) {
await new Promise((resolve) => setTimeout(resolve, 5000));
}
}
if (!result) throw new Error("video generation timed out");
const videoUrl = result["result"]["video_url"];
8. Video understanding
8.1 Video input options
Video URL: for publicly accessible videos.
Base64 inline: for smaller files.
YouTube URL: can analyze public videos by passing the URL in the message.
8.2 Video URL minimal template
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const result = await pilot({
type: "chat",
inputs: {
messages: [
{
role: "user",
content: [
{
type: "video_url",
video_url: { url: "https://example.com/sample.mp4" },
}, {
type: "text",
text: "Summarize this video. Provide timestamps for key events.",
}, ], }, ], },
prefer: "balanced",
});
const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);
8.3 Timestamp prompting strategy
Ask for segmented bullets with "(mm:ss)" timestamps.
Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.
9. Speech generation (Text-to-Speech, TTS)
9.1 Positioning
For "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
SkillBoss API Hub auto-routes to the best TTS model for the given inputs.
9.2 Single-speaker TTS minimal template
import * as fs from "node:fs";
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const result = await pilot({
type: "tts",
inputs: {
text: "Say cheerfully: Have a wonderful day!",
voice: "Kore",
},
prefer: "balanced",
});
const audioUrl = result["result"]["audio_url"];
console.log("Audio URL:", audioUrl);
// Download and save
const audioResponse = await fetch(audioUrl);
const buffer = Buffer.from(await audioResponse.arrayBuffer());
fs.writeFileSync("out.mp3", buffer);
9.3 Multi-speaker TTS
Pass multiple text segments with speaker labels in the text field, using a structured format like "[Speaker1]: Hello\n[Speaker2]: Hi there".
9.4 Voice options and language
The voice field supports named voices (e.g., "alloy", "Kore", "Zephyr", "Puck").
Auto-detects input language; supports 24+ languages.
9.5 "Director notes" (strongly recommended for high-quality voice)
Prefix the text with style directions, e.g.: "Speak in a calm, professional tone: [your content here]".
10. Audio understanding
10.1 Typical tasks
Describe audio content (including non-speech like birds, alarms, etc.)
Generate transcripts
Transcribe specific time ranges
Estimate token/cost for long audio
10.2 Audio transcription (STT) minimal template
import * as fs from "node:fs";
import { Buffer } from "node:buffer";
const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.heybossai.com/v1";
async function pilot(body) {
const r = await fetch(${API_BASE}/pilot, {
method: "POST",
headers: {
"Authorization": Bearer ${SKILLBOSS_API_KEY},
"Content-Type": "application/json",
},
body: JSON.stringify(body),
}); return r.json(); } const audioB64 = fs.readFileSync("sample.mp3").toString("base64"); const result = await pilot({
type: "stt",
inputs: {
audio_data: audioB64,
filename: "sample.mp3",
}, }); const transcript = result["result"]["text"]; console.log(transcript); 10.3 Audio description via chat (for non-transcription tasks) const audioB64 = fs.readFileSync("sample.mp3").toString("base64"); const result = await pilot({
type: "chat",
inputs: {
messages: [
{
role: "user",
content: [
{
type: "audio_url",
audio_url: { url: `data:audio/mp3;base64,${audioB64}` },
}, { type: "text", text: "Describe this audio clip." }, ], }, ], },
prefer: "balanced",
}); const text = result["result"]["choices"][0]["message"]["content"]; console.log(text); 10.4 Key limits and engineering tips Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC. If the audio file is large, upload it to a publicly accessible URL first and pass the URL instead of Base64. For very long audio, consider splitting into segments. 11. End-to-end examples (composition) Example A: Image generation -> validation via understanding Generate product images via type: "image" (specify negative space and consistent lighting in the prompt). Use type: "chat" with image understanding for self-check: verify text clarity, brand spelling, and unsafe elements. If not satisfied, feed the generated image into editing and iterate. Example B: Video generation -> video understanding -> narration script Generate a short video with type: "video" (include dialogue or SFX in the prompt). Download and save the video. Use type: "chat" with video to produce a storyboard + timestamps + narration copy; then feed the copy to type: "tts". Example C: Audio understanding -> transcription -> TTS redub Upload meeting audio and transcribe with type: "stt". Use type: "chat" to summarize or extract specific time ranges. Use type: "tts" to generate a "broadcast" version of the summary. 12. Compliance and risk (must follow) Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content. Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content. 13. Quick reference (Checklist) Set SKILLBOSS_API_KEY environment variable. Pick the right type: "image" for image generation, "chat" for understanding tasks, "video" for video generation, "tts" for speech, "stt" for transcription. Use prefer: "quality" for best results, "balanced" for cost efficiency. Parse responses correctly: images → result.image_url; audio → result.audio_url; video → result.video_url; chat → result.choices[0].message.content; stt → result.text. For video generation: set aspect_ratio / resolution, and download promptly. For TTS: pass voice name in inputs; use director-style prefix for tone control. For large audio/video files: encode to Base64 or host at a URL first.
Join 80,000+ one-person companies automating with AI