README.md · gemma-4-26B-A4B-it-AWQ-4bit

---
base_model: google/gemma-4-26B-A4B-it
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
---

<div align="center">
<img src="https://huggingface.co/buckets/cyankiwi/activation-aware-2.0/resolve/banner/cyankiwi-banner-awq-0.png">
</div>

<div align="left">
<table align="center" style="border-collapse:collapse; border:none;">
<tr style="border:none;">
<td align="right" style="border:none; padding:4px 12px 4px 0;"><b>Version</b></td>
<td align="left" style="border:none; padding:4px 0;">26.05.01</td>
</tr>
<tr style="border:none;">
<td align="right" style="border:none; padding:4px 12px 4px 0;"><b>Calibration</b></td>
<td align="left" style="border:none; padding:4px 0;">
<a href="https://huggingface.co/datasets/cyankiwi/calibration" target="_blank">STEM and Agentic</a>
</td>
</tr>
<tr style="border:none;">
<td align="right" style="border:none; padding:4px 12px 4px 0;"><b>Languages</b></td>
<td align="left" style="border:none; padding:4px 0;">
<code>EN</code> <code>ZH</code> <code>HI</code> <code>AR</code> <code>RU</code>
<code>JA</code> <code>KO</code> <code>NL</code> <code>FR</code> <code>ES</code>
</td>
</tr>
<tr style="border:none;">
<td align="right" style="border:none; padding:4px 12px 4px 0;"><b>Model Size</b></td>
<td align="left" style="border:none; padding:4px 0;">16.01 GB</td>
</tr>
<tr style="border:none;">
<td align="right" style="border:none; padding:4px 12px 4px 0;"><b>Contact</b></td>
<td align="left" style="border:none; padding:4px 0;">
<a href="mailto:ton@cyan.kiwi">Email</a>
</td>
</tr>
</table>
</div>

---

<div align="center">
<img src="https://ai.google.dev/gemma/images/gemma4_banner.png">
</div>

<p align="center">
<a href="https://huggingface.co/collections/google/gemma-4" target="_blank">Hugging Face</a> |
<a href="https://github.com/google-gemma" target="_blank">GitHub</a> |
<a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank">Launch Blog</a> |
<a href="https://ai.google.dev/gemma/docs/core" target="_blank">Documentation</a>
<br>
<b>License</b>: <a href="https://ai.google.dev/gemma/docs/gemma_4_license" target="_blank">Apache 2.0</a> | <b>Authors</b>: <a href="https://deepmind.google/models/gemma/" target="_blank">Google DeepMind</a>
</p>

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key **capability and architectural advancements**:

* **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

* **Extended Multimodality** – Processes text and images with variable aspect ratio and resolution (all models), video, and audio (natively on the E2B and E4B models only).

* **Diverse & Efficient Architectures** – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.

* **Optimized for On-Device** – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.

* **Increased Context Window** – The small models feature a 128K context window, while the medium models support 256K.

* **Enhanced Coding & Agentic Capabilities** – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.

* **Native System Prompt Support** – Gemma 4 introduces native support for the `system` role, enabling more structured and controllable conversations.

## **Models Overview**

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
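To make the difference between the two layer types concrete, here is a minimal sketch that builds a causal sliding-window mask and a causal global mask in plain Python. The function name `causal_mask`, the toy sequence length, and the toy window size are illustrative assumptions, not Gemma 4's implementation, and the ratio of local to global layers is not stated in this card.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j] == True when query position i may attend to key j.

    window=None gives full (global) causal attention; an integer restricts
    each query to its `window` most recent positions, itself included.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


global_mask = causal_mask(6)           # every earlier token is visible
local_mask = causal_mask(6, window=3)  # only the last 3 tokens are visible

# Number of keys visible to the final query position under each mask
print(sum(global_mask[5]), sum(local_mask[5]))
```

In a local layer the number of visible keys per query is bounded by the window, so its attention cost and cache stay constant as the context grows. Only the global layers pay for the full sequence.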

### Dense Models

| Property | E2B | E4B | 31B Dense |
| :---- | :---- | :---- | :---- |
| **Total Parameters** | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 30.7B |
| **Layers** | 35 | 42 | 60 |
| **Vocabulary Size** | 262K | 262K | 262K |
| **Vision Encoder Parameters** | *~150M* | *~150M* | *~550M* |
| **Audio Encoder Parameters** | *~300M* | *~300M* | No Audio |

The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.
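As a quick worked example, the snippet below computes how much of each small model's total parameter count sits in embedding tables. It assumes that "total minus effective" equals the embedding parameters, which is how the "with embeddings" wording in the table reads. The helper name `embedding_share` is made up for illustration.

```python
# (effective, total including embeddings) in billions of parameters,
# copied from the Dense Models table
SIZES = {"E2B": (2.3, 5.1), "E4B": (4.5, 8.0)}


def embedding_share(effective, total):
    """Fraction of all parameters that sit in embedding tables (assumed)."""
    return (total - effective) / total


for name, (effective, total) in SIZES.items():
    share = embedding_share(effective, total)
    print(f"{name}: {share:.0%} of parameters are embeddings")
```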

### Mixture-of-Experts (MoE) Model

| Property | 26B A4B MoE |
| :---- | :---- |
| **Total Parameters** | 25.2B |
| **Active Parameters** | 3.8B |
| **Layers** | 30 |
| **Sliding Window** | 1024 tokens |
| **Context Length** | 256K tokens |
| **Vocabulary Size** | 262K |
| **Expert Count** | 8 active / 128 total and 1 shared |
| **Supported Modalities** | Text, Image |
| **Vision Encoder Parameters** | *~550M* |

The "A" in 26B A4B stands for "active parameters" in contrast to the total number of parameters the model contains. By only activating a 4B subset of parameters during inference, the Mixture-of-Experts model runs much faster than its 26B total might suggest. This makes it an excellent choice for fast inference compared to the dense 31B model since it runs almost as fast as a 4B-parameter model.
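The idea of activating only a subset of experts can be sketched as top-k routing. The example below picks 8 of 128 experts for one token and normalizes their weights. The expert counts come from the table. The `route` function, the random router logits, and the softmax-over-selected-experts normalization are assumptions for illustration, since this card does not describe the actual router.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, from the table
TOP_K = 8          # experts activated per token, from the table


def route(router_logits, top_k=TOP_K):
    """Select the top_k experts and softmax-normalize their logits."""
    chosen = sorted(
        range(len(router_logits)),
        key=lambda e: router_logits[e],
        reverse=True,
    )[:top_k]
    peak = max(router_logits[e] for e in chosen)  # for numerical stability
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: value / total for e, value in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
weights = route(logits)

print(f"{len(weights)} of {NUM_EXPERTS} routed experts run for this token")
```

Only the selected experts (plus the shared expert, which always runs) contribute compute for a given token, which is why the active parameter count is far below the total.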

## **Benchmark Results**

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked in the table are for instruction-tuned models.

| Benchmark | | | | | |
| :---- | :---- | :---- | :---- | :---- | :---- |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 no tools | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| Tau2 (average over 3) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| HLE no tools | 19.5% | 8.7% | - | - | - |
| HLE with search | 26.5% | 17.2% | - | - | - |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| MMMLU | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
| **Vision** | | | | | |
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| OmniDocBench 1.5 (average edit distance, lower is better) | 0.131 | 0.149 | 0.181 | 0.290 | 0.365 |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| MedXPertQA MM | 61.3% | 58.1% | 28.7% | 23.5% | - |
| **Audio** | | | | | |
| CoVoST | - | - | 35.54 | 33.47 | - |
| FLEURS (lower is better) | - | - | 0.08 | 0.09 | - |
| **Long Context** | | | | | |
| MRCR v2 8 needle 128k (average) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |

## **Core Capabilities**

Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:

* **Thinking** – Built-in reasoning mode that lets the model think step-by-step before answering.
* **Long Context** – Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
* **Image Understanding** – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
* **Video Understanding** – Analyze video by processing sequences of frames.
* **Interleaved Multimodal Input** – Freely mix text and images in any order within a single prompt.
* **Function Calling** – Native support for structured tool use, enabling agentic workflows.
* **Coding** – Code generation, completion, and correction.
* **Multilingual** – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
* **Audio** (E2B and E4B only) – Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.

## Getting Started

You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:

`pip install -U transformers torch accelerate`

Once you have everything installed, you can proceed to load the model with the code below:

```python
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-26B-A4B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
```

Once the model is loaded, you can start generating output:

```python
# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```

To enable reasoning, set `enable_thinking=True` and the `parse_response` function will take care of parsing the thinking output.

Below, you will also find snippets for processing audio (E2B and E4B only), images, and video alongside text:

<details>
<summary>Code for processing Audio</summary>

Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process audio. To use it, make sure to install the following packages:

`pip install -U transformers torch librosa accelerate`

You can then load the model with the code below:

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
```

Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:

```python
# Prompt - add audio before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```

</details>

<details>
<summary>Code for processing Images</summary>

Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process images. To use it, make sure to install the following packages:

`pip install -U transformers torch torchvision accelerate`

You can then load the model with the code below:

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-26B-A4B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
```

Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:

```python
# Prompt - add image before text
messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```

</details>

<details>
<summary>Code for processing Videos</summary>

Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process videos. To use it, make sure to install the following packages:

`pip install -U transformers torch torchvision torchcodec librosa accelerate`

You can then load the model with the code below:

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-26B-A4B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
```

Once the model is loaded, you can start generating output by directly referencing the video URL in the prompt:

```python
# Prompt - add video before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
            {"type": "text", "text": "Describe this video."}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```

</details>

## **Best Practices**

For the best performance, use these configurations and best practices:

### 1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

* `temperature=1.0`

* `top_p=0.95`

* `top_k=64`
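To show what these three settings do to a next-token distribution, here is a minimal pure-Python sketch that applies temperature scaling, then top-k, then top-p (nucleus) filtering. The function `filter_distribution` and the toy logits are illustrative assumptions. Inference libraries implement this internally and may differ in details such as tie-breaking.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k, then top-p."""
    # Temperature-scaled softmax (shifted by the max for numerical stability)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [math.log(p) for p in (0.6, 0.3, 0.08, 0.02)]
print(filter_distribution(toy_logits))  # the least likely token is dropped by top-p
```

With Transformers, these values correspond to the `temperature`, `top_p`, and `top_k` arguments of `generate`, used together with `do_sample=True`.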

### 2. Thinking Mode Configuration

Compared to Gemma 3, the models use standard `system`, `assistant`, and `user` roles. To properly manage the thinking process, use the following control tokens:

* **Trigger Thinking:** Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
* **Standard Generation:** When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
  `<|channel>thought\n`**[Internal reasoning]**`<channel|>`
* **Disabled Thinking Behavior:** For all models except for the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block:
  `<|channel>thought\n<channel|>`**[Final answer]**

> [!Note]
> Note that many libraries like Transformers and llama.cpp handle the complexities of the chat template for you.
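If you build prompts or parse raw output by hand, the control tokens above can be handled with a couple of small helpers. This is a minimal sketch: `set_thinking` and `split_thought` are hypothetical names, and placing the token directly in front of the system text with no separator is an assumption. When using Transformers, prefer `enable_thinking` and `parse_response` as shown in Getting Started.

```python
THINK_TOKEN = "<|think|>"
THOUGHT_OPEN = "<|channel>thought\n"
THOUGHT_CLOSE = "<channel|>"


def set_thinking(system_prompt, enabled):
    """Add or remove the thinking trigger at the start of a system prompt."""
    if system_prompt.startswith(THINK_TOKEN):
        system_prompt = system_prompt[len(THINK_TOKEN):]
    return THINK_TOKEN + system_prompt if enabled else system_prompt


def split_thought(output):
    """Split raw model text into (thought, answer) using the documented tags."""
    if output.startswith(THOUGHT_OPEN) and THOUGHT_CLOSE in output:
        thought, answer = output[len(THOUGHT_OPEN):].split(THOUGHT_CLOSE, 1)
        return thought, answer
    return "", output  # no thought block present


print(set_thinking("You are a helpful assistant.", enabled=True))
print(split_thought("<|channel>thought\nCount the items first.<channel|>There are 3."))
```

An empty thought block, as produced by the larger models when thinking is disabled, yields an empty `thought` string and the full final answer.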

### 3. Multi-Turn Conversations

* **No Thinking Content in History**: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must *not be added* before the next user turn begins.
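One way to follow this rule when you keep raw model output in your conversation log is to strip thought blocks before the next turn. The sketch below assumes assistant turns are stored as plain strings containing the `<|channel>thought\n…<channel|>` structure described in the previous section. `clean_history` is a hypothetical helper name.

```python
import re

# Matches one thought block, including an empty one
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def clean_history(messages):
    """Return a copy of messages with thought blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        content = message["content"]
        if message["role"] == "assistant" and isinstance(content, str):
            content = THOUGHT_BLOCK.sub("", content).strip()
        cleaned.append({**message, "content": content})
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<|channel>thought\nSimple sum.<channel|>4"},
    {"role": "user", "content": "And doubled?"},
]
print(clean_history(history)[1]["content"])
```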

### 4. Modality order

* For optimal performance with multimodal inputs, place image and/or audio content **before** the text in your prompt.
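For simple prompts with a single question, the ordering can be enforced with a small helper before calling `apply_chat_template`. This is a sketch: `media_first` is a hypothetical name, and treating `video` parts like images and audio is an assumption based on the video example in this card.

```python
MEDIA_TYPES = {"image", "audio", "video"}


def media_first(content):
    """Stable reorder: media parts first, then everything else, order kept."""
    media = [part for part in content if part["type"] in MEDIA_TYPES]
    rest = [part for part in content if part["type"] not in MEDIA_TYPES]
    return media + rest


content = [
    {"type": "text", "text": "What is shown in this image?"},
    {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
]
print([part["type"] for part in media_first(content)])
```

Do not apply this to deliberately interleaved prompts, where the position of each image relative to the text carries meaning.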

### 5. Variable Image Resolution

Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.

* The supported token budgets are: **70**, **140**, **280**, **560**, and **1120**.
  * Use *lower budgets* for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
  * Use *higher budgets* for tasks like OCR, document parsing, or reading small text.
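When choosing a budget programmatically, a helper like the one below can snap a desired value to the supported set and estimate the visual token cost of a multi-frame input. Both function names are hypothetical. The estimate assumes each image or frame uses at most its budget, and how the budget is passed to the processor is not covered here.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def pick_budget(requested):
    """Return the smallest supported budget >= requested, capped at the maximum."""
    for budget in SUPPORTED_BUDGETS:
        if budget >= requested:
            return budget
    return SUPPORTED_BUDGETS[-1]


def max_visual_tokens(num_images, budget):
    """Rough upper bound on visual tokens for num_images images or frames."""
    return num_images * budget


print(pick_budget(100))                      # snaps up to the next supported value
print(max_visual_tokens(60, pick_budget(70)))  # e.g. 60 video frames at the lowest budget
```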

### 6. Audio

Use the following prompt structures for audio processing:

* **Automatic Speech Recognition (ASR)**

```text
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
```

* **Automatic Speech Translation (AST)**

```text
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
```
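The two templates can be filled in with ordinary string formatting. The sketch below stores them as Python constants with lowercase placeholder names. `asr_prompt` and `ast_prompt` are hypothetical helper names, and the resulting string goes into the `text` part of the user message, after the audio part.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n"
    "\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)

AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it "
    "into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, "
    "then one newline, then output the string '{target}: ', then the "
    "translation in {target}."
)


def asr_prompt(language):
    """Fill the speech recognition template for one language."""
    return ASR_TEMPLATE.format(language=language)


def ast_prompt(source, target):
    """Fill the speech translation template for a language pair."""
    return AST_TEMPLATE.format(source=source, target=target)


print(ast_prompt("Spanish", "English"))
```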

### 7. Audio and Video Length

All models support image inputs and can process videos as frames whereas the E2B and E4B models also support audio inputs. Audio supports a maximum length of 30 seconds. Video supports a maximum of 60 seconds assuming the images are processed at one frame per second.
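One possible way to stay within these limits is to cap video sampling at 60 frames and split longer audio into windows of at most 30 seconds. The sketch below only computes timestamps and window boundaries, so decoding is left to your media library. The helper names are hypothetical, and reading the video limit as "60 frames at one frame per second" is an interpretation of the text above.

```python
MAX_AUDIO_SECONDS = 30
MAX_VIDEO_FRAMES = 60  # 60 seconds at one frame per second


def frame_timestamps(duration_s, fps=1.0):
    """Timestamps (in seconds) to sample, truncated to the frame limit."""
    count = min(int(duration_s * fps), MAX_VIDEO_FRAMES)
    return [i / fps for i in range(count)]


def audio_windows(duration_s, window_s=MAX_AUDIO_SECONDS):
    """Split an audio duration into (start, end) windows of at most window_s."""
    windows, start = [], 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


print(len(frame_timestamps(90)))  # a 90-second clip is truncated to the frame limit
print(audio_windows(70))          # a 70-second recording becomes three windows
```

Windowing audio this way cuts at fixed offsets and can split words. Cutting at silences is a common refinement but is outside the scope of this sketch.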

## **Model Data**

Data used for model training and how the data was processed.

### **Training Dataset**

Our pre-training dataset is a large-scale, diverse collection of data spanning a wide range of domains and modalities, including web documents, code, images, and audio, with a cutoff date of January 2025. Here are the key components:

* **Web Documents**: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
* **Code**: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
* **Mathematics**: Training on mathematical text helps the model learn logical reasoning and symbolic representation, and address mathematical queries.
* **Images**: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

### **Data Preprocessing**

Here are the key data cleaning and filtering methods applied to the training data:

* **CSAM Filtering**: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
* **Sensitive Data Filtering**: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
* **Additional methods**: Filtering based on content quality and safety in line with [our policies](https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf).

## **Ethics and Safety**

As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.

### **Evaluation Approach**

Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with [Google’s AI principles](https://ai.google/principles/), as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:

* Content related to child sexual abuse material and exploitation
* Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
* Sexually explicit content
* Hate speech (e.g., dehumanizing members of protected groups)
* Harassment (e.g., encouraging violence against people)

### **Evaluation Results**

Across all areas of safety testing, we saw major improvements in every category of content safety relative to previous Gemma models. Overall, Gemma 4 models significantly outperform Gemma 3 and 3n models on safety, while keeping unjustified refusals low. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance.

## **Usage and Limitations**

These models have certain limitations that users should be aware of.

### **Intended Usage**

Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

* **Content Creation and Communication**
  * **Text Generation**: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  * **Chatbots and Conversational AI**: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  * **Text Summarization**: Generate concise summaries of a text corpus, research papers, or reports.
  * **Image Data Extraction**: These models can be used to extract, interpret, and summarize visual data for text communications.
  * **Audio Processing and Interaction**: The smaller models (E2B and E4B) can analyze and interpret audio inputs, enabling voice-driven interactions and transcriptions.
* **Research and Education**
  * **Natural Language Processing (NLP) and VLM Research**: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  * **Language Learning Tools**: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  * **Knowledge Exploration**: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

### **Limitations**

* **Training Data**
  * The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  * The scope of the training dataset determines the subject areas the model can handle effectively.
* **Context and Task Complexity**
  * Models perform well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  * A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
* **Language Ambiguity and Nuance**
  * Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
* **Factual Accuracy**
  * Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
* **Common Sense**
  * Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

### **Ethical Considerations and Risks**

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

* **Bias and Fairness**
  * VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. Gemma 4 models underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
* **Misinformation and Misuse**
  * VLMs can be misused to generate text that is false, misleading, or harmful.
  * Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
* **Transparency and Accountability**
  * This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  * A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

**Risks identified and mitigations**:

* **Generation of harmful content**: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
* **Misuse for malicious purposes**: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
* **Privacy violations**: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
* **Perpetuation of biases**: Developers are encouraged to perform continuous monitoring (using evaluation metrics and human review) and to explore de-biasing techniques during model training, fine-tuning, and other use cases.

### **Benefits**

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.