Image generated by DALL-E in response to a request for a photorealistic image of a fox standing in a grassy landscape.

Firefox 130 introduces automatic alt text for PDF images and an improved alt text flow. In addition to protecting users’ privacy with a small language model that runs locally on their device, these improvements help ensure that more images receive alt text, resulting in more accessible PDFs.

You can read more about our work on the model in the earlier Hacks blog post, Experimenting with local alt text generation in Firefox.

The work on the model happens outside of the mozilla-central code base, but as with the rest of the Firefox code, we want to keep the process open to our community. The language models used in our product are just weight files, and we want to ensure the Mozilla community understands how they were built and can help improve them. The Open Source AI Definition from the OSI is a work in progress, and our long-term aspiration is to follow the OSI’s guidelines for our local models.

Here’s how you can contribute to improving the model and helping with the accessibility of PDF documents. 

What can be improved?

The first version of the model is a work in progress, and it will make mistakes, especially when describing complex images. This is why we designed the feature to:

  • Encourage human review so that our users can correct inaccuracies and include any missing details before saving the alt text.
  • Set expectations for users interacting with PDFs that contain automatically generated alt text:

    • When you see the “This alt text was created automatically” message below the text box in the alt text editor, you’ll know that the alt text was generated using our model.
    • Anyone reading the PDF outside of the Firefox editor will see a disclaimer before the alt text, so that people reading it with a screen reader or directly in the PDF know that it was not human-generated. For example: “Created automatically: [alt text description will go here]”.

We hope to improve the model over time, and, as with Firefox’s source code, anyone interested in helping us refine it is welcome to contribute. You don’t have to be an AI expert – but if you are an expert and spot specific areas of improvement, we’d love to hear from you.

You can contribute by adding a new issue to our repository and choosing a topic from the issue templates:

  • Model architecture
  • Training data
  • Training code

Here’s some information to help you file an issue under one of these topics:

Model architecture

Our vision encoder-decoder model has 180M parameters and is based on two pre-trained models: a ViT image encoder and a distilled GPT-2 text decoder.

The ViT model was pre-trained on millions of images labeled with the ImageNet-21k classes, which use 21,000 words from the WordNet hierarchy to identify objects in images.

The text decoder is a distilled version of the GPT-2 model – distillation transfers knowledge from a larger model to a smaller one with minimal accuracy loss, which makes it a good trade-off between size and accuracy. Additionally, we built a ~800-word stop list to avoid generating profanity.

The model was quantized by converting float32 weights to int8, allowing us to shrink its size on disk to ~180MB and speed up inference time in the browser.
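
To make the architecture more concrete, here is a minimal sketch of how a ViT encoder and a distilled GPT-2 decoder can be paired with Hugging Face Transformers. The checkpoint names are illustrative assumptions, not necessarily the exact ones used for the Firefox model.

```python
# Minimal sketch: pairing a ViT image encoder with a DistilGPT-2 text decoder.
# Checkpoint names are assumptions for illustration, not Mozilla's exact setup.
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# ViT encoder pre-trained on ImageNet-21k, DistilGPT-2 as the text decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # image encoder
    "distilgpt2",                         # distilled GPT-2 decoder
)

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# GPT-2 has no pad token by default; reuse EOS for padding during training.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

print(f"parameters: {model.num_parameters() / 1e6:.0f}M")  # in the ~180M range
```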

There are many other architectures that could have been used for this job, or different quantization levels. If you believe there is a better combination, we’d love to try it.

The constraints are:

  • Everything needs to be open source under a permissive license such as the Apache License 2.0 (APLv2).
  • The model needs to be converted into ONNX using optimum.
  • The model needs to work in Transformers.js.
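
As a rough illustration of the last two constraints, here is a hedged sketch of exporting a fine-tuned checkpoint to ONNX and quantizing it to int8 with optimum’s ONNX Runtime tooling. Paths are placeholders, and the exported file names may differ across optimum versions.

```python
# Hedged sketch: export a fine-tuned checkpoint to ONNX with optimum, then
# quantize the weights to int8. Paths and file names are placeholders.
from optimum.onnxruntime import ORTModelForVision2Seq, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the vision encoder-decoder checkpoint to ONNX so it can be loaded
# from Transformers.js in the browser.
onnx_model = ORTModelForVision2Seq.from_pretrained(
    "path/to/finetuned-alt-text-model",  # hypothetical local checkpoint
    export=True,
)
onnx_model.save_pretrained("alt-text-onnx")

# Dynamic int8 quantization (float32 -> int8) for each exported ONNX file.
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
for file_name in ("encoder_model.onnx", "decoder_model.onnx"):
    quantizer = ORTQuantizer.from_pretrained("alt-text-onnx", file_name=file_name)
    quantizer.quantize(save_dir="alt-text-onnx-int8", quantization_config=qconfig)
```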

Training data

To train our model, we initially used the COCO and Flickr30k datasets and eventually adapted them to remove some of the annotator biases we found along the way:

  • Some annotators use gender-specific descriptions. People in an image may be described as a man or a woman, which can lead to the model misgendering people. For instance, a person on a skateboard is almost always described as a man. Similar problems exist with age-specific terms (e.g., man, boy, etc.). 
  • Some descriptions may also use less-than-inclusive language or be culturally or personally offensive in some rare cases. For instance, we have spotted annotations that were only acceptable for use by and within specific demographics, were replaced in common speech by other terms decades ago, or imposed a reductive value (e.g., sexy).

To deal with these issues, we rewrote the annotations with GPT-4o, using a prompt that asks for a short image description. You can find that code here, and the transformed datasets are published on Hugging Face: Mozilla/flickr30k-transformed-captions-gpt4o and Mozilla/coco-gpt4o. You can read more about our process here.
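
For context, the rewriting step can be reproduced roughly as follows with the OpenAI Python client; the prompt wording below is an illustrative stand-in, not the exact prompt from the repository linked above.

```python
# Hedged sketch of asking GPT-4o for a short image description with the
# OpenAI Python client. The prompt wording here is illustrative only.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def describe_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=60,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Write a short, neutral description of this image. "
                          "Do not guess gender, age, or other attributes that "
                          "cannot be verified from the image.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

print(describe_image("example.png"))  # hypothetical local file
```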

Training our model with these new annotations greatly improved the results; however, we still detected some class imbalance – some types of images, like transportation, are underrepresented, while others are overrepresented, like… cats. To address this, we created a new complementary dataset using Pexels, with this script and GPT-4o annotations. You can find it at Mozilla/pexels-gpt4o.
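
As a simple illustration of how such imbalance can be spotted, the sketch below counts keyword occurrences over a list of captions; hooking it up to the published datasets depends on their column names, which we leave out here.

```python
# Rough sketch: spot class imbalance by counting category keywords in captions.
# The keyword buckets and sample captions are illustrative only.
from collections import Counter

def keyword_counts(captions, keywords):
    counts = Counter()
    for caption in captions:
        lower = caption.lower()
        for word in keywords:
            if word in lower:
                counts[word] += 1
    return counts

sample_captions = [
    "A cat sleeping on a windowsill.",
    "A cat playing with a ball of yarn.",
    "A bus stopped at a crosswalk.",
]
print(keyword_counts(sample_captions, ["cat", "dog", "bus", "train", "bicycle"]))
# Counter({'cat': 2, 'bus': 1}) – "cat" dominates, "train" never appears.
```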

We know this is still insufficient, so if you would like to help us improve our datasets, here’s what you can do:

  • If you used the feature and detected a poorly described image, send it to us so we can add it to our training datasets.
  • Create a dataset on HuggingFace to fix one or more specific class imbalances.
  • Create a dataset on HuggingFace to simply add more diverse, high-quality data.

We ask that the datasets contain the following fields:

  • Image: the image in PNG format, with a maximum width or height of 700 pixels.
  • Source: the source of the image.
  • License: the license of the image. Please ensure the images you’re adding have public domain or public-domain-equivalent licenses, so they can be used for training without infringing on the rights of copyright holders. 

This will allow us to automatically generate descriptions for the images using our prompt and create a new dataset that we will include in the training loop.
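
If you want to package a contribution yourself, here is a minimal sketch, under assumptions about column names and file locations, of building a dataset with those three fields and the 700-pixel size cap using the Hugging Face datasets library.

```python
# Minimal sketch: package a contributed dataset with the requested fields
# (image, source, license) and a 700-pixel size cap. File paths, URLs, and
# the repository name are hypothetical.
from datasets import Dataset, Features, Image, Value
from PIL import Image as PILImage

def shrink_to_png(path, max_side=700):
    """Resize so the longest side is at most max_side pixels and save as PNG."""
    img = PILImage.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio
    out_path = path.rsplit(".", 1)[0] + "_small.png"
    img.save(out_path, format="PNG")
    return out_path

records = {
    "image": [shrink_to_png("photos/fox.jpg")],      # hypothetical local file
    "source": ["https://www.pexels.com/photo/..."],  # where the image came from
    "license": ["CC0"],                              # public-domain-equivalent
}

features = Features({"image": Image(), "source": Value("string"), "license": Value("string")})
dataset = Dataset.from_dict(records, features=features)
# dataset.push_to_hub("your-username/alt-text-images")  # requires a Hugging Face login
```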

Training code

To train the model, we are using Transformers’ Seq2SeqTrainer in a somewhat standard way (see more details here).
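
As a rough illustration of that setup, the sketch below wires a model into Seq2SeqTrainer; the hyperparameter values are placeholders rather than the ones from our training repository, and it assumes a model like the one sketched earlier plus a dataset already preprocessed into pixel_values/labels pairs.

```python
# Rough sketch of the training loop with Transformers' Seq2SeqTrainer.
# Hyperparameters are placeholders; `model` and `train_dataset` are assumed
# to come from earlier steps (see the architecture sketch above).
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

training_args = Seq2SeqTrainingArguments(
    output_dir="alt-text-training",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    fp16=True,                   # mixed precision if a GPU is available
    predict_with_generate=True,  # evaluate with generate() instead of logits only
    logging_steps=100,
    save_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,                  # e.g. the ViT + DistilGPT-2 model sketched above
    args=training_args,
    train_dataset=train_dataset,  # items with pixel_values and labels
    data_collator=default_data_collator,
)
trainer.train()
```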

Let us know if you spot a problem or find a potential improvement in the code or in our hyperparameters!