Machine Learning Engineer, Data Generation

DuckDuckGoose • Delft, NL • 21h ago

About DuckDuckGoose

DuckDuckGoose builds deepfake detection technology for a world where digital media can no longer be trusted by default. Our mission is to help people, companies, and governments determine whether what they see and hear online is authentic.

We are a fast-growing company based in the Netherlands, working with clients across the globe. Our team is international, technical, and mission-driven. We value curiosity, ownership, clear thinking, and a no-nonsense approach to building technology that works in the real world.

The role

Deepfake detection models are only as good as the data they learn from. In this role, you will build the synthetic media datasets and generation workflows that determine how well our models perform against real-world attacks.

We are looking for a Machine Learning Engineer, Data Generation, to work at the intersection of machine learning, data engineering, and synthetic media research.

Early on, your focus will be on generating high-quality and varied deepfake data, researching new manipulation methods, structuring datasets, and automating data workflows. You will own repeatable generation workflows end to end: research a method, generate data, validate quality, structure metadata, and hand it off for training or testing.

For example, you might compare three lip-sync generation methods, automate metadata extraction for generated videos, and package the results into a reproducible dataset for the ML team.

You do not need to be a deepfake expert yet, but you will become one. You do need to be technically strong: a capable programmer who can work independently with code, APIs, command-line tools, media files, and structured data. A good candidate may come from computer science, AI, data science, software engineering, or a related technical background, with a serious interest in machine learning and synthetic media.

What you will work on

Research, test, and compare deepfake generation methods, models, tools, APIs, web apps, and mobile applications.
Generate synthetic media across different manipulation types, quality levels, lighting conditions, angles, compression settings, environments, and identities.
Build structured datasets with clear labels, metadata, folder structures, versioning logic, and reproducible generation steps.
Write Python scripts to automate data generation, file organisation, metadata creation, quality checks, and repetitive data tasks.
Make generation workflows reproducible by tracking tools, settings, prompts, model versions, limitations, and failure cases.
Turn ambiguous or poorly documented tools into reliable internal workflows.
Identify low-quality, unrealistic, repetitive, biased, or otherwise unusable data before it affects training or evaluation.
Contribute to evaluation datasets and benchmark sets later as the role grows, while keeping the early focus on data generation and dataset quality.
Work carefully with sensitive data and follow internal privacy, security, and data-handling guidelines.

You might be a good fit if

You enjoy figuring out how technical tools work, even when documentation is incomplete.
You enjoy experimenting with generative AI to make possible what was previously impossible.
You like turning messy experiments into structured, reproducible datasets.
You are comfortable working at the boundary between software engineering, machine learning, and synthetic media.
You care about details, because small mistakes in labels, metadata, or dataset structure can lead to bad model behaviour.
You enjoy working through ambiguity and turning unclear technical workflows into something reliable.
You already have a strong technical base and want to apply it to a difficult, fast-moving AI problem.

This role is probably not the right fit if you are mainly looking for a pure model-training position, or if you prefer working only from well-defined instructions rather than exploring ambiguous technical tools and turning them into reliable workflows.

What we are looking for

Strong programming skills, preferably in Python.
A solid technical foundation, ideally through a degree in Computer Science, Artificial Intelligence, Data Science, Software Engineering, or equivalent practical experience.
Good understanding of core machine learning concepts such as training data, validation data, test data, labels, bias, overfitting, data leakage, and dataset quality.
Ability to work confidently with APIs, command-line tools, technical documentation, structured data, and large collections of media files.
Good working knowledge of Git.
Strong attention to detail when working with datasets, labels, metadata, and documentation.
Ability to debug problems independently and reason from first principles when tools or workflows do not work as expected.
Clear written and verbal communication in English, especially when documenting technical workflows and experimental results.
A curious, precise, structured, and security-conscious working style.

Bonus points for

Experience with image or video processing tools such as OpenCV, FFmpeg, PIL, or similar libraries.
Experience with GenAI tools such as ComfyUI, AUTOMATIC1111, or DeepFaceLab.
Experience working with datasets, annotation tools, data pipelines, or data quality workflows.
Experience with PyTorch, TensorFlow, scikit-learn, or other machine learning tools.
Experience with Docker or running local AI models.
Experience with automation, scraping, or processing large numbers of media files.
Serious interest in deepfakes, generative AI, computer vision, AI safety, fraud, identity verification, cybersecurity, or digital forensics.

What success looks like

In your first months, you will have:

Researched, tested, and documented multiple deepfake generation methods.
Learned what these methods are capable of, and learned to use them in unexpected ways.
Created structured synthetic media datasets with reliable labels and metadata.
Automated repetitive data tasks with clean, maintainable scripts.
Expanded the variety and coverage of our generated deepfake data.
Improved the reproducibility of our data-generation workflows.
Helped the AI team understand which generation methods, manipulation types, and data variations are represented in our datasets.
Built enough domain understanding to suggest new generation directions, not only execute assigned tasks.

How you can grow

This role can grow with you. You will start close to the data: understanding how synthetic media is created, building repeatable generation workflows, and improving the quality and variety of our datasets. As you gain domain expertise, you can move closer to the core machine learning work behind our product.

Over time, you may take ownership of larger parts of the model-development loop: from data generation and preprocessing to dataset design, model experiments, evaluation sets, and robustness testing. The early focus is generation; evaluation and robustness become more important as your understanding of the data and detection problem deepens.

This is a strong path for someone who wants to build practical expertise at the intersection of machine learning, computer vision, synthetic media, and real-world AI security.

What we offer

Ownership: your work will directly affect model training, dataset quality, product reliability, and how we build our data-generation function.
Speed: you will work in a focused, fast-moving team where good ideas can quickly turn into experiments, datasets, and product improvements.
Mission: you will work on one of the most urgent problems in digital trust: helping people, companies, and governments defend against synthetic media fraud.
Technical depth: you will build expertise in synthetic media generation, computer vision, ML datasets, data quality, and reproducible engineering workflows.
Direct collaboration with machine learning engineers, software engineers, and product specialists.
Flexible working hours and a hybrid setup from Delft.
An international, mission-driven team with a flat hierarchy and a practical engineering culture.
Access to relevant tools, datasets, technical guidance, and learning opportunities.
Being part of the YES!Delft community, surrounded by innovators and entrepreneurs.

Interview process

Our interview process is practical and focused on the work itself: an introductory conversation, a technical interview, and a small assignment or discussion around data generation, dataset structure, and reproducibility.

Why this role matters

Deepfake detection is only as strong as the data behind it. The datasets and generation workflows you build will influence model training, evaluation, product quality, and ultimately the reliability of our deepfake detection system.

In this role, you will help us understand how deepfakes are created, how they vary, and how new generation methods can be turned into useful training and testing data. Your work will directly contribute to building a more secure and trustworthy digital future.

If you want ownership, speed, mission, and technical depth while applying strong engineering skills to one of the most important problems in digital trust, we would like to hear from you.
