Google Online Security Blog: Google’s reward criteria for reporting bugs in AI products

By admin
In

September
DATE

, we shared how we are implementing the voluntary AI commitments that we and others in industry made at

the White House
FAC

in

July
DATE

.

One
CARDINAL

of the most important developments involves expanding our existing

Bug Hunter Program
ORG

to foster

third
ORDINAL

-party discovery and reporting of issues and vulnerabilities specific to our

AI
ORG

systems.

Today
DATE

, we’re publishing more details on these new reward program elements for the

first
ORDINAL

time.

Last year
DATE

we issued over

$12 million
MONEY

in rewards to security researchers who tested our products for vulnerabilities, and we expect

today
DATE

’s announcement to fuel even greater collaboration for

years
DATE

to come.

What’s in scope for rewards

In our recent

AI Red Team
ORG

report, we identified common tactics, techniques, and procedures (TTPs) that we consider most relevant and realistic for real-world adversaries to use against

AI
ORG

systems. The following table incorporates shared learnings from

Google
ORG

’s

AI Red Team
ORG

exercises to help the research community better understand what’s in scope for our reward program. We’re detailing our criteria for

AI
ORG

bug reports to assist our bug hunting community in effectively testing the safety and security of

AI
ORG

products. Our scope aims to facilitate testing for traditional security vulnerabilities as well as risks specific to AI systems. It is important to note that reward amounts are dependent on severity of the attack scenario and the type of target affected (go here for more information on our reward table).

Category Attack Scenario Guidance Prompt Attacks: Crafting adversarial prompts that allow an adversary to influence the behavior of the model, and hence the output in ways that were not intended by the application. Prompt injections that are invisible to victims and change the state of the victim’s account or or any of their assets. In Scope Prompt injections into any tools in which the response is used to make decisions that directly affect victim users. In Scope Prompt or preamble extraction in which a user is able to extract the initial prompt used to prime the model only when sensitive information is present in the extracted preamble. In Scope Using a product to generate violative, misleading, or factually incorrect content in your own session: e.g. ‘jailbreaks’. This includes ‘hallucinations’ and factually inaccurate responses.

Google
ORG

‘s generative

AI
ORG

products already have a dedicated reporting channel for these types of content issues. Out of Scope Training Data Extraction: Attacks that are able to successfully reconstruct verbatim training examples that contain sensitive information. Also called membership inference.

Training data extraction that reconstructs items used in the training data set that leak sensitive, non-public information. In

Scope Extraction
ORG

that reconstructs nonsensitive/public information. Out of Scope Manipulating Models: An attacker able to covertly change the behavior of a model such that they can trigger pre-defined adversarial behaviors.

Adversarial output or behavior that an attacker can reliably trigger via specific input in a model owned and operated by

Google
ORG

("backdoors"). Only in-scope when a model’s output is used to change the state of a victim’s account or data. In

Scope Attacks
WORK_OF_ART

in which an attacker manipulates the training data of the model to influence the model’s output in a victim’s session according to the attacker’s preference. Only in-scope when a model’s output is used to change the state of a victim’s account or data. In Scope Adversarial Perturbation: Inputs that are provided to a model that results in a deterministic, but highly unexpected output from the model. Contexts in which an adversary can reliably trigger a misclassification in a security control that can be abused for malicious use or adversarial gain. In

Scope Contexts
WORK_OF_ART

in which a model’s incorrect output or classification does not pose a compelling attack scenario or feasible path to

Google
ORG

or user harm. Out of Scope Model Theft / Exfiltration: AI models often include sensitive intellectual property, so we place a high priority on protecting these assets. Exfiltration attacks allow attackers to steal details about a model such as its architecture or weights. Attacks in which the exact architecture or weights of a confidential/proprietary model are extracted. In

Scope Attacks
WORK_OF_ART

in which the architecture and weights are not extracted precisely, or when they’re extracted from a non-confidential model. Out of Scope If you find a flaw in an AI-powered tool other than what is listed above, you can still submit, provided that it meets the qualifications listed on our program page.

A bug or behavior that clearly meets our qualifications for a valid security or abuse issue.

In Scope Using an AI product to do something potentially harmful that is already possible with other tools. For example, finding a vulnerability in open source software (already possible using publicly-available static analysis tools) and producing the answer to a harmful question when the answer is already available online. Out of Scope As consistent with our program, issues that we already know about are not eligible for reward. Out of Scope Potential copyright issues: findings in which products return content appearing to be copyright-protected.

Google
ORG

‘s generative

AI
ORG

products already have a dedicated reporting channel for these types of content issues. Out of Scope