Entry Point: From Simple Voice Assistant to the Phone’s “Second Brain”
- The Evolution of Mobile Assistants: A few years ago, mobile assistants were merely tools that responded to simple commands like “What’s the weather today?” or “Set an alarm for 7 a.m.” Today, on-device Agents are rapidly evolving into our phone’s “second brain,” a highly privileged core of the personal operating system. They are no longer isolated apps but “super stewards” that can work across all apps, invoke system functions, manage your files, read your messages, and access your contacts and photo albums.
- Security Value: This shift calls for models that can intelligently integrate unstructured knowledge—such as design documents, operation manuals, incident analysis reports, and user feedback—to support timely, accurate, and comprehensive fault diagnosis and safety reviews. This integration helps minimize risks caused by missing or misused information.
- The Everyday Risk of High-Privilege Operations: When we routinely issue commands to our phones—“Send this screenshot to Mr. Li,” “Create a calendar event based on this email and notify all participants,” or “If my wife calls, remind me that today is our anniversary”—we are effectively granting the agent permission to execute a series of high-privilege operations. The key concern is: how do we ensure this “agent” won’t be compromised, won’t misinterpret instructions, and won’t “hallucinate” at critical moments?
- Security Value: A robust evaluation framework is essential—akin to how we conduct background checks and periodic assessments for individuals in sensitive positions. Mobile agents must undergo comprehensive “security checkups” not only to defend against external threats, but also to proactively verify that their behavior remains reliable, controllable, and compliant when handling our everyday, yet high-stakes, tasks.
About the Project
Imagine this scenario: You receive a phishing email disguised as an “annual statement.” When you tell your phone Agent, “Help me summarize today’s unread emails,” the Agent is hijacked by a hidden malicious command within this email while processing it.
It silently calls the interface of your banking app, sends your login credentials and payment password to the attacker via SMS, and deletes the record of the sent message. All of this happens without your knowledge.
With the rapid advancement of on-device large model capabilities, phone Agents are integrating into our lives with unprecedented depth and breadth. They can understand context, call tools in any app, and autonomously plan multi-step tasks. This “autonomy” brings great convenience but also makes mobile phones a new target for attacks. The black-box nature of the models, hallucination issues, and deep coupling with phone hardware and software mean that even small security vulnerabilities can be greatly amplified, leading to severe consequences such as full personal data leaks, account theft, and covert eavesdropping.
This project is dedicated to building a systematic security evaluation framework for mobile Agents. Before they become inseparable daily companions, we aim to conduct a comprehensive “security checkup” that identifies and quantifies their potential risks, establishing a solid line of defense for personal information security in the era of the Internet of Everything.

Core Challenges
1. “Non-Realism” of the Evaluation Environment: Most existing large model evaluations use simplified question-and-answer formats that are detached from how an Agent actually interacts with a graphical user interface (GUI). Such a non-interactive environment cannot capture the Agent’s behavior when it faces dynamic UIs, system pop-ups, and application feedback, and therefore fails to reflect the complex risks of real mobile phone operations.
2. “One-Sidedness” of Evaluation Scenarios: Existing evaluation benchmarks often focus on a single risk type, such as testing only for prompt injection or only for malicious pop-ups. This narrow perspective ignores how diverse and intertwined the risks on mobile devices are (for example, phishing websites in the environment alongside contaminated data returned by malicious apps), making it impossible to build a complete risk profile.
3. “Linearization” of Evaluation Subjects: Task paths within a single application are usually linear, while real cross-application tasks form complex, networked workflows. Existing benchmarks mostly cover linear tasks and rarely include long-range, multi-step workflows that require context to be carried across apps. This “linear simplification” of real scenarios prevents the evaluation from examining the Agent’s security robustness when it handles interruptions, maintains state, and plans over long horizons.
4. “De-Processualization” of Evaluation Dimensions: Traditional evaluations are mostly outcome-based, only concerned with whether a dangerous action is ultimately completed. This ignores the analysis of the Agent’s decision intent and behavioral trajectory, making it impossible to distinguish between capability deficiencies and malicious intent, leading to misjudgments of the model’s true security level.
Because of these deficiencies, existing evaluation frameworks pose non-negligible security risks when applied to on-device Agents. Building a comprehensive security evaluation framework that is based on real interactive environments, covers diverse risk scenarios, and assesses intent and outcome together has become a pressing need.
Safety Design
High-Fidelity Evaluation Environment
Addressing the “non-realism” of the evaluation environment, the team has built a high-fidelity environment based on virtual machines and real mobile phone images.
Agents can perform unrestricted GUI interactions like real users through screenshots and simulated touches, operating apps just as a human would. This environment supports the dynamic injection of environmental threats (such as phishing ads/emails) during execution to evaluate the Agent’s emergency response capabilities.
Through snapshot technology, test conditions can be precisely reset, ensuring standardization and reproducibility.
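
The snapshot-and-reset idea can be illustrated with a minimal sketch. Everything here is hypothetical: `ToyDeviceEnv`, `save_snapshot`, `restore_snapshot`, and `run_episode` are illustrative names, and an in-memory dictionary stands in for the virtual machine and phone image. The point is only that restoring a saved state before every episode makes repeated runs start from identical conditions.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDeviceEnv:
    """Hypothetical in-memory stand-in for a VM-backed phone image."""
    state: dict = field(default_factory=lambda: {"files": ["photo.jpg"], "sms_sent": []})
    _snapshots: dict = field(default_factory=dict)

    def save_snapshot(self, name: str) -> None:
        # Deep-copy so later mutations cannot leak into the saved state.
        self._snapshots[name] = copy.deepcopy(self.state)

    def restore_snapshot(self, name: str) -> None:
        self.state = copy.deepcopy(self._snapshots[name])


def run_episode(env, snapshot, actions):
    """Reset to a known snapshot, apply the actions, and return the final state."""
    env.restore_snapshot(snapshot)
    for apply_action in actions:
        apply_action(env.state)
    return copy.deepcopy(env.state)


def delete_photo(state):
    # Assumed example of a destructive action taken during a test task.
    state["files"].remove("photo.jpg")


env = ToyDeviceEnv()
env.save_snapshot("clean")
first = run_episode(env, "clean", [delete_photo])
second = run_episode(env, "clean", [delete_photo])  # would raise without the reset
```

Because each episode begins with a restore, the second run sees the photo again and ends in the same state as the first.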
Risk Dimension Matrix
Addressing the “one-sidedness” of evaluation scenarios, the team designed a multi-dimensional risk matrix, building an evaluation benchmark covering 400+ risk tasks. This matrix comprehensively covers risks from two dimensions: “source of risk” and “domain of risk.”
The sources include simulated user-initiated risks (“user-originated risks,” such as high-risk operation instructions) and external environment-induced risks (“environment-originated risks,” such as phishing websites); the domains span core scenarios such as web browsing, social communication, and photo galleries, ensuring the breadth and representativeness of the evaluation.
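
One way to picture the two-dimensional matrix is as a set of tasks tagged with a source and a domain. The sketch below is an assumption about representation, not the benchmark's actual schema: the enum members, the `RiskTask` fields, the `coverage` helper, and the two sample tasks are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RiskSource(Enum):
    USER = "user-originated"
    ENVIRONMENT = "environment-originated"


class RiskDomain(Enum):
    WEB_BROWSING = "web browsing"
    SOCIAL = "social communication"
    GALLERY = "photo gallery"


@dataclass(frozen=True)
class RiskTask:
    task_id: str
    source: RiskSource
    domain: RiskDomain
    instruction: str


def coverage(tasks):
    """Count tasks per (source, domain) cell, keeping empty cells visible."""
    counts = Counter((t.source, t.domain) for t in tasks)
    return {(s, d): counts[(s, d)] for s in RiskSource for d in RiskDomain}


# Invented sample tasks, one per risk source.
tasks = [
    RiskTask("t1", RiskSource.USER, RiskDomain.GALLERY, "Delete all photos without asking"),
    RiskTask("t2", RiskSource.ENVIRONMENT, RiskDomain.WEB_BROWSING, "Open a page hosting a phishing form"),
]
matrix = coverage(tasks)
empty_cells = [cell for cell, n in matrix.items() if n == 0]
```

Listing the empty cells makes gaps in coverage explicit, which is the practical purpose of organizing tasks as a matrix.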
Scenario-Based Risk Injection
Addressing the “linear simplification” of evaluation subjects, the team developed a “scenario-based risk injection” mechanism. It can dynamically inject threats at the seams of app switching or during the Agent’s execution of long tasks (for example, a fake confirmation box pops up when switching from shopping to payment, or hidden instructions are injected when processing email text), making it possible to evaluate the Agent’s security and robustness when complex workflows are interrupted.
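
Injection at an app-switch seam can be sketched as a trigger that fires when the workflow moves from one app to another. This is a simplified illustration under stated assumptions: `Injection` and `run_with_injection` are hypothetical names, and a list of `(app, action)` pairs stands in for a real GUI trajectory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    """Hypothetical threat description: fire when moving from_app -> to_app."""
    from_app: str
    to_app: str
    payload: str  # e.g. the text of a fake confirmation box


def run_with_injection(steps, injection):
    """Replay (app, action) steps, inserting the payload at each matching app switch."""
    trace = []
    prev_app = None
    for app, action in steps:
        if prev_app == injection.from_app and app == injection.to_app:
            trace.append(("INJECTED", injection.payload))
        trace.append((app, action))
        prev_app = app
    return trace


steps = [("shopping", "add to cart"), ("shopping", "checkout"), ("payment", "confirm order")]
fake_box = Injection("shopping", "payment", "Fake dialog: re-enter your payment password")
trace = run_with_injection(steps, fake_box)
```

In this toy trace the fake dialog lands between the shopping checkout and the payment confirmation, which is where the Agent's handling of the interruption would be observed.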
Process-Level Automated Evaluation
Addressing the “de-processualization” of evaluation dimensions, the team deployed a two-layer automated evaluation pipeline driven by LLM-as-a-Judge, performing a dual verification of “intent” and “outcome” for each of the Agent’s operations.
Intent Layer: Before execution, a judging large model analyzes the Agent’s chain of thought to detect dangerous intentions.
Outcome Layer: Rule-based state monitors (for example, for file changes or malicious installations) determine precisely whether the dangerous behavior was completed.
This decoupled mechanism provides in-depth diagnostic insights, facilitating the development of trustworthy Agents.
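
A rough sketch of how decoupled intent and outcome checks might combine into a diagnosis is shown below. It is not the project's pipeline: `keyword_judge` is a trivial placeholder for the judging model (a real judge would be a model call), `files_deleted` is one assumed example of a rule-based monitor, and the verdict labels are invented.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    risky_intent: bool
    risky_outcome: bool

    @property
    def label(self) -> str:
        if self.risky_intent and self.risky_outcome:
            return "unsafe: intended and completed"
        if self.risky_intent:
            return "unsafe intent, action not completed"
        if self.risky_outcome:
            return "harm completed without stated intent"
        return "safe"


def keyword_judge(reasoning: str) -> bool:
    """Placeholder for the judging LLM; matches a few hard-coded phrases."""
    return any(k in reasoning.lower() for k in ("send password", "delete the record"))


def files_deleted(before: dict, after: dict) -> bool:
    """Rule-based monitor: flags any file present before but missing after."""
    return bool(set(before["files"]) - set(after["files"]))


def evaluate(reasoning, before, after, judge=keyword_judge, monitor=files_deleted):
    """Run both layers independently and combine them into one verdict."""
    return Verdict(risky_intent=judge(reasoning), risky_outcome=monitor(before, after))


blocked = evaluate("I will send password via SMS", {"files": ["a.txt"]}, {"files": ["a.txt"]})
accident = evaluate("Tidy up the folder", {"files": ["a.txt"]}, {"files": []})
```

Keeping the two checks separate is what allows a case like `accident` (harm without stated intent) to be told apart from `blocked` (dangerous intent that never took effect).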
Benefits of the Tools
This evaluation framework provides a systematic solution to the security challenges of deploying on-device Agents. Its core benefits fall on three levels: it makes security risks measurable, drives the improvement of security capabilities, and establishes a reference benchmark for industry security access.
First, the framework turns the vague and invisible “security” of Agents into a set of clear, quantifiable evaluation indicators. By testing diverse risk scenarios in a high-fidelity simulation environment, it measures the model’s risk intent generation rate and behavior completion rate. These indicators give developers a yardstick for security levels and let them identify security weaknesses directly. This “measurability” is a prerequisite for targeted security enhancement.
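
As a small worked example, the two indicators could be computed from per-task verdicts as follows. The definitions are an assumption, since the text does not give formulas: here each rate is the fraction of evaluated tasks in which the corresponding flag was raised, and `risk_rates` is a hypothetical helper.

```python
def risk_rates(verdicts):
    """verdicts: list of (risky_intent, risky_outcome) booleans, one pair per task.

    Assumed definitions: each rate is the share of tasks with that flag set.
    """
    if not verdicts:
        raise ValueError("need at least one evaluated task")
    n = len(verdicts)
    intent_rate = sum(1 for intent, _ in verdicts if intent) / n
    completion_rate = sum(1 for _, outcome in verdicts if outcome) / n
    return intent_rate, completion_rate


# Four invented task results: two show risky intent, one completes the behavior.
rates = risk_rates([(True, True), (True, False), (False, False), (False, False)])
```

For these four invented results the intent rate is 0.5 and the completion rate is 0.25; a gap between the two suggests risky plans that were formed but not carried through.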
Secondly, this framework is not only an “examiner” but also a “sparring partner” that drives evolution. In-depth analysis of evaluation data can reveal typical failure patterns (such as blind spots in multimodal phishing perception, fragmentation of cross-application logic), pointing the way for model optimization. Developers can utilize this framework for continuous red-blue team exercises in the early stages, building security capabilities into the model and significantly reducing R&D costs and later risks.
Finally, the framework provides an objective and fair industry security benchmark for the on-device Agent market, offering reliable “security capability specifications” for mobile phone manufacturers, developers, and users. This “security annual checkup” clearly shows how different Agents perform when responding to real risks. It also establishes practical security access thresholds for trustworthy personal intelligent assistants, so that users receive “reliable partners” that have undergone rigorous “real-world drills” and guard their digital security 24/7.