PII Masking Module Documentation¶
Overview¶
The PII Masking Module is a lightweight yet robust wrapper around Microsoft Presidio that ensures certain categories of personally identifiable information (PII) are never exposed to backend systems or language model agents on AI Refinery. It is designed for conversational and agentic AI platforms, offering secure, frontend-based redaction of PII including emails, phone numbers, names, and more.
This module is fully configurable (the behavior and settings of the system can be customized by the user via a config file), reversible (masking can be undone through a placeholder mapping), and toggleable (the feature can be turned on/off by the user), making it adaptable for both production-grade privacy enforcement and local development needs.
Note: In this documentation, "PII" refers to the data types that can qualify as personally identifiable information or personal data as listed in Presidio's documentation.
Why Use It?¶
- User Privacy by Default: Ensures that PII included in inputs (e.g., names, emails, IDs) is masked before reaching any backend API, websocket, or agent runtime. No raw PII ever leaves the client without deliberate demasking.
- Configurable via Project YAML File: PII masking is toggled and configured directly inside the project's YAML file (e.g., pii_example.yaml, pii_search_example.yaml). This centralizes privacy settings alongside agent orchestration and utility configs.
- Plug-and-Play: The masking layer works with all agents. Whether it's a stateless echo bot or a search agent, PII redaction is handled transparently at the client level, with no changes needed in the agent logic.
- Structured Placeholders: Every detected PII entity is replaced with a type-annotated placeholder such as [EMAIL_1] or [PERSON_2], ensuring clarity and traceability across multi-turn exchanges. This is customizable: you can choose whether to replace, redact, or hash the information (these are what we call the 'operators').
- Default Masking Entities: If you enable PII masking (enable: True) in your YAML file but do not specify any entities or operators, the system automatically falls back to the defaults in pii_handler.yaml. By default, the following PII entities are masked using the replace operator:
  - PERSON
  - PHONE_NUMBER
  - EMAIL_ADDRESS
  - CREDIT_CARD
  - US_SSN
  - US_BANK_NUMBER
  - US_PASSPORT
  - LOCATION
  - DATE_TIME
  - IP_ADDRESS
  Each entity is replaced with a structured placeholder like [EMAIL_1], [PERSON_2], etc., unless overridden.
- Session-Based Metadata Tracking: Masking and unmasking operations share state within a session, not per query. This allows consistent unmasking of repeated entities across multiple messages, which is ideal for chat-based flows.
- Dual Demo Modes (Interactive + Batch): You can explore the module either interactively or with predefined query samples:
  - pii_example.py: A minimal interactive echoing agent demo that lets you input queries and receive masked responses in real time (see 'Example 1: pii_example.py and pii_example.yaml' under 'Examples').
  - pii_search_example.py: A batch-style search agent demo that processes multiple sample queries. You can toggle between modes by commenting/uncommenting the relevant lines at the bottom of the script (see 'Example 2: pii_search_example.py and pii_search_example.yaml' under 'Examples').
- Frontend-Only Rehydration: Original content is restorable only locally and only temporarily for display or user confirmation; it is never transmitted or persistently stored.
- Privacy Enhancing Feature: Supports data minimization and security of PII that might be used in inputs, in line with global data privacy and protection standards, especially in production environments.
Core Design Philosophy¶
Backend-Neutral Privacy¶
PII redaction is performed on the client (SDK) side, before PII reaches:
- agent functions,
- REST or web-socket endpoints,
- logging pipelines,
- or persistent databases.
Each detected entity is substituted with a consistent, type-annotated placeholder (e.g., [EMAIL_1], [PERSON_2]) to maintain context integrity.
Reversible — But Only During Session¶
- Masked outputs are reversible in memory for the duration of a single client session using PIIHandler.
- This enables frontend-only rehydration of redacted content for display, verification, or QA purposes.
- No PII is ever persisted or sent back to the server.
Microsoft Presidio Integration¶
The PII Masking Module is built on top of Microsoft's Presidio framework, providing robust, customizable, and language-aware detection and masking of PII.
Our system leverages three key components from Presidio:
AnalyzerEngine¶
Detects PII entities (e.g., names, emails, credit cards) in raw text using both pattern-based and ML-based recognizers.
AnonymizerEngine¶
Performs masking or redaction operations based on configuration. In this module, it generates structured placeholder tokens such as [EMAIL_1], [PHONE_2].
DeanonymizeEngine¶
Allows controlled, reversible recovery of original PII values using internally managed session-bound metadata.
YAML-Driven, Not Hardcoded¶
The module is fully YAML-driven. Instead of toggling flags in Python code, you (as the user) specify:
- Whether masking is enabled (enable: True)
- Which entities to monitor (common_entities)
- How each entity should be masked (entity_operator_mapping)
Example:
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - EMAIL_ADDRESS
        - PHONE_NUMBER
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params:
            new_value: "[EMAIL]"
This makes the system more declarative, scalable, and CI/CD-friendly.
One Masking Context Per Session¶
Unlike traditional systems that handle masking on a per-query basis, our implementation shares the masking state across the entire session. This enables:
- Reuse of consistent placeholders across turns (e.g., the same phone number will always map to [PHONE_1])
- Accurate demasking of multi-turn agent conversations
- More natural and trust-preserving UX in chat environments
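The session-wide reuse described above can be illustrated with a small, stdlib-only sketch. It is not the SDK's PIIHandler: the SessionMasker class is hypothetical, and a single toy email regex stands in for Presidio's recognizers. The point is only that one value-to-placeholder table, kept for the whole session, yields the same placeholder on every turn.

```python
import re
from collections import defaultdict

# Toy detector: one email regex stands in for Presidio's recognizers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class SessionMasker:
    """Hypothetical session-scoped masker (not the SDK's PIIHandler)."""

    def __init__(self):
        self.value_to_placeholder = {}    # original value -> placeholder
        self.counters = defaultdict(int)  # entity label -> last index used

    def _placeholder_for(self, label, value):
        # Reuse the placeholder if this value was already seen in the session.
        if value not in self.value_to_placeholder:
            self.counters[label] += 1
            self.value_to_placeholder[value] = f"[{label}_{self.counters[label]}]"
        return self.value_to_placeholder[value]

    def mask(self, text):
        return EMAIL_RE.sub(lambda m: self._placeholder_for("EMAIL", m.group()), text)


masker = SessionMasker()
turn_1 = masker.mask("Write to ana@example.com today.")
turn_2 = masker.mask("Did ana@example.com or bo@example.org reply?")
print(turn_1)
print(turn_2)
```

Because the table lives on the masker instance rather than inside mask(), the second turn maps the repeated address to [EMAIL_1] again and only the new address gets [EMAIL_2].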
Agent-Agnostic By Design¶
Whether you're using:
- a CustomAgent that simply echoes masked text,
- a SearchAgent that performs document retrieval,
- or a chain-of-thought multi-agent orchestration,
...no changes are needed within the agents. PII protection wraps around the full query life cycle, from input through orchestration to output, without interfering with agent logic.
System Flow¶
1. User Input Received¶
- A query containing PII is submitted via a DistillerClient or AsyncDistillerClient instance.
- The session is initialized with a YAML configuration (e.g., pii_example.yaml) that enables or disables masking and defines which entities to protect.
2. PII Detection & Masking (Client-Side Only)¶
- PIIHandler.mask_text() is invoked to scan the input for the configured common_entities.
- For each match:
  - A type-annotated placeholder is generated (e.g., [PHONE_1], [EMAIL_2])
  - A mapping between the original value and the placeholder is recorded per session
- If the same entity/value appears in multiple queries, the same placeholder is reused.
Example:
Original Input:
"Hi, I'm John. Email me at john.doe@company.com or call (212) 555-1234."
Masked Output:
"Hi, I'm [PERSON_1]. Email me at [EMAIL_1] or call [PHONE_1]."
3. Masked Query Sent to Agent(s)¶
- The masked version of the query is passed to agents through the orchestrator defined in the YAML.
- No raw PII reaches:
  - Agent logic
  - Backend APIs
  - Database logs
  - Internal storage
- The agents operate entirely on placeholders.
4. Agent Produces Response (Still Masked)¶
- Agent responses are not altered unless frontend demasking is explicitly triggered.
- By default, responses that include placeholders (e.g., [EMAIL_1]) will remain masked when returned to the client.
5. Optional: Demasking for Display¶
- If enabled by the client application (e.g., CLI, notebook, frontend), the response can be passed through PIIHandler.demask_text() to reverse placeholders back into original values.
- This rehydration occurs:
  - Locally only
  - Temporarily in memory
  - Without logging or persisting raw PII
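As a rough illustration of the demasking step, the sketch below swaps placeholders back using an in-memory mapping. The function name restore_placeholders and the placeholder regex are assumptions made for this example; the SDK's own PIIHandler.demask_text() may work differently.

```python
import re

# Matches placeholders of the form [LABEL_N], e.g. [EMAIL_1] or [US_SSN_2].
PLACEHOLDER_RE = re.compile(r"\[[A-Z_]+_\d+\]")


def restore_placeholders(text, placeholder_to_value):
    """Replace known placeholders with their original values, in memory only."""
    # Placeholders missing from the mapping are left untouched.
    return PLACEHOLDER_RE.sub(
        lambda m: placeholder_to_value.get(m.group(), m.group()), text
    )


mapping = {"[PERSON_1]": "John", "[EMAIL_1]": "john.doe@company.com"}
response = "Thanks [PERSON_1], I will write to [EMAIL_1]. Ref: [PHONE_9]"
restored = restore_placeholders(response, mapping)
print(restored)
```

Leaving unknown placeholders in place is a deliberate choice in this sketch: a placeholder that was never recorded in the session cannot be rehydrated, so it stays masked rather than raising an error.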
6. Session Ends → PII is Cleared¶
- When the session ends (or the client is explicitly closed), the PIIHandler clears:
  - The placeholder-to-PII mapping
  - Metadata used for demasking
- This ensures PII is never cached, stored, or retrievable after the session.
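One way to picture the end-of-session cleanup is a context manager that wipes its state on exit. This is a sketch under assumptions: masking_session and the shape of its state dictionary are hypothetical, chosen only to show the clearing behavior described above.

```python
from contextlib import contextmanager


@contextmanager
def masking_session():
    """Hypothetical session scope: state exists only inside the with-block."""
    state = {"placeholder_to_value": {}, "metadata": []}
    try:
        yield state
    finally:
        # Clear both the mapping and the demasking metadata, even on errors.
        state["placeholder_to_value"].clear()
        state["metadata"].clear()


with masking_session() as session:
    session["placeholder_to_value"]["[EMAIL_1]"] = "john.doe@company.com"
    session["metadata"].append({"entity": "EMAIL_ADDRESS", "start": 26, "end": 46})
    size_during = len(session["placeholder_to_value"])

size_after = len(session["placeholder_to_value"])
print(size_during, size_after)
```

Putting the cleanup in a finally clause means the mapping is emptied whether the session closes normally or is interrupted by an exception.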
Enabling or Disabling PII Masking¶
The PII Masking Module is now controlled entirely through our project YAML configuration. This provides a clean, centralized, and declarative interface for enabling or disabling masking on a per-project basis.
How it Works¶
To enable masking, include the following in your YAML config where you define your agents (e.g., pii_example.yaml, pii_search_example.yaml):
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - PERSON
        - EMAIL_ADDRESS
        - PHONE_NUMBER
        ...
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params:
            new_value: "[EMAIL]"
        ...
To disable masking, you can either omit the pii_masking block from your config file or explicitly set enable: False under base_config.pii_masking.
If pii_masking.enable is missing or set to False, PII masking is skipped entirely: no detection, no substitution, no metadata tracking.
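The enable/disable rule can be expressed as a short check on the parsed project config. The sketch assumes the YAML has already been loaded into a plain dictionary; masking_enabled is a hypothetical helper, not an SDK function.

```python
def masking_enabled(project_config):
    """Return True only when base_config.pii_masking.enable is truthy."""
    base_config = project_config.get("base_config") or {}
    pii_block = base_config.get("pii_masking") or {}
    # A missing block or a missing "enable" key both count as disabled.
    return bool(pii_block.get("enable", False))


enabled = masking_enabled({"base_config": {"pii_masking": {"enable": True}}})
explicitly_off = masking_enabled({"base_config": {"pii_masking": {"enable": False}}})
block_missing = masking_enabled({"orchestrator": {"agent_list": []}})
print(enabled, explicitly_off, block_missing)
```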
Runtime Behavior¶
When a project is registered via DistillerClient.create_project(config_path=...), the system:
- Reads the pii_masking block from the provided YAML config
- Initializes the PIIHandler accordingly:
  - Enables masking and loads overrides if enable: True
  - Disables masking if enable: False or absent
  - If the user specifies enable: True but does not provide any entities (e.g., PERSON, PHONE_NUMBER) or operators (replace, redact, hash), falls back to the pii_handler.yaml defaults, which simply replace each of the default entities listed above with a placeholder
This behavior applies to both AsyncDistillerClient and DistillerClient.
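The fallback to defaults can be sketched as a small resolution function. Everything here is illustrative: DEFAULT_PII_CONFIG is a shortened stand-in for pii_handler.yaml, resolve_pii_config is a hypothetical name, and the per-key fallback is an assumption about how partially specified configs might be handled.

```python
# Shortened stand-in for the SDK defaults; the real pii_handler.yaml lists more entities.
DEFAULT_PII_CONFIG = {
    "common_entities": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
    "entity_operator_mapping": {
        "DEFAULT": {"operator": "replace", "params": {"new_value": "<PII>"}},
    },
}


def resolve_pii_config(pii_block):
    """Return the effective masking config, or None when masking is disabled."""
    if not pii_block or not pii_block.get("enable", False):
        return None
    user_config = pii_block.get("config") or {}
    # Assumption: each key falls back independently when the user leaves it out.
    return {
        key: user_config.get(key) or default
        for key, default in DEFAULT_PII_CONFIG.items()
    }


disabled = resolve_pii_config({"enable": False})
defaults_only = resolve_pii_config({"enable": True})
custom = resolve_pii_config({"enable": True, "config": {"common_entities": ["IP_ADDRESS"]}})
print(disabled, defaults_only["common_entities"], custom["common_entities"])
```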
Default Configuration File¶
Default PII YAML Configuration: pii_handler.yaml¶
pii_handler.yaml is the default configuration file used by the PIIHandler class to control how PII is detected and masked. It is embedded within the SDK (usually under air/distiller/pii_handler/pii_handler.yaml) and is loaded automatically when you enable masking by setting base_config.pii_masking.enable: true in your project config but provide no further customization in the base_config.pii_masking.config section of your project YAML file (like pii_example.yaml).
pii_handler.yaml defines:
- What to detect (common_entities): a list of PII entity types (e.g., EMAIL_ADDRESS, PERSON, CREDIT_CARD) that should be scanned for in user queries.
- How to mask each type (entity_operator_mapping): for each entity, you specify a masking strategy (e.g., replace, redact, or hash) and optionally define a custom placeholder.
This is what it looks like:
common_entities:
  - PERSON
  - PHONE_NUMBER
  - EMAIL_ADDRESS
  - CREDIT_CARD
  - US_SSN
  - US_BANK_NUMBER
  - US_PASSPORT
  - LOCATION
  - DATE_TIME
  - IP_ADDRESS
entity_operator_mapping:
  CREDIT_CARD:
    operator: replace
    params:
      new_value: "[CREDIT_CARD]"
  US_SSN:
    operator: replace
    params:
      new_value: "[US_SSN]"
  US_BANK_NUMBER:
    operator: replace
    params:
      new_value: "[US_BANK_NUMBER]"
  US_PASSPORT:
    operator: replace
    params:
      new_value: "[US_PASSPORT]"
  PERSON:
    operator: replace
    params:
      new_value: "[PERSON]"
  PHONE_NUMBER:
    operator: replace
    params:
      new_value: "[PHONE]"
  EMAIL_ADDRESS:
    operator: replace
    params:
      new_value: "[EMAIL]"
  LOCATION:
    operator: replace
    params:
      new_value: "[LOCATION]"
  DATE_TIME:
    operator: replace
    params:
      new_value: "[DATE]"
  IP_ADDRESS:
    operator: replace
    params:
      new_value: "[IP]"
  DEFAULT:
    operator: replace
    params:
      new_value: "<PII>"
Examples¶
Configuration: Authentication¶
To use the AI Refinery agents with the PII Masking Module, you first need to authenticate with an ACCOUNT number and an API_KEY, both of which must be granted to you. Next, create an environment file (.env) in the same directory as the example files, containing your ACCOUNT and API_KEY values.
pii_example.py (from Example 1) and pii_search_example.py (from Example 2) are set up to work with this file.
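The example scripts read ACCOUNT and API_KEY with os.getenv and wrap the result in str(), so a missing variable silently becomes the string "None". A small pre-flight check, sketched below, can surface that earlier. read_credentials is a hypothetical helper and not part of the SDK; it accepts any mapping so it can be tried without touching the real environment.

```python
import os


def read_credentials(env=None):
    """Return (account, api_key) or raise if either variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in ("ACCOUNT", "API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["ACCOUNT"], env["API_KEY"]


# Demonstration with an explicit mapping instead of the real environment.
account, api_key = read_credentials({"ACCOUNT": "123456", "API_KEY": "example-key"})
print(account)
```

In a real script you would call load_dotenv() first and then read_credentials() with no argument, passing the two values on to login().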
Example 1: pii_example.py and pii_example.yaml¶
Purpose¶
A minimal interactive demo that lets you enter queries via the terminal.
It's ideal for understanding how PII masking integrates into a live session and how placeholder substitution works in real-time.
This uses:
- DistillerClient (synchronous wrapper)
- A simple Echoing Agent
- A project config defined in pii_example.yaml, including masking rules
How It Works¶
- You authenticate and create a new project using pii_example.yaml.
- You register an Echoing Agent, which simply returns your masked input.
- You can interactively enter text, and the PII masking is handled before anything reaches the agent.
- The masked response is printed, and frontend demasking (in memory only) restores original values if needed.
pii_example.py¶
# pii_example.py
import os
from typing import Any, Awaitable, Callable, Dict, Union, cast

from dotenv import load_dotenv

from air import DistillerClient, login

# Authenticate
load_dotenv()
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)


async def echoing_agent(query: str) -> str:
    """A minimal agent that just echoes queries. PII masking is handled by DistillerClient before this."""
    return f"Processed query:\n{query}"


def interactive():
    """Launch interactive demo with registered simple agent."""
    distiller_client = DistillerClient()
    distiller_client.create_project(config_path="pii_example.yaml", project="pii-demo")
    executor_dict = {"Echoing Agent": echoing_agent}
    distiller_client.interactive(
        project="pii-demo",
        uuid="some-uuid",
        executor_dict=cast(Dict[str, Union[Callable[..., Any], Dict[str, Callable[..., Any]]]], executor_dict),
    )


if __name__ == "__main__":
    print("\n[PII Demo] Interactive Mode")
    interactive()
pii_example.yaml¶
orchestrator:
  agent_list:
    - agent_name: "Echoing Agent"
utility_agents:
  - agent_class: CustomAgent
    agent_name: "Echoing Agent"
    agent_description: "This agent receives a query with PII already masked by the distiller client and either responds or echoes your query."
    config:
      output_style: "conversational"
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - PERSON
        - PHONE_NUMBER
        - EMAIL_ADDRESS
        - CREDIT_CARD
        - US_SSN
        - US_BANK_NUMBER
        - US_PASSPORT
        - LOCATION
        - DATE_TIME
        - IP_ADDRESS
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params: { new_value: "[EMAIL]" }
        PERSON:
          operator: replace
          params: { new_value: "[PERSON]" }
        PHONE_NUMBER:
          operator: replace
          params: { new_value: "[PHONE]" }
        CREDIT_CARD:
          operator: replace
          params: { new_value: "[CREDIT_CARD]" }
        US_SSN:
          operator: replace
          params: { new_value: "[US_SSN]" }
        US_BANK_NUMBER:
          operator: replace
          params: { new_value: "[US_BANK_NUMBER]" }
        US_PASSPORT:
          operator: replace
          params: { new_value: "[US_PASSPORT]" }
        LOCATION:
          operator: replace
          params: { new_value: "[LOCATION]" }
        DATE_TIME:
          operator: replace
          params: { new_value: "[DATE]" }
        IP_ADDRESS:
          operator: replace
          params: { new_value: "[IP]" }
Example 2: pii_search_example.py and pii_search_example.yaml¶
Purpose¶
This example is designed for scripted testing, where a batch of hardcoded queries is sent to an agent.
You can observe how each PII element is masked, and how the system behaves across multiple PII types.
It uses:
- AsyncDistillerClient
- A simple SearchAgent
- The same PII masking engine and configuration logic as Example 1
Flexible Modes¶
The script supports two modes:
- Demo mode (enabled by default) — runs through sample queries programmatically
- Interactive mode — comment out the demo and uncomment the interactive section at the bottom to run it live.
pii_search_example.py¶
# pii_search_example.py
import asyncio
import os
import uuid
from typing import Any, Awaitable, Callable, Dict, Union, cast

from dotenv import load_dotenv

from air import login
from air.distiller.client import AsyncDistillerClient

# Authenticate
load_dotenv()
auth = login(account=str(os.getenv("ACCOUNT")), api_key=str(os.getenv("API_KEY")))


async def search_agent(query: str) -> str:
    """Defining a search agent to test PII masking, which is handled by DistillerClient before this."""
    return f"Processed query:\n{query}"


async def pii_demo():
    queries = [
        "Hi, I'm Henry. My number is 4111 1111 1111 1111.",
        "Can you book a meeting with Dr. Jane Doe at (212) 555-7890 on May 4th?",
        "The IP address 192.168.0.1 should be allowed in the firewall.",
        "Email my updated resume to recruiter@company.com.",
        "Her SSN is 123-45-6789 and passport is X1234567.",
    ]
    distiller_client = AsyncDistillerClient()
    distiller_client.create_project(config_path="pii_search_example.yaml", project="pii-demo")
    session_id = str(uuid.uuid4())
    await distiller_client.connect(
        project="pii-demo",
        uuid=session_id,
        executor_dict={"Search Agent": search_agent},
    )
    print("\n[PII Demo] Running Sample Queries\n")
    for i, query in enumerate(queries, 1):
        print(f"Query {i}:\nOriginal: {query}")
        try:
            responses = await distiller_client.query(query)
            async for response in responses:
                print(f"Masked Output:\n{response['content']}\n{'-'*50}")
        except Exception as e:
            print(f"[ERROR] Failed to process query {i}: {e}")
            print("-" * 50)
    await distiller_client.close()


def interactive():
    distiller_client = AsyncDistillerClient()
    distiller_client.create_project(config_path="pii_search_example.yaml", project="pii-demo")
    executor_dict = {"Search Agent": search_agent}
    distiller_client.interactive(
        project="pii-demo",
        uuid="some-uuid",
        executor_dict=cast(Dict[str, Union[Callable[..., Any], Dict[str, Callable[..., Any]]]], executor_dict),
    )


if __name__ == "__main__":
    print("\n[PII Demo] Sample Queries")
    asyncio.run(pii_demo())
    # To try live interaction, comment out the line above and uncomment the next lines:
    # print("\n[PII Demo] Interactive Mode")
    # interactive()
pii_search_example.yaml¶
orchestrator:
  agent_list:
    - agent_name: "Search Agent"
utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "This agent receives a query with or without PII already masked by the distiller client, performs searches and replies to user."
    config:
      output_style: "conversational"
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - PERSON
        - PHONE_NUMBER
        - EMAIL_ADDRESS
        - CREDIT_CARD
        - US_SSN
        - US_BANK_NUMBER
        - US_PASSPORT
        - LOCATION
        - DATE_TIME
        - IP_ADDRESS
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params: { new_value: "[EMAIL]" }
        PERSON:
          operator: replace
          params: { new_value: "[PERSON]" }
        PHONE_NUMBER:
          operator: replace
          params: { new_value: "[PHONE]" }
        CREDIT_CARD:
          operator: replace
          params: { new_value: "[CREDIT_CARD]" }
        US_SSN:
          operator: replace
          params: { new_value: "[US_SSN]" }
        US_BANK_NUMBER:
          operator: replace
          params: { new_value: "[US_BANK_NUMBER]" }
        US_PASSPORT:
          operator: replace
          params: { new_value: "[US_PASSPORT]" }
        LOCATION:
          operator: replace
          params: { new_value: "[LOCATION]" }
        DATE_TIME:
          operator: replace
          params: { new_value: "[DATE]" }
        IP_ADDRESS:
          operator: replace
          params: { new_value: "[IP]" }
For reference¶
Example | Mode | Client Used | Purpose |
---|---|---|---|
pii_example.py | Interactive | DistillerClient | Try queries manually |
pii_search_example.py | Scripted (or Interactive) | AsyncDistillerClient | Batch-test masking behavior across PII types + try queries manually with a more complex agent |
Example Interaction¶
Input:
PII Identified:
[PII MASKING] Detected and masked the following PII types:
- PHONE_NUMBER at [24:38] -> '(212) 555-8124' -> [PHONE_1]
- EMAIL_ADDRESS at [67:89] -> 'john.doe@company.com' -> [EMAIL_1]
Masking by PIIHandler.mask_text():
Agent Output:
Unmasked View (frontend-only):
This view is reconstructed locally in-memory using metadata saved during masking. The demasking is only available for the session and is never persisted or sent to any backend.
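The detection log above reports each entity as a character span of the form [start:end]. As a rough sketch of how such spans could be turned into masked text, the example below applies them from right to left so that earlier offsets stay valid. The helper apply_spans is hypothetical, and the spans are computed with str.find on the sample sentence from the System Flow section rather than by a real detector.

```python
def apply_spans(text, detections):
    """Replace (start, end, placeholder) spans, rightmost first."""
    # Working right to left keeps the offsets of earlier spans unchanged.
    for start, end, placeholder in sorted(detections, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


text = "Hi, I'm John. Email me at john.doe@company.com or call (212) 555-1234."

# Stand-in for detector output: locate each known value to obtain its span.
known_values = [
    ("John", "[PERSON_1]"),
    ("john.doe@company.com", "[EMAIL_1]"),
    ("(212) 555-1234", "[PHONE_1]"),
]
detections = []
for value, placeholder in known_values:
    start = text.find(value)
    detections.append((start, start + len(value), placeholder))

masked = apply_spans(text, detections)
print(masked)
```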
Supported PII Types and Operators¶
Supported PII Types¶
The PII masking module leverages Microsoft Presidio to detect a broad range of commonly regulated or personal data types. All supported types must be explicitly listed in the YAML config under common_entities.
Entity Type | Placeholder Format | Example Match | Description |
---|---|---|---|
EMAIL_ADDRESS | [EMAIL_1] | john.doe@example.com | Email addresses |
PHONE_NUMBER | [PHONE_1] | (212) 555-8124 | US or international phone numbers |
PERSON | [PERSON_1] | Jane Doe | First and last names |
CREDIT_CARD | [CREDIT_CARD_1] | 4111 1111 1111 1111 | Visa/Mastercard/Amex credit cards |
US_SSN | [US_SSN_1] | 123-45-6789 | U.S. Social Security Numbers |
US_BANK_NUMBER | [US_BANK_NUMBER_1] | 987654321 | U.S. bank account numbers |
US_PASSPORT | [US_PASSPORT_1] | X1234567 | U.S. passport numbers |
LOCATION | [LOCATION_1] | 1600 Amphitheatre Parkway | Physical address, city, state, ZIP |
DATE_TIME | [DATE_1] | May 4th, 01/01/2024 | Absolute or relative dates and times |
IP_ADDRESS | [IP_1] | 192.168.0.1, 2001:db8::1 | IPv4 and IPv6 addresses |
To activate detection for a type, include it under common_entities in your YAML config. The default pii_handler.yaml and the examples already include all types above.
Supported PII Operators¶
Each entity type can be individually configured in the YAML using one of the supported operators below. You define the operator under entity_operator_mapping.
replace¶
- Replaces the original PII with a structured placeholder (e.g., [EMAIL_1])
- Default behavior if not specified
redact¶
- Completely removes the PII from the text (no placeholder left behind)
Input:
Masked:
hash¶
- Replaces the original PII with a hashed representation (irreversible)
Input:
Masked:
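The difference between the three operators can be illustrated with a stdlib-only sketch. It does not call Presidio: apply_operator is a hypothetical helper, the "<PII>" fallback simply mirrors the DEFAULT placeholder from pii_handler.yaml, and SHA-256 is an assumption about the hash algorithm.

```python
import hashlib


def apply_operator(value, operator, params=None):
    """Return the masked form of one detected value (illustrative only)."""
    params = params or {}
    if operator == "replace":
        # "<PII>" mirrors the DEFAULT placeholder from pii_handler.yaml.
        return params.get("new_value", "<PII>")
    if operator == "redact":
        return ""  # nothing is left behind
    if operator == "hash":
        # Assumption: SHA-256 hex digest; the module's actual hash settings may differ.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    raise ValueError(f"Unsupported operator: {operator}")


email = "john.doe@example.com"
replaced = apply_operator(email, "replace", {"new_value": "[EMAIL]"})
redacted = apply_operator(email, "redact")
hashed = apply_operator(email, "hash")
print(replaced, repr(redacted), hashed[:12])
```

Only replace keeps enough information (via the session mapping) to restore the original later; redact and hash discard it, which is why they suit cases where demasking is never needed.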
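DEFAULT Handler (Fallback)¶
To apply a global fallback to any undefined entity type, use the DEFAULT key under entity_operator_mapping. The default pii_handler.yaml, for instance, maps DEFAULT to the replace operator with new_value "<PII>".
If Presidio detects an entity type not explicitly listed in entity_operator_mapping, this operator will apply.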
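The fallback lookup amounts to a dictionary access with a default. The sketch below assumes the mapping has been loaded into a plain dictionary; operator_for is a hypothetical helper used only to show the lookup order.

```python
ENTITY_OPERATOR_MAPPING = {
    "EMAIL_ADDRESS": {"operator": "replace", "params": {"new_value": "[EMAIL]"}},
    "DEFAULT": {"operator": "replace", "params": {"new_value": "<PII>"}},
}


def operator_for(entity_type, mapping):
    """Return the entity's own operator config, else the DEFAULT entry (or None)."""
    return mapping.get(entity_type, mapping.get("DEFAULT"))


listed = operator_for("EMAIL_ADDRESS", ENTITY_OPERATOR_MAPPING)
unlisted = operator_for("IBAN_CODE", ENTITY_OPERATOR_MAPPING)
print(listed["params"]["new_value"], unlisted["params"]["new_value"])
```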
Advanced Customization¶
The PII Masking Module is highly flexible and allows you to tailor both which entities to detect and how to handle them. All customizations are centralized in the same YAML configuration file used for the agent orchestration (e.g., pii_example.yaml or pii_search_example.yaml), under base_config.pii_masking.
Adding More Entities¶
If Presidio supports additional PII types (e.g., IBAN_CODE, MEDICAL_LICENSE, or custom recognizers), you can extend your config:
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - IBAN_CODE
        - MEDICAL_LICENSE
        - PERSON
Make sure to also define the masking behavior for each added entity under entity_operator_mapping.
You can find the full list of built-in PII entity types in Presidio's documentation.
Defining Custom Operators or Placeholder Formats¶
You may redefine any placeholder format per entity by customizing new_value under the replace operator. Alternatively, set the operator to hash for irreversible masking, or to redact to remove PII altogether (no placeholder shown).
Creating Multiple YAML Variants¶
You can maintain multiple config files (e.g., pii_example.yaml, pii_search_example.yaml, pii_strict.yaml) with different combinations of:
- Enabled/disabled masking
- Different entity sets
- Operator schemes
- Agent configurations
Then pass the desired YAML to create_project(config_path=...) when registering your project.
Use Case Matrix¶
Below is a guide to help you decide when to use PII masking and how to configure it:
Use Case | Masking Enabled | Recommended Operator | Why This Matters |
---|---|---|---|
Production inference | Yes | replace | Prevents raw PII from reaching logs, models, or monitoring agents |
Internal debugging | Optional | N/A | Devs can see original inputs for issue diagnosis |
Compliance audits | Yes | replace, hash | Shows evidence of redaction while retaining traceability |
External demo/showcases | Yes | replace | Guarantees privacy-safe interactions during live sessions |
QA & annotation tooling | Optional | replace, redact | Keep PII masked during human reviews |
Analytics dashboards | Yes | replace, redact | Prevents PII leakage into metrics or reporting tools |
Sensitive search indexing | Yes | hash, redact | Allows indexing without storing PII |