PII Masking Module Documentation
Overview
The PII Masking Module is a lightweight wrapper around Microsoft Presidio that keeps personally identifiable information (PII) from being exposed to backend systems or language model agents on AI Refinery. It is designed for conversational and agentic AI platforms and offers frontend-based redaction of sensitive data such as emails, phone numbers, and names.
The module is fully configurable (behavior and settings are customized through a config file), reversible (masking can be undone through a placeholder mapping), and toggleable (the feature can be turned on or off), which makes it suitable for both production-grade privacy enforcement and local development.
Why Use It?
- User Privacy by Default: Sensitive inputs (e.g., names, emails, IDs) are masked before they reach any backend API, websocket, or agent runtime. No raw PII leaves the client without deliberate demasking.
- Configurable via Project YAML File: PII masking is toggled and configured directly inside your project's YAML file (e.g., pii_example.yaml, pii_search_example.yaml). This centralizes privacy settings alongside agent orchestration and utility configs.
- Plug-and-Play: The masking layer works with all agents. Whether it's a stateless echo bot or a search agent, PII redaction is handled transparently at the client level, with no changes needed in the agent logic.
- Structured Placeholders: Every detected entity is replaced with a type-annotated placeholder such as [EMAIL_1] or [PERSON_2], which keeps multi-turn exchanges clear and traceable. This is customizable: you decide whether to replace, redact, or hash the information (these are the "operators").
- Default Masking Entities: If you enable PII masking (enable: True) in your YAML file but do not specify any entities or operators, the system falls back to the defaults in pii_handler.yaml. By default, the following PII entities are masked using the replace operator:
  - PERSON
  - PHONE_NUMBER
  - EMAIL_ADDRESS
  - CREDIT_CARD
  - US_SSN
  - US_BANK_NUMBER
  - US_PASSPORT
  - LOCATION
  - DATE_TIME
  - IP_ADDRESS
  Each entity is replaced with a structured placeholder like [EMAIL_1], [PERSON_2], etc., unless overridden.
- Session-Based Metadata Tracking: Masking and unmasking operations share state within a session, not per query. This allows consistent unmasking of repeated entities across multiple messages, which is ideal for chat-based flows.
- Dual Demo Modes (Interactive + Batch): You can explore the module either interactively or with predefined query samples:
  - pii_example.py: A minimal interactive echoing agent demo that lets you input queries and receive masked responses in real time (see "Example 1: pii_example.py and pii_example.yaml" under "Examples").
  - pii_search_example.py: A batch-style search agent demo that processes multiple sample queries. You can toggle between modes by commenting/uncommenting the relevant lines at the bottom of the script (see "Example 2: pii_search_example.py and pii_search_example.yaml" under "Examples").
- Frontend-Only Rehydration: Original content is restorable only locally and only temporarily for display or user confirmation. It is never transmitted or stored.
- Regulatory Compliance Alignment: Supports data minimization and protection standards like GDPR, HIPAA, and CCPA, especially in production environments where sensitive inputs must be masked before processing.
Core Design Philosophy
Backend-Neutral Privacy
PII redaction is performed on the client (SDK) side, before any data reaches:
- agent functions,
- REST or web-socket endpoints,
- logging pipelines,
- or persistent databases.
Each detected entity is substituted with a consistent, format-preserving placeholder (e.g., [EMAIL_1], [PERSON_2]) to maintain context integrity while safeguarding privacy.
Reversible, But Only During Session
- Masked outputs are reversible in memory for the duration of a single client session using PIIHandler.
- This enables frontend-only rehydration of redacted content for display, verification, or QA purposes.
- No sensitive information is ever persisted or sent back to the server.
Microsoft Presidio Integration
The PII Masking Module is built on top of Microsoft's Presidio framework, providing robust, customizable, and language-aware detection and anonymization of personally identifiable information (PII).
The module uses three key components from Presidio:
AnalyzerEngine
Detects PII entities (e.g., names, emails, credit cards) in raw text using both pattern-based and ML-based recognizers.
AnonymizerEngine
Performs masking or redaction operations based on configuration. In this module, it generates structured placeholder tokens such as [EMAIL_1], [PHONE_2].
DeanonymizeEngine
Allows controlled, reversible recovery of original PII values using internally managed session-bound metadata.
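As a rough illustration of how the first two engines fit together, the sketch below maps a YAML-style entity_operator_mapping onto Presidio's AnalyzerEngine and AnonymizerEngine. The helper names to_operator_specs and anonymize are hypothetical and not part of the SDK. The sketch assumes presidio-analyzer, presidio-anonymizer, and a spaCy English model are installed, and it does not reproduce the numbered, session-bound placeholders described in this document.

```python
def to_operator_specs(entity_operator_mapping):
    """Flatten a YAML-style mapping into (operator_name, params) pairs.

    Hypothetical helper for illustration; not an SDK function.
    """
    return {
        entity: (rule["operator"], dict(rule.get("params") or {}))
        for entity, rule in entity_operator_mapping.items()
    }


def anonymize(text, entities, entity_operator_mapping):
    """Detect the given entity types and apply the configured operators."""
    # Imported lazily so the helper above stays usable without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    # Step 1: detection. Returns recognizer results with type and offsets.
    results = AnalyzerEngine().analyze(text=text, entities=entities, language="en")

    # Step 2: build one OperatorConfig per configured entity type.
    operators = {
        entity: OperatorConfig(name, params)
        for entity, (name, params) in to_operator_specs(entity_operator_mapping).items()
    }

    # Step 3: anonymization. The engine result exposes the masked text as .text.
    engine_result = AnonymizerEngine().anonymize(
        text=text, analyzer_results=results, operators=operators
    )
    return engine_result.text
```

Calling anonymize("Email me at john.doe@company.com", ["EMAIL_ADDRESS"], mapping) with a replace rule for EMAIL_ADDRESS would be expected to substitute the configured new_value for the address.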
YAML-Driven, Not Hardcoded
The module is configured entirely through YAML. Instead of toggling flags in Python code, you specify:
- Whether masking is enabled (enable: True)
- Which entities to monitor (common_entities)
- How each entity should be masked (entity_operator_mapping)
Example:
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - EMAIL_ADDRESS
        - PHONE_NUMBER
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params:
            new_value: "[EMAIL]"
This makes the system more declarative, scalable, and CI/CD-friendly.
One Masking Context Per Session
Unlike traditional systems that handle masking on a per-query basis, this implementation shares the masking state across the entire session. This enables:
- Reuse of consistent placeholders across turns (e.g., the same phone number always maps to [PHONE_1])
- Accurate demasking of multi-turn agent conversations
- More natural and trust-preserving UX in chat environments
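The idea of a shared masking context can be sketched with a small class that remembers which placeholder each value received. This is an illustration only: SessionMasker is a hypothetical stand-in, not the SDK's PIIHandler, and the two regular expressions are simplified assumptions that replace Presidio's recognizers.

```python
import re
from collections import defaultdict

# Simplified, hypothetical detectors standing in for Presidio's recognizers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


class SessionMasker:
    """Keeps one value-to-placeholder mapping for the whole session."""

    def __init__(self):
        self.value_to_placeholder = {}
        self.counters = defaultdict(int)

    def _placeholder_for(self, label, value):
        # Reuse the existing placeholder if this value was seen before.
        if value not in self.value_to_placeholder:
            self.counters[label] += 1
            self.value_to_placeholder[value] = f"[{label}_{self.counters[label]}]"
        return self.value_to_placeholder[value]

    def mask_text(self, text):
        for label, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, label=label: self._placeholder_for(label, m.group(0)),
                text,
            )
        return text


masker = SessionMasker()
turn_1 = masker.mask_text("Email me at john.doe@company.com or call (212) 555-1234.")
turn_2 = masker.mask_text("Reach john.doe@company.com or jane@corp.org")
print(turn_1)
print(turn_2)
```

Because the mapping lives on the instance rather than inside mask_text, the second turn reuses the placeholder assigned to the repeated address and only numbers the new one.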
Agent-Agnostic By Design
Whether you're using:
- a CustomAgent that simply echoes masked text,
- a SearchAgent that performs document retrieval,
- or a chain-of-thought multi-agent orchestration,
...no changes are needed within the agents. PII protection wraps around the full query life cycle, from input through orchestration to output, without interfering with agent logic.
System Flow
1. User Input Received
- A query containing potentially sensitive information is submitted via a DistillerClient or AsyncDistillerClient instance.
- The session is initialized with a YAML configuration (e.g., pii_example.yaml) that enables or disables masking and defines which entities to protect.
2. PII Detection & Masking (Client-Side Only)
- PIIHandler.mask_text() is invoked to scan the input for the configured common_entities.
- For each match:
  - A format-preserving placeholder is generated (e.g., [PHONE_1], [EMAIL_2])
  - A mapping between the original value and the placeholder is recorded per session
- If the same entity/value appears in multiple queries, the same placeholder is reused.
Example:
Original Input:
"Hi, I'm John. Email me at john.doe@company.com or call (212) 555-1234."
Masked Output:
"Hi, I'm [PERSON_1]. Email me at [EMAIL_1] or call [PHONE_1]."
3. Masked Query Sent to Agent(s)
- The masked version of the query is passed to agents through the orchestrator defined in the YAML.
- No raw PII reaches:
  - Agent logic
  - Backend APIs
  - Database logs
  - Internal storage
- The agents operate entirely on placeholders.
4. Agent Produces Response (Still Masked)
- Agent responses are not altered unless frontend demasking is explicitly triggered.
- By default, responses that include placeholders (e.g., [EMAIL_1]) remain masked when returned to the client.
5. Optional: Demasking for Display
- If enabled by the client application (e.g., CLI, notebook, frontend), the response can be passed through PIIHandler.demask_text() to convert placeholders back into the original values.
- This rehydration occurs:
  - Locally only
  - Temporarily in memory
  - Without logging or persisting raw PII
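A minimal sketch of the demasking step, assuming the session keeps a placeholder-to-value dictionary. The function demask_placeholders is a hypothetical stand-in for PIIHandler.demask_text(), whose actual signature is not shown in this document.

```python
import re

# Matches placeholders such as [EMAIL_1] or [US_SSN_2].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def demask_placeholders(text, placeholder_to_value):
    """Swap known placeholders back to their original values, in memory only."""
    # A single regex pass means restored values are never scanned again,
    # and unknown placeholders are left untouched.
    return PLACEHOLDER.sub(
        lambda m: placeholder_to_value.get(m.group(0), m.group(0)),
        text,
    )


session_mapping = {
    "[EMAIL_1]": "john.doe@company.com",
    "[PHONE_1]": "(212) 555-1234",
}
agent_response = "Processed query:\nEmail [EMAIL_1], call [PHONE_1], cc [EMAIL_2]"
print(demask_placeholders(agent_response, session_mapping))
```

Leaving unknown placeholders as they are is a design choice of this sketch: a response that mentions a placeholder from another session stays masked instead of raising an error.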
6. Session Ends: PII is Cleared
- When the session ends (or the client is explicitly closed), the PIIHandler clears:
  - The placeholder-to-PII mapping
  - Metadata used for demasking
- This ensures PII is never cached, stored, or retrievable after the session.
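The clean-up described above can be pictured as a context manager that wipes the mapping when the session closes. This is only a sketch: SessionState and masking_session are hypothetical names, not SDK classes, and the real PIIHandler may release its state differently.

```python
from contextlib import contextmanager


class SessionState:
    """Holds the in-memory placeholder-to-PII mapping for one session."""

    def __init__(self):
        self.placeholder_to_value = {}

    def clear(self):
        self.placeholder_to_value.clear()


@contextmanager
def masking_session():
    state = SessionState()
    try:
        yield state
    finally:
        # Runs on normal exit and on exceptions, so the mapping never outlives the session.
        state.clear()


with masking_session() as state:
    state.placeholder_to_value["[EMAIL_1]"] = "john.doe@company.com"
    size_during_session = len(state.placeholder_to_value)

size_after_session = len(state.placeholder_to_value)
print(size_during_session, size_after_session)
```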
Enabling or Disabling PII Masking
The PII Masking Module is controlled entirely through your project YAML configuration. This provides a clean, centralized, and declarative interface for enabling or disabling masking on a per-project basis.
How it Works
To enable masking, include the following in the YAML config where you define your agents (e.g., pii_example.yaml, pii_search_example.yaml):
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - PERSON
        - EMAIL_ADDRESS
        - PHONE_NUMBER
        ...
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params:
            new_value: "[EMAIL]"
        ...
To disable masking, either omit the pii_masking block from your config file or explicitly set:
base_config:
  pii_masking:
    enable: False
If pii_masking.enable is missing or set to False, PII masking is skipped entirely: no detection, no substitution, no metadata tracking.
Runtime Behavior
When a project is registered via DistillerClient.create_project(config_path=...), the system:
- Reads the pii_masking block from the provided YAML config
- Initializes the PIIHandler accordingly:
  - Enables masking and loads overrides if enable: True
  - Disables masking if enable: False or absent
  - If you specify enable: True but do not provide any entities (e.g., PERSON, PHONE_NUMBER) or operators (replace, redact, hash), it falls back to the pii_handler.yaml defaults, which replace each of the default entities listed earlier with a placeholder
This behavior applies to both AsyncDistillerClient and DistillerClient.
Default Configuration File
Default PII YAML Configuration: pii_handler.yaml
pii_handler.yaml is the default configuration file used by the PIIHandler class to control how personally identifiable information (PII) is detected and masked. It is embedded within the SDK (usually under air/distiller/pii_handler/pii_handler.yaml) and automatically loaded when you enable masking by setting base_config.pii_masking.enable: true in your project config but do not provide further customization via the base_config.pii_masking.config section of your project YAML file (such as pii_example.yaml).
pii_handler.yaml defines:
- What to detect (common_entities): A list of PII entity types (e.g., EMAIL_ADDRESS, PERSON, CREDIT_CARD) that should be scanned for in user queries.
- How to mask each type (entity_operator_mapping): For each entity, you specify a masking strategy (e.g., replace, redact, or hash) and optionally define a custom placeholder.
This is what it looks like:
common_entities:
  - PERSON
  - PHONE_NUMBER
  - EMAIL_ADDRESS
  - CREDIT_CARD
  - US_SSN
  - US_BANK_NUMBER
  - US_PASSPORT
  - LOCATION
  - DATE_TIME
  - IP_ADDRESS
entity_operator_mapping:
  CREDIT_CARD:
    operator: replace
    params:
      new_value: "[CREDIT_CARD]"
  US_SSN:
    operator: replace
    params:
      new_value: "[US_SSN]"
  US_BANK_NUMBER:
    operator: replace
    params:
      new_value: "[US_BANK_NUMBER]"
  US_PASSPORT:
    operator: replace
    params:
      new_value: "[US_PASSPORT]"
  PERSON:
    operator: replace
    params:
      new_value: "[PERSON]"
  PHONE_NUMBER:
    operator: replace
    params:
      new_value: "[PHONE]"
  EMAIL_ADDRESS:
    operator: replace
    params:
      new_value: "[EMAIL]"
  LOCATION:
    operator: replace
    params:
      new_value: "[LOCATION]"
  DATE_TIME:
    operator: replace
    params:
      new_value: "[DATE]"
  IP_ADDRESS:
    operator: replace
    params:
      new_value: "[IP]"
  DEFAULT:
    operator: replace
    params:
      new_value: "<PII>"
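The fallback to these defaults can be sketched as a small resolution function over already-parsed config dictionaries. The name resolve_pii_config, the shortened DEFAULTS dictionary, and the per-key fallback (defaults fill in whichever of the two keys is missing) are assumptions made for illustration; the SDK's actual loading logic may differ.

```python
# Shortened stand-in for the parsed contents of pii_handler.yaml.
DEFAULTS = {
    "common_entities": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
    "entity_operator_mapping": {
        "DEFAULT": {"operator": "replace", "params": {"new_value": "<PII>"}},
    },
}


def resolve_pii_config(project_config, defaults=DEFAULTS):
    """Return the effective masking config, or None when masking is off."""
    block = (project_config.get("base_config") or {}).get("pii_masking") or {}

    # A missing block or enable: False means masking is skipped entirely.
    if not block.get("enable", False):
        return None

    user = block.get("config") or {}
    return {
        # Assumption: each key falls back to the defaults independently.
        "common_entities": user.get("common_entities") or defaults["common_entities"],
        "entity_operator_mapping": (
            user.get("entity_operator_mapping") or defaults["entity_operator_mapping"]
        ),
    }


disabled = resolve_pii_config({"base_config": {"pii_masking": {"enable": False}}})
fallback = resolve_pii_config({"base_config": {"pii_masking": {"enable": True}}})
custom = resolve_pii_config(
    {"base_config": {"pii_masking": {"enable": True,
                                     "config": {"common_entities": ["EMAIL_ADDRESS"]}}}}
)
print(disabled, fallback["common_entities"], custom["common_entities"])
```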
Examples
Example 1: pii_example.py and pii_example.yaml
Purpose
A minimal interactive demo that lets you enter queries via the terminal.
It's ideal for understanding how PII masking integrates into a live session and how placeholder substitution works in real time.
This uses:
- DistillerClient (synchronous wrapper)
- A simple Echoing Agent
- A project config defined in pii_example.yaml, including masking rules
How It Works
- You authenticate and create a new project using pii_example.yaml.
- You register an Echoing Agent, which simply returns your masked input.
- You can interactively enter text, and the PII masking is handled before anything reaches the agent.
- The masked response is printed, and frontend demasking (in memory only) restores original values if needed.
pii_example.py
# pii_example.py
import os
from typing import Any, Callable, Dict, Union, cast

from air import DistillerClient, login

# Authenticate with credentials read from the environment
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)


async def echoing_agent(query: str) -> str:
    """A minimal agent that just echoes queries. PII masking is handled by DistillerClient before this."""
    return f"Processed query:\n{query}"


def interactive():
    """Launch interactive demo with registered simple agent."""
    distiller_client = DistillerClient()
    # Register the project; the YAML file carries the pii_masking settings
    distiller_client.create_project(config_path="pii_example.yaml", project="pii-demo")

    # Map the agent name used in the YAML to its executor
    executor_dict = {"Echoing Agent": echoing_agent}

    distiller_client.interactive(
        project="pii-demo",
        uuid="some-uuid",
        executor_dict=cast(Dict[str, Union[Callable[..., Any], Dict[str, Callable[..., Any]]]], executor_dict),
    )


if __name__ == "__main__":
    print("\n[PII Demo] Interactive Mode")
    interactive()
pii_example.yaml
orchestrator:
  agent_list:
    - agent_name: "Echoing Agent"

utility_agents:
  - agent_class: CustomAgent
    agent_name: "Echoing Agent"
    agent_description: "This agent receives a query with sensitive information already masked by the distiller client and either responds or echoes your query."
    config:
      output_style: "conversational"

base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - PERSON
        - PHONE_NUMBER
        - EMAIL_ADDRESS
        - CREDIT_CARD
        - US_SSN
        - US_BANK_NUMBER
        - US_PASSPORT
        - LOCATION
        - DATE_TIME
        - IP_ADDRESS
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params: { new_value: "[EMAIL]" }
        PERSON:
          operator: replace
          params: { new_value: "[PERSON]" }
        PHONE_NUMBER:
          operator: replace
          params: { new_value: "[PHONE]" }
        CREDIT_CARD:
          operator: replace
          params: { new_value: "[CREDIT_CARD]" }
        US_SSN:
          operator: replace
          params: { new_value: "[US_SSN]" }
        US_BANK_NUMBER:
          operator: replace
          params: { new_value: "[US_BANK_NUMBER]" }
        US_PASSPORT:
          operator: replace
          params: { new_value: "[US_PASSPORT]" }
        LOCATION:
          operator: replace
          params: { new_value: "[LOCATION]" }
        DATE_TIME:
          operator: replace
          params: { new_value: "[DATE]" }
        IP_ADDRESS:
          operator: replace
          params: { new_value: "[IP]" }
Example 2: pii_search_example.py and pii_search_example.yaml
Purpose
This example is designed for scripted testing, where a batch of hardcoded queries is sent to an agent.
You can observe how each sensitive element is masked and how the system behaves across multiple PII types.
It uses:
- AsyncDistillerClient
- A simple SearchAgent
- The same PII masking engine and configuration logic as Example 1
Flexible Modes
The script supports two modes:
- Demo mode (enabled by default): runs through sample queries programmatically.
- Interactive mode: comment out the demo and uncomment the interactive section at the bottom to run it live.
pii_search_example.py
# pii_search_example.py
import asyncio
import os
import uuid
from typing import Any, Callable, Dict, Union, cast

from air import login
from air.distiller.client import AsyncDistillerClient

# Authenticate with credentials read from the environment
auth = login(account=str(os.getenv("ACCOUNT")), api_key=str(os.getenv("API_KEY")))


async def search_agent(query: str) -> str:
    """Defining a search agent to test PII masking, which is handled by DistillerClient before this."""
    return f"Processed query:\n{query}"


async def pii_demo():
    # Sample queries covering several PII types
    queries = [
        "Hi, I'm Henry. My number is 4111 1111 1111 1111.",
        "Can you book a meeting with Dr. Jane Doe at (212) 555-7890 on May 4th?",
        "The IP address 192.168.0.1 should be allowed in the firewall.",
        "Email my updated resume to recruiter@company.com.",
        "Her SSN is 123-45-6789 and passport is X1234567.",
    ]

    distiller_client = AsyncDistillerClient()
    distiller_client.create_project(config_path="pii_search_example.yaml", project="pii-demo")

    # One session id for the whole batch, so the masking state is shared
    session_id = str(uuid.uuid4())
    await distiller_client.connect(
        project="pii-demo",
        uuid=session_id,
        executor_dict={"Search Agent": search_agent},
    )

    print("\n[PII Demo] Running Sample Queries\n")
    for i, query in enumerate(queries, 1):
        print(f"Query {i}:\nOriginal: {query}")
        try:
            responses = await distiller_client.query(query)
            async for response in responses:
                print(f"Masked Output:\n{response['content']}\n{'-'*50}")
        except Exception as e:
            print(f"[ERROR] Failed to process query {i}: {e}")
            print("-" * 50)

    await distiller_client.close()


def interactive():
    distiller_client = AsyncDistillerClient()
    distiller_client.create_project(config_path="pii_search_example.yaml", project="pii-demo")

    executor_dict = {"Search Agent": search_agent}

    distiller_client.interactive(
        project="pii-demo",
        uuid="some-uuid",
        executor_dict=cast(Dict[str, Union[Callable[..., Any], Dict[str, Callable[..., Any]]]], executor_dict),
    )


if __name__ == "__main__":
    print("\n[PII Demo] Sample Queries")
    asyncio.run(pii_demo())

    # To try live interaction, comment out the line above and uncomment the next lines:
    # print("\n[PII Demo] Interactive Mode")
    # interactive()
pii_search_example.yaml
orchestrator:
  agent_list:
    - agent_name: "Search Agent"

utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "This agent receives a query with or without sensitive information already masked by the distiller client, performs searches and replies to user."
    config:
      output_style: "conversational"

base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - PERSON
        - PHONE_NUMBER
        - EMAIL_ADDRESS
        - CREDIT_CARD
        - US_SSN
        - US_BANK_NUMBER
        - US_PASSPORT
        - LOCATION
        - DATE_TIME
        - IP_ADDRESS
      entity_operator_mapping:
        EMAIL_ADDRESS:
          operator: replace
          params: { new_value: "[EMAIL]" }
        PERSON:
          operator: replace
          params: { new_value: "[PERSON]" }
        PHONE_NUMBER:
          operator: replace
          params: { new_value: "[PHONE]" }
        CREDIT_CARD:
          operator: replace
          params: { new_value: "[CREDIT_CARD]" }
        US_SSN:
          operator: replace
          params: { new_value: "[US_SSN]" }
        US_BANK_NUMBER:
          operator: replace
          params: { new_value: "[US_BANK_NUMBER]" }
        US_PASSPORT:
          operator: replace
          params: { new_value: "[US_PASSPORT]" }
        LOCATION:
          operator: replace
          params: { new_value: "[LOCATION]" }
        DATE_TIME:
          operator: replace
          params: { new_value: "[DATE]" }
        IP_ADDRESS:
          operator: replace
          params: { new_value: "[IP]" }
For reference
Example | Mode | Client Used | Purpose |
---|---|---|---|
pii_example.py | Interactive | DistillerClient | Try queries manually |
pii_search_example.py | Scripted (or Interactive) | AsyncDistillerClient | Batch-test masking behavior across PII types, or try queries manually with a more complex agent |
Example Interaction
Input:
PII Identified:
[PII MASKING] Detected and masked the following PII types:
- PHONE_NUMBER at [24:38] -> '(212) 555-8124' -> [PHONE_1]
- EMAIL_ADDRESS at [67:89] -> 'john.doe@company.com' -> [EMAIL_1]
Masking by PIIHandler.mask_text():
Agent Output:
Unmasked View (frontend-only):
This view is reconstructed locally in memory using metadata saved during masking. Demasking is only available for the duration of the session and is never persisted or sent to any backend.
Supported PII Types and Operators
Supported PII Types
The PII masking module leverages Microsoft Presidio to detect a broad range of commonly regulated or sensitive data types. Every type you want detected must be explicitly listed in the YAML config under common_entities.
Entity Type | Placeholder Format | Example Match | Description |
---|---|---|---|
EMAIL_ADDRESS | [EMAIL_1] | john.doe@example.com | Email addresses |
PHONE_NUMBER | [PHONE_1] | (212) 555-8124 | US or international phone numbers |
PERSON | [PERSON_1] | Jane Doe | First and last names |
CREDIT_CARD | [CREDIT_CARD_1] | 4111 1111 1111 1111 | Visa/Mastercard/Amex credit cards |
US_SSN | [US_SSN_1] | 123-45-6789 | U.S. Social Security Numbers |
US_BANK_NUMBER | [US_BANK_NUMBER_1] | 987654321 | U.S. bank account numbers |
US_PASSPORT | [US_PASSPORT_1] | X1234567 | U.S. passport numbers |
LOCATION | [LOCATION_1] | 1600 Amphitheatre Parkway | Physical address, city, state, ZIP |
DATE_TIME | [DATE_1] | May 4th, 01/01/2024 | Absolute or relative dates and times |
IP_ADDRESS | [IP_1] | 192.168.0.1, 2001:db8::1 | IPv4 and IPv6 addresses |
To activate detection for a type, include it under common_entities in your YAML config. The default pii_handler.yaml and the examples already include all types above.
Supported PII Operators
Each entity type can be individually configured in the YAML using one of the supported operators below. You define the operator under entity_operator_mapping.
replace
- Replaces the original PII with a structured placeholder (e.g., [EMAIL_1])
- Default behavior if not specified
redact
- Completely removes the PII from the text (no placeholder left behind)
Input:
Masked:
hash
- Replaces the original PII with a hashed representation (irreversible)
Input:
Masked:
DEFAULT Handler (Fallback)
To apply a global fallback to any undefined entity type, use the DEFAULT key, as in the default pii_handler.yaml:
entity_operator_mapping:
  DEFAULT:
    operator: replace
    params:
      new_value: "<PII>"
If Presidio detects an entity type not explicitly listed in entity_operator_mapping, this operator will apply.
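To make the three operators and the DEFAULT fallback concrete, here is a small pure-Python sketch of how one detected value could be transformed. The function apply_operator is hypothetical and is not the SDK's or Presidio's implementation. It assumes SHA-256 for hash and an angle-bracket entity name when replace has no new_value, and it ignores the per-session numbering of placeholders.

```python
import hashlib


def apply_operator(entity_type, value, entity_operator_mapping):
    """Return the masked form of one detected value."""
    # Use the entity's own rule, then DEFAULT, then a plain replace.
    rule = (
        entity_operator_mapping.get(entity_type)
        or entity_operator_mapping.get("DEFAULT")
        or {"operator": "replace"}
    )
    operator = rule.get("operator", "replace")
    params = rule.get("params") or {}

    if operator == "replace":
        return params.get("new_value", f"<{entity_type}>")
    if operator == "redact":
        return ""  # nothing is left behind
    if operator == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()  # one-way digest
    raise ValueError(f"Unsupported operator: {operator}")


mapping = {
    "EMAIL_ADDRESS": {"operator": "replace", "params": {"new_value": "[EMAIL]"}},
    "US_SSN": {"operator": "redact"},
    "CREDIT_CARD": {"operator": "hash"},
    "DEFAULT": {"operator": "replace", "params": {"new_value": "<PII>"}},
}

print(apply_operator("EMAIL_ADDRESS", "john.doe@example.com", mapping))
print(repr(apply_operator("US_SSN", "123-45-6789", mapping)))
print(apply_operator("IBAN_CODE", "DE00 0000 0000", mapping))
```

Note that only replace with session-tracked placeholders can be demasked later; redacted and hashed values cannot be restored from the masked text.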
Advanced Customization
The PII Masking Module lets you tailor both which entities to detect and how to handle them. All customizations are centralized in the same YAML configuration file used for the agent orchestration (e.g., pii_example.yaml or pii_search_example.yaml), under base_config.pii_masking.
Adding More Entities
If Presidio supports additional PII types (e.g., IBAN_CODE, MEDICAL_LICENSE, or custom recognizers), you can extend your config:
base_config:
  pii_masking:
    enable: True
    config:
      common_entities:
        - IBAN_CODE
        - MEDICAL_LICENSE
        - PERSON
Make sure to also define the masking behavior for each added entity under entity_operator_mapping.
You can find the full list of built-in entity types in Presidio's documentation.
Defining Custom Operators or Placeholder Formats
You may redefine any placeholder format per entity by customizing new_value, enable hashing (operator: hash) for irreversible masking, or remove PII altogether with operator: redact (no placeholder shown).
Creating Multiple YAML Variants
You can maintain multiple config files (e.g., pii_example.yaml, pii_search_example.yaml, pii_strict.yaml) with different combinations of:
- Enabled/disabled masking
- Different entity sets
- Operator schemes
- Agent configurations
Then pass the desired YAML to create_project(config_path=...) when registering your project.
Use Case Matrix
Below is a guide to help you decide when to use PII masking and how to configure it:
Use Case | Masking Enabled | Recommended Operator | Why This Matters |
---|---|---|---|
Production inference | Yes | replace | Prevents raw PII from reaching logs, models, or monitoring agents |
Internal debugging | Optional | N/A | Devs can see original inputs for issue diagnosis |
Compliance audits | Yes | replace, hash | Shows evidence of redaction while retaining traceability |
External demo/showcases | Yes | replace | Guarantees privacy-safe interactions during live sessions |
QA & annotation tooling | Optional | replace, redact | Keep data semi-anonymized during human reviews |
Analytics dashboards | Yes | replace, redact | Prevents PII leakage into metrics or reporting tools |
Sensitive search indexing | Yes | hash, redact | Allows indexing without storing personal data |