AWS Bedrock AgentCore Evaluations: The GDPR Risk Nobody's Talking About

Table of Contents

How I Fell Into the Compliance Rabbit Hole
#

AWS re:Invent announced Amazon Bedrock AgentCore Evaluations. Managed LLM-as-judge for your AI agents — sounds great, right?
So I did what any architect does: I read the documentation to understand how to integrate it.

What started as “let me evaluate this new AWS service” turned into a deep dive that revealed a serious GDPR compliance risk hiding in plain sight. Not with Bedrock AgentCore Evaluations itself — but with the entire observability pattern we’ve all been using for AI agents.

And if you’re building AI agents in production, you need to know about this.

To be clear: this isn’t an “AWS is bad” problem. It exists across AWS, Azure, GCP, Datadog, and pretty much every observability stack. These systems were designed for debuggability and performance — not per-user erasure under GDPR.

The Two Integration Paths
#

Amazon Bedrock AgentCore Evaluations gives you two options:

Online evaluation: Reads conversations directly from CloudWatch Logs + OpenTelemetry
On-demand evaluation: Requires extra steps on the agent side (at which point, why not just run your own LLM-as-judge?)

I started with online evaluation because it seemed simpler.

The CloudWatch Trap
#

Online evaluation means shipping your entire conversation to CloudWatch Logs.

I live in the EU. I architect data systems. I breathe GDPR compliance. And here’s the truth:

Your conversation history WILL contain PII. Always.

Users naturally share:

Names, locations, email addresses (direct identifiers)
Job details, family information, health data (indirect identifiers)
Unique writing styles and behavioral patterns
Contextual details that combine to identify them

You cannot scrub this data without destroying the functionality of your agent. And even if you think you can anonymize it — GDPR says pseudonymized data is still personal data. You must delete the actual data, not just the user-to-ID mapping.

The killer: GDPR’s right to erasure. Users can request deletion within 1 month. CloudWatch? Partition-based deletion only. You can’t delete a single user’s history.

“But wait,” you might say, “what about Amazon Bedrock AgentCore Observability?”

Yeah. That’s just a wrapper around CloudWatch + X-Ray. Same immutable storage. Same partition-based deletion. Same GDPR problem.

Then Came the Traces
#

Reading further, AWS documentation says:

AgentCore Evaluations integrates with popular agent frameworks including Strands and LangGraph with OpenTelemetry and OpenInference instrumentation libraries…

Wait. Traces?

Classic distributed traces don’t contain message content, right? So I dug into Strands documentation (my framework of choice).

The oh-no moment
#

Check the example trace output. Messages are part of the trace data.

Then I went down the OpenTelemetry GenAI spec rabbit hole. According to the official examples, traces explicitly include:

Full user messages
AI responses
Retrieved context
Function arguments

This creates a serious GDPR problem.

Where Your PII Actually Lives
#

Here’s what nobody tells you when you set up AI agent observability:

Your PII is now in at minimum three places:

Your database (hopefully deletable)
CloudWatch Logs (immutable, partition-based deletion only)
OpenTelemetry traces → Datadog / New Relic / Honeycomb (also immutable)

Typical pattern:

User → AI Agent → Database (deletable)
       ↓
 CloudWatch / X-Ray (immutable)
       ↓
OpenTelemetry → Datadog (immutable)

Why this breaks GDPR expectations:

No granular deletion
Full message content in immutable storage
Retention often 30–90+ days
Multiple processors involved

You can delete records from PostgreSQL instantly.
But the same conversation still lives in CloudWatch and Datadog — and you can’t remove it.

The Third-Party Processor Nightmare
#

You typically have two tracing paths:

AWS X-Ray

Stays within AWS
Covered by your existing AWS DPA
Still immutable, partition-based deletion

Third-party (Datadog, New Relic, Honeycomb)

Additional data processors
Cross-company data flow
Often US-based
No granular deletion

What I see in practice:
Teams start with X-Ray, then export to Datadog anyway. Or they send logs to both CloudWatch and another platform.

Now full conversations exist in multiple immutable systems across multiple processors.

This is often happening by default, just by following “observability quickstart” guides.

The Access Control Blindspot
#

Even before deletion, there’s another issue:

Everyone with access to observability tools can read user conversations.

Developers debugging bugs? See PII.
SREs investigating latency? See PII.
Support teams? See PII.
Contractors? Also see PII.

This violates core GDPR principles:

Data minimization
Purpose limitation
Access control / least privilege

And this is the default behaviour of many AI agent frameworks.

Architectural separation is not optional. It’s the only sane approach.

What You Should Actually Do
#

Here’s the pattern that actually works.

1. Architectural Separation (Non-negotiable)
#

Deletable storage (PostgreSQL, DynamoDB, etc.)

Full chat messages
User profiles
All PII
Granular, immediate deletion
Strict access control

Immutable systems (CloudWatch, X-Ray, Datadog)

Metadata ONLY
Token counts
Latency
Error codes
Hashed user IDs (one-way)
No message content
30-day retention max

When there’s no PII, developers can safely access these systems.

2. Fix Your OpenTelemetry Instrumentation
#

Default instrumentation is the real danger.

Configure it to send:

✅ Latency
✅ Token usage
✅ Error type
✅ Hashed user ID

❌ Full prompts
❌ Model responses
❌ Retrieved context
❌ Function arguments with PII

I’ve already opened a feature request with Strands to make this the default. AI frameworks should be compliant by design, not compliant by extra effort.

3. The 7–14 Day Rule (If you can’t fix it yet)
#

If you absolutely must send message content (legacy reasons, framework limitations, transition period), your retention matters.

30 days – common and defensible, but extremely tight
90+ days – almost impossible to justify for chat data
7–14 days – much safer, still operationally useful

This is mitigation, not a solution.
The real fix is: no content in immutable systems.

Quick Audit Checklist
#

If you’re running AI agents in prod, ask yourself:

Do any logs or traces contain full prompts or responses?
Can I delete a single user’s data from all systems?
Who inside the company can query these logs?
Is retention longer than 30 days?
Do third-party tools receive this data?

If any of these make you uncomfortable — you’ve found your starting point.

The Bottom Line
#

The default observability pattern for AI agents creates major GDPR risk.

It’s not about CloudWatch vs Datadog. It’s about what you send to them.

As long as AI frameworks export full conversations to immutable systems, you have:

No granular deletion
Excessive retention
Over-broad internal access
Multiple data processors

The fix is architectural and simple:

Chat content → Deletable storage only
Observability → Metadata only
No messages in traces/logs
Strict access control

What’s Next
#

Honestly? I’m still processing this.

What started as “let me evaluate a new AWS service” turned into realizing that the entire AI observability ecosystem has a compliance problem baked into its defaults.

If you’re building AI systems in production:

Audit your pipeline. Fix your retention. Separate your storage.

We’ll figure out the rest as we go.

Building AI in the EU? Send me your war stories. Misery loves company.
— The Pragmatical Architect

Author

Andor Markus

Data & AI Architect in professional services. I spend my days implementing solutions, not just designing them. I share what we learn when theory meets reality—both at work and through personal projects. The wins, the struggles, and what actually sticks.

How I Fell Into the Compliance Rabbit Hole #

The Two Integration Paths #

The CloudWatch Trap #

Then Came the Traces #

The oh-no moment #

Where Your PII Actually Lives #

The Third-Party Processor Nightmare #

The Access Control Blindspot #

What You Should Actually Do #

1. Architectural Separation (Non-negotiable) #

2. Fix Your OpenTelemetry Instrumentation #

3. The 7–14 Day Rule (If you can’t fix it yet) #

Quick Audit Checklist #

The Bottom Line #

What’s Next #