Introduction
2:47 AM: PagerDuty alert: "Securitain AI latency spike"
2:48 AM: Engineer wakes up, checks logs
2:49 AM: "Which AI model failed? What was the input? Did it retry? Was data leaked?"
2:50 AM: Realizes: We can't see what the AI is doing.
That 3 AM wake-up call changed everything. We built observability that doesn't just monitor uptime: it reveals intent, latency, safety metrics, and business impact across every AI interaction.
This article shares how Cloudain transformed AI from a black box into a transparent, auditable system that engineers, compliance teams, and executives can trust.
Observability as the Foundation of Responsible AI
Why AI Needs Different Observability
Traditional Software:
Request → Function → Database → Response
Observable: Latency, errors, throughput
AI Systems:
Request → Intent Classification → Context Loading →
Model Inference → Safety Checks → Response Generation →
Audit Logging → Delivery
Need to Observe:
- Which intent was detected (and confidence)
- Which model was used (and why)
- How much context was loaded (tokens)
- What safety guardrails triggered
- Whether PII was detected/redacted
- Total cost of the interaction
- Business outcome (resolved? escalated?)
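One way to collect the per-stage signals listed above is to wrap each pipeline stage in a small timer. The following is a minimal sketch under stated assumptions, not Cloudain's implementation: `StageTimer`, the stage names, and the injectable clock are all hypothetical, and the stages are synchronous to keep the example short.

```typescript
// Hypothetical helper: records how long each pipeline stage takes.
type Clock = () => number; // returns a timestamp in milliseconds

class StageTimer {
  private durations: Record<string, number> = {};
  private now: Clock;

  constructor(now: Clock = () => Date.now()) {
    this.now = now;
  }

  // Run a stage and record its elapsed time, even if the stage throws.
  time<T>(stage: string, fn: () => T): T {
    const start = this.now();
    try {
      return fn();
    } finally {
      const elapsed = this.now() - start;
      this.durations[stage] = (this.durations[stage] ?? 0) + elapsed;
    }
  }

  // Per-stage breakdown plus the total across all recorded stages.
  report(): { breakdown: Record<string, number>; total: number } {
    const breakdown: Record<string, number> = { ...this.durations };
    let total = 0;
    for (const stage of Object.keys(breakdown)) {
      total += breakdown[stage] ?? 0;
    }
    return { breakdown, total };
  }
}
```

In a real request path the stages would be asynchronous; the same try/finally shape applies with `await fn()` inside an `async` method.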
The Black Box Problem
Without Observability:
User: "Cancel my subscription"
AI: [some response]
Engineer: "Did it work?"
Team: "¯\_(ツ)_/¯"
With Observability:
User: "Cancel my subscription"
Telemetry:
- Intent: CancelSubscription (confidence: 0.94)
- Model: bedrock-claude-v2 (primary)
- Context: 2,340 tokens loaded
- Policy Check: Requires manager approval
- Guardrails: 0 triggered
- Latency: 680ms
- Cost: $0.042
- Outcome: Approval workflow created
- User Satisfaction: 9/10 (post-interaction survey)
CoreCloud Compliance Integration
Why Compliance Teams Care About AI Observability
SOC 2 Auditor Questions:
- "How do you ensure AI doesn't access unauthorized data?"
- "Can you prove AI decisions are auditable?"
- "Show me access logs for the past 90 days"
HIPAA Requirements:
- Who accessed patient data?
- When was PHI processed by AI?
- Were encryption standards maintained?
GDPR Obligations:
- Right to explanation: Why did AI make this decision?
- Data minimization: Was only necessary data used?
- Retention: When will AI-processed data be deleted?
CoreCloud's Audit Trail
interface ComplianceEvent {
  eventId: string
  timestamp: number
  userId: string
  brand: string
  action: string
  intent?: string

  // Data access
  dataAccessed: {
    types: string[]          // ["customer_name", "billing_info"]
    classification: string   // "PII", "PHI", "financial"
    justification: string    // Why this data was needed
  }

  // AI-specific
  aiMetadata?: {
    modelUsed: string
    inputTokens: number
    outputTokens: number
    piiDetected: boolean
    piiRedacted: boolean
    guardrailsTriggered: string[]
  }

  // Compliance
  complianceFrameworks: string[]  // ["SOC2", "GDPR", "HIPAA"]
  retentionPolicy: string

  // Audit
  ipAddress: string
  userAgent: string
  sessionId: string
}
Example Event:
{
  "eventId": "evt_20250122_abc123",
  "timestamp": 1705910400000,
  "userId": "user_789",
  "brand": "securitain",
  "action": "ai_query_processed",
  "intent": "CheckComplianceStatus",
  "dataAccessed": {
    "types": ["company_name", "compliance_framework", "audit_logs"],
    "classification": "business_confidential",
    "justification": "User requested compliance status for their organization"
  },
  "aiMetadata": {
    "modelUsed": "bedrock-claude-v2",
    "inputTokens": 1240,
    "outputTokens": 580,
    "piiDetected": false,
    "piiRedacted": false,
    "guardrailsTriggered": []
  },
  "complianceFrameworks": ["SOC2", "GDPR"],
  "retentionPolicy": "7_years",
  "ipAddress": "203.0.113.42",
  "userAgent": "Mozilla/5.0...",
  "sessionId": "sess_xyz789"
}
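Before an event like this reaches the audit store, it can help to reject records that would not satisfy an auditor. The sketch below works on a trimmed copy of the interface; the rule set and the names `AuditRecord` and `validateComplianceEvent` are hypothetical and only illustrate the kind of checks that are possible.

```typescript
// Trimmed, hypothetical subset of ComplianceEvent used for illustration.
interface AuditRecord {
  eventId: string;
  dataAccessed: { types: string[]; classification: string; justification: string };
  aiMetadata?: { piiDetected: boolean; piiRedacted: boolean };
  complianceFrameworks: string[];
  retentionPolicy: string;
}

// Returns human-readable problems; an empty list means the record passes.
function validateComplianceEvent(e: AuditRecord): string[] {
  const problems: string[] = [];
  // Any data access should carry a reason an auditor can read.
  if (e.dataAccessed.types.length > 0 && e.dataAccessed.justification.trim() === "") {
    problems.push("data access has no justification");
  }
  // Redaction without detection suggests inconsistent safety metadata.
  if (e.aiMetadata && e.aiMetadata.piiRedacted && !e.aiMetadata.piiDetected) {
    problems.push("piiRedacted is true but piiDetected is false");
  }
  if (e.complianceFrameworks.length === 0) {
    problems.push("no compliance framework tagged");
  }
  if (e.retentionPolicy.trim() === "") {
    problems.push("missing retention policy");
  }
  return problems;
}
```

Returning a list rather than throwing on the first failure lets the caller log every problem with the event in one pass.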
Compliance Dashboards
For Auditors:
// Generate compliance report
const report = await CoreCloud.generateComplianceReport({
  startDate: '2024-01-01',
  endDate: '2024-12-31',
  framework: 'SOC2',
  brand: 'securitain'
})
// Report includes:
// - Total AI interactions: 2.4M
// - PII access events: 847K (all authorized)
// - Guardrail triggers: 1,203 (all handled correctly)
// - Unauthorized access attempts: 0
// - Data retention compliance: 100%
// - Encryption compliance: 100%
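The headline counts in a report like this are plain aggregations over audit events. As a rough sketch (not the CoreCloud implementation), the function below folds a list of events into the same kind of totals; the `authorized` flag and the type names are assumptions made for the example.

```typescript
// Hypothetical, trimmed view of an audit event for aggregation.
interface AuditSummaryInput {
  action: string;                // e.g. "pii_accessed", "ai_query_processed"
  authorized: boolean;           // assumed to be set by the access-control layer
  guardrailsTriggered: string[];
}

interface AuditSummary {
  totalInteractions: number;
  piiAccessEvents: number;
  guardrailTriggers: number;
  unauthorizedAttempts: number;
}

// Fold a list of audit events into report-level counts.
function summarizeAuditEvents(events: AuditSummaryInput[]): AuditSummary {
  const summary: AuditSummary = {
    totalInteractions: 0,
    piiAccessEvents: 0,
    guardrailTriggers: 0,
    unauthorizedAttempts: 0,
  };
  for (const e of events) {
    summary.totalInteractions += 1;
    if (e.action === "pii_accessed") summary.piiAccessEvents += 1;
    summary.guardrailTriggers += e.guardrailsTriggered.length;
    if (!e.authorized) summary.unauthorizedAttempts += 1;
  }
  return summary;
}
```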
Real-Time Compliance Alerts:
// Alert on suspicious activity
await CloudWatch.putMetricAlarm({
  AlarmName: 'UnauthorizedAIDataAccess',
  MetricName: 'UnauthorizedAccess',
  Namespace: 'Cloudain/Compliance',
  Statistic: 'Sum',
  Period: 60,            // evaluation window in seconds (assumed value)
  EvaluationPeriods: 1,  // alarm on the first breaching window
  Threshold: 1,
  ComparisonOperator: 'GreaterThanOrEqualToThreshold',
  AlarmActions: [
    securityTeamSNS,
    complianceTeamSNS,
    autoBlockUserLambda
  ]
})
AI Telemetry Streams via AgenticCloud
What We Capture
interface AITelemetry {
  // Request metadata
  requestId: string
  timestamp: number
  brand: string
  userId: string
  sessionId: string

  // Intent analysis
  intent: {
    detected: string
    confidence: number
    alternativeIntents: Array<{
      name: string
      confidence: number
    }>
    classificationLatency: number
  }

  // Context & memory
  context: {
    tokensLoaded: number
    messagesInContext: number
    cacheHit: boolean
    contextLoadLatency: number
  }

  // Model inference
  model: {
    provider: string
    modelId: string
    parameters: {
      temperature: number
      maxTokens: number
      topP?: number
    }
    inputTokens: number
    outputTokens: number
    inferenceLatency: number
    cost: number
    retryCount: number
    fallbackUsed: boolean
  }

  // Safety & compliance
  safety: {
    piiDetected: boolean
    piiRedacted: boolean
    guardrailsChecked: string[]
    guardrailsTriggered: string[]
    contentFiltered: boolean
  }

  // Response quality
  response: {
    completed: boolean
    truncated: boolean
    characterCount: number
    sentimentScore?: number
  }

  // Performance
  performance: {
    totalLatency: number
    breakdown: {
      auth: number
      intentClassification: number
      contextLoading: number
      modelInference: number
      safetyChecks: number
      responseFormatting: number
    }
  }

  // Business metrics
  business: {
    resolved: boolean
    escalated: boolean
    userSatisfaction?: number
    followUpRequired: boolean
  }

  // Errors
  errors?: Array<{
    type: string
    message: string
    timestamp: number
    recovered: boolean
  }>
}
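The `model.cost` field has to be computed somewhere, usually from the token counts and a per-model price table. The sketch below shows one way to do that; `TokenPricing` and `estimateCost` are hypothetical names, and any prices passed in are placeholders, not actual rates for any provider.

```typescript
// Hypothetical price table entry; the values supplied are placeholders.
interface TokenPricing {
  inputPer1k: number;  // USD per 1,000 input tokens
  outputPer1k: number; // USD per 1,000 output tokens
}

// Estimate the USD cost of one model call from its token counts.
function estimateCost(inputTokens: number, outputTokens: number, pricing: TokenPricing): number {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  const usd =
    (inputTokens / 1000) * pricing.inputPer1k +
    (outputTokens / 1000) * pricing.outputPer1k;
  // Round to micro-dollars so floating-point noise does not leak into metrics.
  return Math.round(usd * 1e6) / 1e6;
}
```

Keeping the price table outside the function makes it straightforward to attribute cost per model when a fallback model with different pricing handles the request.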
Real-Time Streaming
// Stream telemetry to Kinesis
async function emitTelemetry(telemetry: AITelemetry) {
  await Kinesis.putRecord({
    StreamName: 'cloudain-ai-telemetry',
    PartitionKey: telemetry.brand,
    Data: JSON.stringify(telemetry)
  })

  // Also log to CloudWatch for real-time dashboards
  await CloudWatch.putMetricData({
    Namespace: 'Cloudain/AgenticCloud',
    MetricData: [
      {
        MetricName: 'AILatency',
        Value: telemetry.performance.totalLatency,
        Unit: 'Milliseconds',
        Dimensions: [
          { Name: 'Brand', Value: telemetry.brand },
          { Name: 'Intent', Value: telemetry.intent.detected }
        ]
      },
      {
        MetricName: 'TokenUsage',
        Value: telemetry.model.inputTokens + telemetry.model.outputTokens,
        Unit: 'Count',
        Dimensions: [
          { Name: 'Brand', Value: telemetry.brand },
          { Name: 'Model', Value: telemetry.model.modelId }
        ]
      },
      {
        MetricName: 'AICost', // same name the cost dashboard and alarm read
        Value: telemetry.model.cost,
        Unit: 'None',
        Dimensions: [
          { Name: 'Brand', Value: telemetry.brand }
        ]
      }
    ]
  })
}
Real-Time Dashboards for CX and Compliance
Executive Dashboard
KPIs Tracked:
- Total conversations (hourly, daily, monthly)
- Average cost per conversation
- User satisfaction scores
- Resolution rate
- Escalation rate
- Model performance by brand
Implementation:
// CloudWatch Dashboard
const dashboard = {
  widgets: [
    // Total Conversations
    {
      type: 'metric',
      properties: {
        metrics: [
          ['Cloudain/AgenticCloud', 'Conversations', { stat: 'Sum' }]
        ],
        period: 3600,
        stat: 'Sum',
        region: 'us-east-1',
        title: 'Conversations per Hour'
      }
    },
    // Average Latency by Brand
    {
      type: 'metric',
      properties: {
        metrics: [
          // '...' repeats the leading fields of the previous entry, so the first entry is spelled out.
          // Brand dimension values are assumed to be lowercase, as in the telemetry examples.
          ['Cloudain/AgenticCloud', 'AILatency', 'Brand', 'growain', { stat: 'Average', label: 'Growain' }],
          ['...', 'securitain', { stat: 'Average', label: 'Securitain' }],
          ['...', 'mindagain', { stat: 'Average', label: 'MindAgain' }]
        ],
        title: 'Average Latency by Brand',
        yAxis: { left: { min: 0, max: 2000 } }
      }
    },
    // Cost Tracking
    {
      type: 'metric',
      properties: {
        metrics: [
          ['Cloudain/AgenticCloud', 'AICost', { stat: 'Sum' }]
        ],
        title: 'AI Cost (Last 24 Hours)',
        stat: 'Sum',
        period: 86400
      }
    },
    // User Satisfaction
    {
      type: 'metric',
      properties: {
        metrics: [
          ['Cloudain/Business', 'UserSatisfaction', { stat: 'Average' }]
        ],
        title: 'Average User Satisfaction (1-10)',
        yAxis: { left: { min: 0, max: 10 } }
      }
    }
  ]
}
Engineering Dashboard
Technical Metrics:
- P50, P95, P99 latency
- Error rates by type
- Model fallback frequency
- Cache hit rates
- Token usage trends
- Retry and timeout events
// Latency percentiles
const latencyMetrics = {
  widget: {
    type: 'metric',
    properties: {
      metrics: [
        ['Cloudain/AgenticCloud', 'AILatency', { stat: 'p50' }],
        ['...', { stat: 'p95' }],
        ['...', { stat: 'p99' }]
      ],
      title: 'Latency Percentiles',
      period: 300
    }
  }
}

// Error tracking
const errorMetrics = {
  widget: {
    type: 'metric',
    properties: {
      metrics: [
        ['Cloudain/AgenticCloud', 'Errors', { stat: 'Sum' }],
        ['...', 'IntentClassificationErrors', { stat: 'Sum' }],
        ['...', 'ModelInferenceErrors', { stat: 'Sum' }],
        ['...', 'ContextLoadErrors', { stat: 'Sum' }]
      ],
      title: 'Error Breakdown',
      period: 300
    }
  }
}
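CloudWatch computes the p50/p95/p99 statistics server-side, but it is useful to know what the numbers mean when reading the dashboard. The sketch below uses the nearest-rank definition, which is one of several common percentile definitions and not necessarily the one CloudWatch applies; `percentile` is a hypothetical helper.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing to limit floating-point error in the rank.
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based
  return sorted[rank - 1] as number;
}

// Example: ten latency samples in milliseconds.
const latencySamples = [420, 510, 480, 650, 700, 390, 560, 1900, 610, 530];
const p50 = percentile(latencySamples, 50);
const p99 = percentile(latencySamples, 99);
```

With samples like these, a single slow request dominates p99 while leaving p50 almost untouched, which is why the engineering dashboard tracks both.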
Compliance Dashboard
Audit Metrics:
- PII access events
- Guardrail triggers
- Policy violations
- Data retention compliance
- Encryption status
- Access control events
// Compliance widget
const complianceWidget = {
  widget: {
    type: 'log',
    properties: {
      query: `
        SOURCE 'cloudain-audit-logs'
        | fields @timestamp, userId, action, dataAccessed
        | filter action = 'pii_accessed'
        | stats count() as accessCount by userId
        | sort accessCount desc
      `,
      title: 'PII Access by User (Last 24h)',
      region: 'us-east-1'
    }
  }
}

// Guardrail monitoring
const guardrailWidget = {
  widget: {
    type: 'metric',
    properties: {
      metrics: [
        ['Cloudain/Safety', 'GuardrailTriggered', { stat: 'Sum' }]
      ],
      title: 'Safety Guardrail Activations',
      annotations: {
        horizontal: [{
          value: 100,
          label: 'Review Threshold'
        }]
      }
    }
  }
}
Distributed Tracing
X-Ray Integration
// Instrument AI request flow
import AWSXRay from 'aws-xray-sdk-core'

async function processAIRequest(request: Request) {
  const segment = AWSXRay.getSegment()
  // getSegment() returns undefined when no trace context is active
  if (!segment) throw new Error('No active X-Ray segment')

  // Intent classification subsegment
  const intentSegment = segment.addNewSubsegment('IntentClassification')
  intentSegment.addAnnotation('brand', request.brand)
  const intent = await classifyIntent(request.message)
  intentSegment.addMetadata('result', {
    intent: intent.name,
    confidence: intent.confidence
  })
  intentSegment.close()

  // Context loading subsegment
  const contextSegment = segment.addNewSubsegment('ContextLoading')
  const context = await loadContext(request.sessionId)
  contextSegment.addMetadata('tokensLoaded', context.tokens)
  contextSegment.close()

  // Model inference subsegment
  const modelSegment = segment.addNewSubsegment('ModelInference')
  modelSegment.addAnnotation('model', 'bedrock-claude-v2')
  const response = await generateResponse(intent, context)
  modelSegment.addMetadata('tokens', {
    input: response.inputTokens,
    output: response.outputTokens,
    cost: response.cost
  })
  modelSegment.close()

  return response
}
Trace Visualization:
Request (850ms total)
├─ Auth Verification (15ms)
├─ Intent Classification (120ms)
│ ├─ Load model (45ms)
│ └─ Inference (75ms)
├─ Context Loading (85ms)
│ ├─ Redis lookup (5ms)
│ └─ DynamoDB query (80ms)
├─ Model Inference (580ms) ← Bottleneck
│ ├─ Token encoding (20ms)
│ ├─ Bedrock API call (540ms)
│ └─ Response parsing (20ms)
├─ Safety Checks (35ms)
└─ Response Formatting (15ms)
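The bottleneck in a trace like this can also be found programmatically from the per-stage breakdown, which is handy for tagging slow requests automatically. A minimal sketch, assuming a flat stage-to-milliseconds map (`findBottleneck` and the map shape are hypothetical):

```typescript
type LatencyBreakdown = Record<string, number>; // stage name -> milliseconds

// Return the slowest stage and its share of the summed latency.
function findBottleneck(
  breakdown: LatencyBreakdown
): { stage: string; ms: number; share: number } | undefined {
  let total = 0;
  let worst: { stage: string; ms: number } | undefined;
  for (const stage of Object.keys(breakdown)) {
    const ms = breakdown[stage] ?? 0;
    total += ms;
    if (worst === undefined || ms > worst.ms) {
      worst = { stage, ms };
    }
  }
  if (worst === undefined || total === 0) return undefined;
  return { stage: worst.stage, ms: worst.ms, share: worst.ms / total };
}

// Top-level stages from the trace above.
const traceBreakdown: LatencyBreakdown = {
  auth: 15,
  intentClassification: 120,
  contextLoading: 85,
  modelInference: 580,
  safetyChecks: 35,
  responseFormatting: 15,
};
const bottleneck = findBottleneck(traceBreakdown);
```

For the trace shown, the six top-level stages sum to the 850 ms total, and model inference accounts for roughly two thirds of it.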
Alerting and Anomaly Detection
Smart Alerts
// Alert on latency degradation
await CloudWatch.putMetricAlarm({
  AlarmName: 'AILatencyDegradation',
  ComparisonOperator: 'GreaterThanThreshold',
  EvaluationPeriods: 2,
  MetricName: 'AILatency',
  Namespace: 'Cloudain/AgenticCloud',
  Period: 300,
  Statistic: 'Average',
  Threshold: 1000, // 1 second
  ActionsEnabled: true,
  AlarmActions: [engineeringSNS],
  AlarmDescription: 'AI response time exceeds 1 second'
})

// Alert on cost spike
await CloudWatch.putMetricAlarm({
  AlarmName: 'AICostSpike',
  ComparisonOperator: 'GreaterThanThreshold',
  EvaluationPeriods: 1,
  MetricName: 'AICost',
  Namespace: 'Cloudain/AgenticCloud',
  Period: 3600,
  Statistic: 'Sum',
  Threshold: 500, // $500/hour
  ActionsEnabled: true,
  AlarmActions: [
    financeTeamSNS,
    autoThrottleLambda
  ]
})

// Alert on guardrail triggers
await CloudWatch.putMetricAlarm({
  AlarmName: 'HighGuardrailActivity',
  ComparisonOperator: 'GreaterThanThreshold',
  EvaluationPeriods: 1,
  MetricName: 'GuardrailTriggered',
  Namespace: 'Cloudain/Safety',
  Period: 300,
  Statistic: 'Sum',
  Threshold: 50,
  ActionsEnabled: true,
  AlarmActions: [securityTeamSNS]
})
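To reason about when alarms like these fire, it helps to model the evaluation rule locally. The sketch below is a simplified stand-in for a `GreaterThanThreshold` alarm with `EvaluationPeriods`: it goes to ALARM only when the most recent N period values all exceed the threshold. Real CloudWatch alarms also support `DatapointsToAlarm` and missing-data handling, which this hypothetical `evaluateAlarm` ignores.

```typescript
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// datapoints: one aggregated value per period, oldest first.
function evaluateAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number
): AlarmState {
  if (evaluationPeriods < 1) {
    throw new RangeError("evaluationPeriods must be at least 1");
  }
  if (datapoints.length < evaluationPeriods) return "INSUFFICIENT_DATA";
  const recent = datapoints.slice(-evaluationPeriods);
  // Strictly greater than, matching GreaterThanThreshold.
  return recent.every((v) => v > threshold) ? "ALARM" : "OK";
}

// Average latency per 5-minute period, in milliseconds.
const latencyState = evaluateAlarm([800, 1200, 1100], 1000, 2);
```

Requiring two consecutive breaching periods, as the latency alarm does, trades a few minutes of detection time for fewer pages caused by a single slow window.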
Anomaly Detection
// ML-powered anomaly detection
// Metric identity shared by the detector and the alarm
const latencyMetric = {
  MetricName: 'AILatency',
  Namespace: 'Cloudain/AgenticCloud',
  Dimensions: [{ Name: 'Brand', Value: 'growain' }]
}

await CloudWatch.putAnomalyDetector({
  ...latencyMetric,
  Stat: 'Average',
  Configuration: {
    ExcludedTimeRanges: [
      {
        StartTime: new Date('2024-12-25T00:00:00Z'),
        EndTime: new Date('2024-12-26T00:00:00Z')
      }
    ]
  }
})

// Alert on anomalies
await CloudWatch.putMetricAlarm({
  AlarmName: 'AILatencyAnomaly',
  ComparisonOperator: 'LessThanLowerOrGreaterThanUpperThreshold',
  EvaluationPeriods: 2,
  Metrics: [
    {
      Id: 'm1',
      ReturnData: true,
      MetricStat: {
        Metric: latencyMetric, // Metric takes only Namespace, MetricName, Dimensions
        Period: 300,
        Stat: 'Average'
      }
    },
    {
      Id: 'ad1',
      Expression: 'ANOMALY_DETECTION_BAND(m1, 2)', // 2 std deviations
      Label: 'AILatency (expected)'
    }
  ],
  ThresholdMetricId: 'ad1',
  AlarmActions: [engineeringSNS]
})
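The band width of 2 in `ANOMALY_DETECTION_BAND(m1, 2)` can be pictured with a much simpler model. CloudWatch trains a model that accounts for trends and seasonality; the sketch below is only a static stand-in that builds a band of mean ± k standard deviations from recent history, with hypothetical helper names `anomalyBand` and `isAnomalous`.

```typescript
// Build a static expected band from historical values (sample std deviation).
function anomalyBand(history: number[], k: number): { lower: number; upper: number } {
  if (history.length < 2) throw new Error("need at least two historical points");
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance =
    history.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / (history.length - 1);
  const sd = Math.sqrt(variance);
  return { lower: mean - k * sd, upper: mean + k * sd };
}

// A value is anomalous when it falls outside the band on either side,
// mirroring LessThanLowerOrGreaterThanUpperThreshold.
function isAnomalous(value: number, history: number[], k: number = 2): boolean {
  const band = anomalyBand(history, k);
  return value < band.lower || value > band.upper;
}
```

A static band like this would flag every predictable daily peak, which is the main reason to prefer a trained detector for production alerting.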
Log Aggregation and Search
Structured Logging
// Structured log format
import winston from 'winston'
// CloudWatchTransport is assumed to come from a transport package such as winston-cloudwatch

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new CloudWatchTransport({
      logGroupName: 'cloudain-ai-logs',
      logStreamName: `${brand}-${environment}`
    })
  ]
})

// Log AI interaction
logger.info('AI interaction completed', {
  requestId: request.id,
  userId: user.id,
  brand: 'growain',
  intent: 'CampaignAnalysis',
  latency: 680,
  tokens: { input: 1200, output: 350 },
  cost: 0.042,
  satisfied: true
})
CloudWatch Insights Queries
# Top 10 slowest intents
fields @timestamp, intent, latency
| filter brand = "growain"
| stats avg(latency) as avg_latency by intent
| sort avg_latency desc
| limit 10

# Cost by brand (last 24 hours)
fields @timestamp, brand, cost
| stats sum(cost) as total_cost by brand
| sort total_cost desc

# Error analysis
fields @timestamp, error.type, error.message, brand, intent
| filter ispresent(error.type)
| stats count() as error_count by error.type, brand
| sort error_count desc

# Guardrail triggers
fields @timestamp, userId, guardrailsTriggered.0 as guardrail, intent
| filter ispresent(guardrail)
| stats count() as trigger_count by guardrail
| sort trigger_count desc
Business Intelligence Integration
Data Warehouse Pipeline
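// Stream telemetry to S3 for analytics
// Kinesis → Firehose → S3 → Glue → Athena/Redshift

// Firehose configuration
// FIREHOSE_ROLE_ARN and the error prefix are placeholders, not values from the original setup
const firehoseConfig = {
  DeliveryStreamName: 'cloudain-ai-analytics',
  // Record format conversion is only available on the extended S3 destination
  ExtendedS3DestinationConfiguration: {
    RoleARN: FIREHOSE_ROLE_ARN,
    BucketARN: 'arn:aws:s3:::cloudain-analytics',
    Prefix: 'ai-telemetry/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
    ErrorOutputPrefix: 'ai-telemetry-errors/!{firehose:error-output-type}/',
    BufferingHints: {
      SizeInMBs: 128,
      IntervalInSeconds: 300
    },
    // Parquet output is compressed by the serializer, so S3-level compression stays off
    CompressionFormat: 'UNCOMPRESSED',
    DataFormatConversionConfiguration: {
      Enabled: true,
      SchemaConfiguration: {
        RoleARN: FIREHOSE_ROLE_ARN,
        DatabaseName: 'cloudain_analytics',
        TableName: 'ai_telemetry',
        Region: 'us-east-1',
        CatalogId: AWS_ACCOUNT_ID
      },
      InputFormatConfiguration: {
        Deserializer: { OpenXJsonSerDe: {} } // incoming records are JSON
      },
      OutputFormatConfiguration: {
        Serializer: { ParquetSerDe: {} }
      }
    }
  }
}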
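The `year=/month=/day=` prefix produces Hive-style partitions, so downstream jobs often need to compute the same path for a given timestamp, for example to locate the objects for one day. A small sketch of that mapping, assuming the partition values are derived in UTC (`partitionPrefix` is a hypothetical helper):

```typescript
// Build the S3 key prefix for the day that contains the given epoch time.
function partitionPrefix(epochMs: number): string {
  const d = new Date(epochMs);
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const year = d.getUTCFullYear();
  const month = pad(d.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const day = pad(d.getUTCDate());
  return `ai-telemetry/year=${year}/month=${month}/day=${day}/`;
}

// Timestamp taken from the example compliance event.
const examplePrefix = partitionPrefix(1705910400000);
```

Note that Firehose stamps records by arrival time rather than by the `timestamp` field inside the payload, so events that arrive close to midnight can land in the neighbouring day's partition.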
Analytics Queries
-- User engagement patterns
SELECT
  brand,
  intent,
  COUNT(*) as interaction_count,
  AVG(user_satisfaction) as avg_satisfaction,
  AVG(total_latency) as avg_latency_ms,
  SUM(cost) as total_cost
FROM ai_telemetry
WHERE date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY brand, intent
ORDER BY interaction_count DESC;

-- Model performance comparison
SELECT
  model_id,
  AVG(inference_latency) as avg_latency,
  AVG(output_tokens) as avg_output_length,
  AVG(cost) as avg_cost,
  SUM(CASE WHEN errors IS NOT NULL THEN 1 ELSE 0 END) as error_count
FROM ai_telemetry
WHERE date = CURRENT_DATE
GROUP BY model_id;

-- Cost trends
-- The grouping expression is repeated because Athena does not accept a SELECT alias in GROUP BY
SELECT
  DATE_TRUNC('hour', timestamp) as hour,
  brand,
  SUM(cost) as hourly_cost,
  COUNT(*) as conversation_count,
  SUM(cost) / COUNT(*) as cost_per_conversation
FROM ai_telemetry
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY DATE_TRUNC('hour', timestamp), brand
ORDER BY hour DESC;
Privacy-Preserving Observability
PII Redaction in Logs
// Redact PII before logging
async function logWithRedaction(event: any) {
  const redacted = await redactPII(event)
  logger.info('AI interaction', redacted)

  // Store PII mapping separately (encrypted, short TTL)
  if (event.containsPII) {
    await CoreCloud.storePIIMapping({
      requestId: event.requestId,
      mapping: event.piiMapping,
      ttl: 300 // 5 minutes
    })
  }
}
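The `redactPII` call above is not shown in this article. As a rough illustration of the idea only, the sketch below masks two easy-to-spot patterns in free text with regular expressions. The patterns and the name `redactText` are hypothetical, and regexes alone will miss many forms of PII; production systems typically rely on a dedicated detection service.

```typescript
interface RedactionResult {
  text: string;    // input with matches replaced by [LABEL] placeholders
  found: string[]; // labels of the patterns that matched, in match order
}

// Illustrative patterns only: email addresses and 13-16 digit card-like numbers.
const PII_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "CARD", pattern: /\b\d(?:[ -]?\d){12,15}\b/g },
];

function redactText(input: string): RedactionResult {
  let text = input;
  const found: string[] = [];
  for (const { label, pattern } of PII_PATTERNS) {
    text = text.replace(pattern, () => {
      found.push(label);
      return `[${label}]`;
    });
  }
  return { text, found };
}
```

Returning the matched labels alongside the text makes it easy to populate `piiDetected` and `piiRedacted` in the telemetry without ever logging the original values.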
Aggregated Metrics Only
// Don't log individual user queries
// ✗ BAD:
logger.info('User asked: "What is my credit card number?"')

// ✓ GOOD:
logger.info('Intent classified', {
  intent: 'AccountInquiry',
  confidence: 0.89,
  piiDetected: true,
  piiRedacted: true
  // No actual query text logged
})
Conclusion
AI observability transforms black boxes into transparent, auditable systems. By implementing comprehensive telemetry, compliance dashboards, and real-time monitoring, Cloudain built AI that:
Engineers Trust:
- Complete visibility into performance
- Fast troubleshooting with distributed tracing
- Proactive alerts prevent outages
Compliance Teams Trust:
- 100% audit trail coverage
- Real-time compliance monitoring
- Automated evidence for SOC 2/HIPAA/GDPR
Business Leaders Trust:
- Clear cost attribution
- User satisfaction tracking
- ROI measurement
Key principles:
- Instrument everything from intent to response
- Aggregate for privacy while preserving insights
- Stream to multiple sinks (CloudWatch, S3, analytics)
- Alert intelligently using anomaly detection
- Make it searchable with structured logs
Results:
- <3 min mean time to detection (MTTD)
- <10 min mean time to resolution (MTTR)
- 100% compliance audit pass rate
- 24/7 visibility into AI operations
Observability isn't optional; it's the foundation of responsible AI.
Cloudain SRE Team
