Building GDPR-Compliant AI from Scratch: Our Technical Approach
Why We Built Everything from Scratch
When we started Garnet AI, we had a choice: use existing AI APIs (OpenAI, Google, etc.) or build our own proprietary engine. We chose the harder path — and here's why.
The Problem with Third-Party AI APIs
Using third-party AI APIs for compliance document processing creates several issues:
- Data sovereignty: Vendor compliance documents contain sensitive information. Sending them to US-hosted APIs means data leaves the EU jurisdiction.
- GDPR compliance: Under GDPR, processing personal data outside the EU requires specific legal mechanisms (SCCs, adequacy decisions). For sensitive compliance data, this creates unnecessary risk.
- Data retention: Most API providers retain input data for model improvement. For compliance documents, this is unacceptable.
- Audit trail: Regulators want to know exactly how data was processed. With third-party APIs, you have limited visibility into the processing pipeline.
Our Architecture
Garnet's AI stack is built on three proprietary components:
1. OCR Engine
Our OCR engine is purpose-built for compliance documents. Unlike general-purpose OCR (Tesseract, Google Vision), it understands:
- Multi-column layouts common in audit reports
- Table structures in control matrices
- Watermarks and redactions without misinterpreting them
- Low-quality scans from documents that have been printed, signed, and re-scanned
On our compliance document benchmarks, it reaches 99.4% character accuracy.
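To make the layout-awareness concrete, here is a minimal sketch of how a layout-aware OCR pass might represent its output so that watermarks and redactions are kept separate from readable content. All names here (Region, OcrPage, body_text) are illustrative, not Garnet's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical region types a layout-aware OCR pass might emit.
@dataclass
class Region:
    kind: str          # "column", "table", "watermark", or "redaction"
    text: str          # recognized text ("" for redactions)
    confidence: float  # per-region character confidence, 0.0-1.0

@dataclass
class OcrPage:
    number: int
    regions: list[Region] = field(default_factory=list)

    def body_text(self) -> str:
        """Join readable regions, skipping watermarks and redactions
        so they are not misread as document content."""
        return "\n".join(
            r.text for r in self.regions
            if r.kind in ("column", "table") and r.text
        )

page = OcrPage(1, [
    Region("column", "Control CC6.1 was operating effectively.", 0.99),
    Region("watermark", "CONFIDENTIAL", 0.95),
    Region("redaction", "", 1.0),
])
print(page.body_text())  # watermark and redaction are excluded
```

The point of the explicit `kind` field is that downstream analysis can ignore watermark text instead of treating "CONFIDENTIAL" as part of the report body.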
2. Compliance AI Model
Our AI model is trained specifically on compliance document structures:
- SOC 2 Type I and Type II reports
- ISO 27001 certificates and statements of applicability
- Penetration test reports (various frameworks)
- Data Processing Agreements
- Bridge letters and management assertions
The model understands context, not just keywords. It knows that "qualified opinion" in the auditor's opinion section of a SOC 2 report has very different implications from "qualified" in a job description.
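As a toy sketch of that difference: a naive keyword matcher flags every occurrence of "qualified", while a section-aware check only treats it as an exception in the auditor's opinion. The section names and the one-line rule below are hypothetical illustrations, not the model's actual logic:

```python
# Illustrative comparison: bare keyword search vs. section-aware matching.

def keyword_flags(sections: dict[str, str]) -> list[str]:
    """Naive approach: flag any section containing 'qualified'."""
    return [name for name, text in sections.items()
            if "qualified" in text.lower()]

def context_flags(sections: dict[str, str]) -> list[str]:
    """Context-aware: 'qualified' only signals an exception when it
    modifies the opinion in the auditor's opinion section."""
    return [name for name, text in sections.items()
            if name == "auditor_opinion"
            and "qualified opinion" in text.lower()]

report = {
    "auditor_opinion": "In our opinion ... we express a qualified opinion.",
    "system_description": "Staff are qualified security professionals.",
}
print(keyword_flags(report))  # both sections match: one false positive
print(context_flags(report))  # only the genuine exception
```

A real model learns this distinction statistically rather than from a hand-written rule, but the failure mode it avoids is exactly the false positive the keyword matcher produces here.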
3. EU-Sovereign Infrastructure
All processing happens on EU-hosted infrastructure:
- Zero data retention: Documents are processed in-memory and purged immediately
- No external API calls: Everything runs on our own infrastructure
- Full audit trail: Every processing step is logged for regulatory reporting
- Encryption in transit and at rest: AES-256 encryption throughout
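The zero-retention and audit-trail points above can be sketched together: log what happened to a document (identified only by a content hash) without ever persisting the document itself. This is a minimal illustration under those assumptions; function names and the stand-in analysis result are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone

# Minimal sketch of a zero-retention processing step with an audit trail.
AUDIT_LOG: list[dict] = []

def log_step(step: str, doc_digest: str) -> None:
    """Record what happened and when, but never the document content."""
    AUDIT_LOG.append({
        "step": step,
        "doc_sha256": doc_digest,  # content hash, not content
        "at": datetime.now(timezone.utc).isoformat(),
    })

def process_document(data: bytes) -> dict:
    digest = hashlib.sha256(data).hexdigest()
    log_step("received", digest)
    result = {"pages": 1, "exceptions": []}  # stand-in for real analysis
    log_step("analyzed", digest)
    del data                   # drop the in-memory reference;
    log_step("purged", digest) # nothing was ever written to disk
    return result

result = process_document(b"%PDF-1.7 example bytes")
print(json.dumps(AUDIT_LOG, indent=2))
```

The audit log can be handed to a regulator: it shows every processing step and timestamp, and the SHA-256 digest proves which document was processed without retaining a byte of it.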
The Trade-offs
Building proprietary means:
- Slower initial development: We spent months building what an API integration could have delivered in days
- Higher infrastructure costs: Running our own GPU clusters isn't cheap
- Smaller model: Our model is far smaller than GPT-4, but it is more accurate on compliance documents
The trade-off is worth it. Our customers' data never leaves the EU, we have full control over the processing pipeline, and we can provide complete audit trails to regulators.
What's Next
We're continuously improving our models with structured feedback from alpha users. Every false positive and missed exception makes the system more accurate.
The goal: 99%+ exception detection rate with zero data leaving the EU.