The AI doesn't need everything in the file — we send only what it actually needs.
PortEden's context-hygiene engine strips hidden metadata, dedupes overlapping sources, and prunes off-topic content from every context bundle before it reaches Claude, ChatGPT, Copilot, or Gemini. Data minimization for AI — mapped to GDPR Art. 5(1)(c) and HIPAA Minimum Necessary, with per-request audit evidence.
Free tier · No credit card · Works with any AI client
Files carry far more than what's visible on screen.
Even when access control allows the file and redaction masks the identifiers, the context bundle the AI receives is loaded with hidden metadata, duplicate sources, and off-topic content. The visible body is a small fraction of what actually ships to Claude, ChatGPT, Copilot, or Gemini. Every byte is a billing line, a latency cost, and a leak surface.
Word, Excel, and PDF files carry their history
Track changes, revision authors, embedded comments, and previous edits all flow into the model alongside the visible content. The AI doesn't know the difference between today's draft and last quarter's rejected pricing.
Email threads quote everything
Every reply quotes the entire prior thread; uploading one message uploads months of conversation, including off-topic asides, internal disagreements, and the customer complaint from three quarters ago that's no longer relevant.
Spreadsheets have hidden columns and rows
The visible sheet looks fine; the data the AI ingests includes the hidden "salary" column and last quarter's draft pricing. Nothing in the file viewer warns you, and nothing in the AI client warns you either.
One file, one prompt — two very different bundles.
The same Word document attached to the same prompt. On the left, what you'd send if nothing minimized it. On the right, what PortEden actually transmits to the model.
What would ship without minimization:
- Original document body: 1 file
- Tracked changes: +14 edits
- Revision history: +8 prior versions
- Comments thread: +12 comments
- Author metadata (names, emails, locations): +6 identities
- EXIF (geolocation, device IDs): +4 fields
- Hidden columns and sheets: +3 hidden
- Email-thread quotes (prior replies): +5 quoted
- Tracking pixels: +2 beacons

What PortEden transmits:
- Document body, current revision: 1 file
- Comments tagged @ai-include: 2 kept
- Tracked changes: 0
- Prior revisions: 0
- Author metadata: 0
- EXIF or geolocation: 0
- Hidden cells: 0
- Immediate email message only: 1
- Tracking pixels: 0
85% smaller, 100% relevant. Faster, cheaper, less to leak.
Six categories of context, all cleaned at the boundary.
Format-aware parsers know exactly where the metadata hides. Cross-source deduplication and a topic classifier handle what the parsers can't see. Every category applies uniformly across every integration and every AI client.
Document metadata
What hides in a Word, PowerPoint, or PDF file?
A 4-page proposal carries the author's full name, last 8 revisions, 12 comments from internal review, and an embedded font catalog.
- Author names and corporate identities
- Last-modified timestamps and edit duration
- Revision history (every prior version)
- Tracked changes (every accepted and rejected edit)
- Comment threads and their authors
- Document properties and custom XML
- Embedded font names and license metadata
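To make the idea concrete: in a .docx, author identity lives in the docProps/core.xml part inside the file's zip container. This is an illustrative stdlib sketch of blanking those fields, not PortEden's actual parser; the sample XML and function name are ours, the two namespaces are the standard OOXML ones.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part (docProps/core.xml) of a .docx.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def strip_core_authors(core_xml: str) -> str:
    """Blank author-identifying fields in a core.xml fragment."""
    root = ET.fromstring(core_xml)
    for tag in ("dc:creator", "cp:lastModifiedBy"):
        el = root.find(tag, NS)
        if el is not None:
            el.text = ""  # keep the element, drop the identity
    return ET.tostring(root, encoding="unicode")

sample = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>Jane Doe</dc:creator>"
    "<cp:lastModifiedBy>Jane Doe</cp:lastModifiedBy>"
    "</cp:coreProperties>"
)
cleaned = strip_core_authors(sample)
```

A production parser would do the same across revision XML, comments parts, and custom properties before rezipping the package.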
Email metadata
What ships with an email upload?
A single forwarded message brings 5 quoted prior emails, 2 tracking pixels, the full recipient chain, and 14 X-headers naming internal mail servers.
- Tracking pixels and read-receipt beacons
- Prior thread quotes (every nested reply)
- Full recipient chains (To, Cc, Bcc, Resent-To)
- X-headers naming internal infrastructure
- Message IDs and threading metadata
- Reply-to chains and routing history
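The email side of this reduces to two moves: drop the X-* routing headers and drop the ">"-prefixed quoted thread. A minimal sketch with Python's stdlib email package (the sample message is invented; real MIME handling is more involved):

```python
from email import message_from_string

def strip_email_metadata(raw: str) -> str:
    """Drop X-* headers and quoted prior-thread lines, keep the immediate message."""
    msg = message_from_string(raw)
    for name in [h for h in msg.keys() if h.lower().startswith("x-")]:
        del msg[name]  # removes every occurrence of that header
    body = msg.get_payload()
    kept = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    msg.set_payload("\n".join(kept))
    return msg.as_string()

raw = (
    "From: a@example.com\n"
    "X-Mailer: InternalRelay/2.1\n"
    "Subject: Q3 issue\n"
    "\n"
    "Latest reply about the customer issue.\n"
    "> quoted text from three quarters ago\n"
)
clean = strip_email_metadata(raw)
```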
Spreadsheet hygiene
What does a spreadsheet bring that you didn't intend?
An Excel file with one visible tab carries 3 hidden sheets, 6 hidden columns, formulas referencing a fileshare path, and 47 cell comments left from review.
- Hidden rows, columns, and entire worksheets
- Formulas that leak source paths and shares
- Cell comments and threaded discussions
- Conditional-formatting rules and named ranges
- Pivot-cache remnants from prior queries
- Custom-defined names referencing other workbooks
- External-data connection strings
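Hidden columns are not a rendering trick; they are flagged right in the xlsx worksheet XML as `hidden="1"` on `<col>` ranges. An illustrative detection sketch (sample XML is ours; a real parser would also check row and sheet visibility):

```python
import xml.etree.ElementTree as ET

# Default namespace for SpreadsheetML worksheet parts.
SSNS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def hidden_columns(sheet_xml: str) -> list[int]:
    """Return 1-based column indices flagged hidden in a worksheet part."""
    root = ET.fromstring(sheet_xml)
    hidden = []
    for col in root.iter(f"{SSNS}col"):
        if col.get("hidden") == "1":
            hidden.extend(range(int(col.get("min")), int(col.get("max")) + 1))
    return hidden

sample = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<cols><col min="1" max="1"/><col min="3" max="4" hidden="1"/></cols>'
    "</worksheet>"
)
# columns 3 and 4 are hidden -- "salary column" territory
```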
PDF & image hygiene
What's hiding in a PDF or photo?
A scanned PDF carries device IDs, GPS coordinates from the phone that captured it, and "black-rectangle" redactions that aren't actually flattened.
- EXIF blocks (camera, lens, software)
- Geolocation (latitude, longitude, altitude)
- Device identifiers and serial numbers
- Embedded JavaScript and form scripts
- PDF form metadata and field history
- Redaction artifacts that aren't truly flattened
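For JPEGs specifically, EXIF (including GPS) lives in the APP1 segment, so stripping it means walking the marker list and dropping APP1 before the image data. A simplified sketch, with a synthetic byte stream as input (real files can carry several APP1 payloads, e.g. XMP; this drops them all):

```python
def strip_jpeg_exif(data: bytes) -> bytes:
    """Remove APP1 (0xFFE1: EXIF/GPS/XMP) segments from a JPEG byte stream."""
    assert data[:2] == b"\xff\xd8", "not a JPEG"
    out, i = bytearray(data[:2]), 2
    while i + 4 <= len(data) and data[i] == 0xFF:
        marker = data[i + 1]
        if marker == 0xDA:  # start-of-scan: copy the compressed image verbatim
            out += data[i:]
            return bytes(out)
        seg_len = int.from_bytes(data[i + 2:i + 4], "big")  # includes its own 2 bytes
        if marker != 0xE1:  # keep every segment except APP1
            out += data[i:i + 2 + seg_len]
        i += 2 + seg_len
    out += data[i:]
    return bytes(out)

# Synthetic JPEG: SOI, APP1 (EXIF + GPS bytes), APP0 (JFIF), SOS.
sample = (
    b"\xff\xd8"
    + b"\xff\xe1\x00\x0f" + b"Exif\x00\x00GPSDATA"
    + b"\xff\xe0\x00\x07" + b"JFIF\x00"
    + b"\xff\xda\x00\x0cimagedata"
)
```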
Source dedup
What overlaps when multiple integrations are in play?
A Drive folder, a Slack channel, and a wiki page all contain the same Q3 plan paragraph in slightly different drafts. The AI gets it three times.
- Cross-integration paragraph similarity matching
- Near-duplicate collapse (≥85% similarity threshold)
- Canonical-instance selection by recency and source authority
- Citation paths preserved on collapsed duplicates
- Quote-of-quote detection across email and chat
- Snippet-level dedup inside long documents
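Near-duplicate collapse of this kind is classically done with MinHash over word shingles: near-identical paragraphs get near-identical signatures, and the ≥85% threshold becomes a cut on the estimated Jaccard similarity. A toy sketch; the shingle size, 64 permutations, and blake2b salting are illustrative choices, not PortEden's parameters:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-shingles for paragraph similarity."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh: set[str], n: int = 64) -> list[int]:
    """n-permutation MinHash signature via seed-salted hashing."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in sh
        )
        for seed in range(n)
    ]

def est_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

p1 = "The Q3 plan targets enterprise expansion in EMEA with two new SKUs"
p2 = "The Q3 plan targets enterprise expansion in EMEA with two new SKUs."
p3 = "Lunch menu for the offsite next Friday"
s1, s2, s3 = (minhash(shingles(p)) for p in (p1, p2, p3))
```

Here s1 and s2 score well above the collapse threshold while s3 scores near zero, so the two drafts collapse to one canonical instance and the lunch note stays separate.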
Stale-context pruning
What is no longer in scope for this prompt?
An email thread spans 8 months of unrelated topics; only the last 3 weeks of replies are about the customer issue the user is asking about.
- Topic classifier scored against the user prompt
- Per-integration recency windows (e.g., Slack 30d, wiki 12mo)
- Time-decayed weighting on borderline paragraphs
- Out-of-policy-window content dropped automatically
- @ai-include directive overrides for canonical content
- Drop reasoning logged for every paragraph
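The keep/drop decision above can be sketched as one predicate: pinned content always survives, everything else must be inside the recency window and sufficiently on-topic. The overlap scorer and thresholds below are deliberately naive stand-ins for a real topic classifier:

```python
from datetime import datetime, timedelta

def keep_paragraph(text, prompt, ts, now, window_days=30,
                   min_overlap=0.2, pinned=False):
    """Keep if @ai-include pinned, or (inside the window AND on-topic)."""
    if pinned:
        return True  # @ai-include overrides pruning
    if now - ts > timedelta(days=window_days):
        return False  # outside the policy recency window
    prompt_tokens = set(prompt.lower().split())
    text_tokens = set(text.lower().split())
    score = len(prompt_tokens & text_tokens) / max(1, len(prompt_tokens))
    return score >= min_overlap
```

In the real engine, the rule name and score for each decision would be logged alongside the verdict.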
Strip. Dedupe. Prune. Minimize.
1. Strip
Each file is parsed by a format-aware parser (Word, Excel, PDF, MSG, source code) that knows where the metadata hides. Revision XML, hidden sheets, tracking pixels, EXIF blocks, and embedded scripts are removed; only the visible-content layer survives.
2. Dedupe
Paragraphs across every source in the bundle are compared with MinHash + small-embedding cosine similarity. Near-duplicates collapse to one canonical instance with citation paths preserved, so the model can still attribute without re-reading the same content.
3. Prune
A topic classifier scores each surviving paragraph against the user's prompt; a recency filter applies the policy time window. Off-topic paragraphs and out-of-window content drop, with the rule name and score recorded for every decision.
4. Minimize
A final pass trims the bundle to the configured token budget using priority weights (recency, relevance, @ai-include directives) so the most-relevant content survives first. The result is a tight, on-topic bundle the model can actually use.
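The four steps end in a priority-weighted pack against the token budget. A minimal greedy sketch; the decay curve, weights, and whitespace token proxy are assumptions for illustration, not PortEden's scoring:

```python
def minimize(paragraphs, budget_tokens):
    """Greedy pack: highest-priority paragraphs first until the budget is spent.

    Each paragraph is (text, relevance 0-1, age_days, pinned).
    """
    def priority(p):
        text, relevance, age_days, pinned = p
        recency = 1.0 / (1.0 + age_days / 30.0)  # assumed 30-day decay curve
        return (2.0 if pinned else 0.0) + relevance + recency

    kept, used = [], 0
    for p in sorted(paragraphs, key=priority, reverse=True):
        cost = len(p[0].split())  # crude whitespace token proxy
        if used + cost <= budget_tokens:
            kept.append(p[0])
            used += cost
    return kept
```

With a tight budget, a pinned @ai-include snippet beats a merely relevant one, and stale low-relevance content is the first to fall out.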
A real prompt, a real bundle, a real diff.
A user asks Claude to summarize the Q3 deal pipeline with a Drive folder attached. Here's what would have shipped to the model — and here's what actually did, after hygiene ran.
"Summarize Q3 deal pipeline."
Attached: /Drive/Sales/Q3-pipeline/ (12 files, last modified within 90 days)
Before hygiene:
- Files: 12
- Total tokens: 47,200
- Revision-history entries: 31
- Comments: 84
- Author identities: 22
- Hidden columns: 6

After hygiene:
- Files: 12
- Total tokens: 6,840
- Revision-history entries: 0
- @ai-include-tagged comments: 9
- Author identities: 0
- Hidden cells: 0
- Removed: 18 tracked changes from a drafted-and-rejected pricing tab — Reason: tracked-changes-strip rule
- Removed: 5 prior revisions of "Q3-plan-FINAL-v3.docx" — Reason: revision-history-strip rule
- Removed: 73 comments with no @ai-include tag — Reason: comment-prune rule (default)
- Removed: EXIF + geolocation on 4 customer-supplied PDFs — Reason: pdf-metadata-strip rule
The same prompt, two very different bundles.
Citations, not vague reassurances.
Each hygiene control maps to a specific clause in the framework your auditor is reading. Evidence is exportable from the audit trail with the rule that fired and a hash of the dropped content for every decision.
Every source the AI pulls context from.
One hygiene engine, six regulated workflows.
Format-aware, composable, and auditable.
Generic regex misses revision XML, hidden sheets, and EXIF blocks. PortEden ships a format-aware parser for every major file type, composes cleanly with the redaction engine, and produces a per-decision audit record so an auditor never has to ask "what was sent and what wasn't?"
Format-aware parsers
Word, Excel, PDF, Outlook MSG, Markdown, source code each get a parser that knows where the metadata hides. Generic regex doesn't catch revision XML, hidden sheets, or PDF form scripts.
Composable with redaction
Context hygiene strips structural metadata; the redaction engine masks identifiers in the visible content. Combined, they cover both layers — the metadata you didn't see and the values you did.
Auditable minimization
Every dropped section is logged with the rule that fired and a hash of the original. Auditors get a precise, per-request answer to "what was sent and what wasn't?" without reconstructing it.
Pairs well with
Context hygiene questions
What is AI context hygiene and how is it different from redaction?
What kinds of metadata get stripped from a context bundle?
Will the AI lose useful context when this runs?
How does it handle file formats I haven't listed?
Can I tag content as @ai-include to keep it through pruning?
Does it dedupe across multiple integrations?
How is "stale" or "off-topic" content decided?
Will my AI output be smaller, faster, and cheaper as a result?
What evidence does this produce for GDPR Art. 5(1)(c) data-minimisation auditors?
What evidence does this produce for HIPAA Minimum Necessary auditors?
Are hygiene actions logged?
What pricing tier includes context hygiene?
Send less to the model. Get faster, cheaper, safer answers.
Set up context hygiene in under 10 minutes. Free tier covers solo users; Enterprise adds SSO/SAML, SCIM, change-control workflows, and SIEM export of every hygiene decision.