Product · AI Context Hygiene

The AI doesn't need everything in the file — we send only what it actually needs.

PortEden's context-hygiene engine strips hidden metadata, dedupes overlapping sources, and prunes off-topic content from every context bundle before it reaches Claude, ChatGPT, Copilot, or Gemini. Data minimization for AI — mapped to GDPR Art. 5(1)(c) and HIPAA Minimum Necessary, with per-request audit evidence.

See pricing

Free tier · No credit card · Works with any AI client

Mapped to the frameworks your auditor reads
GDPR Art. 5(1)(c) · HIPAA Minimum Necessary · GDPR Art. 25 · SOC 2 CC6.7 · ISO 27001 A.5.34 · CCPA / CPRA · NIST 800-53 SC-28
The problem

Files carry far more than what's visible on screen.

Even when access control allows the file and redaction masks the identifiers, the context bundle the AI receives is loaded with hidden metadata, duplicate sources, and off-topic content. The visible body is a small fraction of what actually ships to Claude, ChatGPT, Copilot, or Gemini. Every byte is a billing line, a latency cost, and a leak surface.

Word, Excel, and PDF files carry their history

Track changes, revision authors, embedded comments, and previous edits all flow into the model alongside the visible content. The AI doesn't know the difference between today's draft and last quarter's rejected pricing.

Email threads quote everything

Every reply quotes the entire prior thread; uploading one message uploads months of conversation, including off-topic asides, internal disagreements, and the customer complaint from three quarters ago that's no longer relevant.
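As a rough illustration of what quote-chain stripping looks like, here is a simplified heuristic in Python. This is a sketch only, not PortEden's actual parser; a production engine works on the MIME structure and header metadata rather than line patterns, and the `On ... wrote:` marker is just one common attribution format.

```python
import re

def strip_quoted_thread(body: str) -> str:
    """Keep only the immediate message: drop '>'-prefixed quoted lines
    and everything below a reply-attribution marker such as
    'On ... wrote:'. Illustrative heuristic only."""
    kept = []
    for line in body.splitlines():
        # Stop at the attribution line that introduces the quoted thread.
        if re.match(r"^On .+ wrote:\s*$", line):
            break
        # Skip individual quoted lines.
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Run on a reply that quotes two levels of prior thread, only the new sentence survives.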

Spreadsheets have hidden columns and rows

The visible sheet looks fine; the data the AI ingests includes the hidden "salary" column and last quarter's draft pricing. Nothing in the file viewer warns you, and nothing in the AI client warns you either.
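The core of hidden-cell hygiene is simple once the sheet is parsed: filter on the visibility flags the file format already carries. The sketch below assumes a simplified parsed form (a `cells` dict plus `hidden_rows`/`hidden_cols` sets); a real XLSX parser would read these flags from the sheet XML.

```python
def visible_cells(sheet):
    """Return only the cells a human would see in the viewer.
    `sheet` is a simplified parsed form: {'cells': {(row, col): value},
    'hidden_rows': set, 'hidden_cols': set}, a stand-in for what a
    real XLSX parser would produce from the sheet XML."""
    return {
        (r, c): v
        for (r, c), v in sheet["cells"].items()
        if r not in sheet["hidden_rows"] and c not in sheet["hidden_cols"]
    }
```

The hidden "salary" column simply never enters the bundle, because the filter runs before any text extraction.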

The firewall in action

One file, one prompt — two very different bundles.

The same Word document attached to the same prompt. On the left, what you'd send if nothing minimized it. On the right, what PortEden actually transmits to the model.

What you'd send
Bloated
  • Original document body: 1 file
  • Tracked changes: +14 edits
  • Revision history: +8 prior versions
  • Comments thread: +12 comments
  • Author metadata (names, emails, locations): +6 identities
  • EXIF (geolocation, device IDs): +4 fields
  • Hidden columns and sheets: +3 hidden
  • Email-thread quotes (prior replies): +5 quoted
  • Tracking pixels: +2 beacons
Total: ~12,400 tokens
What PortEden sends
Minimized
  • Document body, current revision: 1 file
  • Comments tagged @ai-include: +2 kept
  • No tracked changes: 0
  • No prior revisions: 0
  • No author metadata: 0
  • No EXIF or geolocation: 0
  • No hidden cells: 0
  • Immediate email message only: 1
  • No tracking pixels: 0
Total: ~1,800 tokens

85% smaller, 100% relevant. Faster, cheaper, less to leak.

Coverage

Six categories of context, all cleaned at the boundary.

Format-aware parsers know exactly where the metadata hides. Cross-source deduplication and a topic classifier handle what the parsers can't see. Every category applies uniformly across every integration and every AI client.

Document metadata

What hides in a Word, PowerPoint, or PDF file?

A 4-page proposal carries the author's full name, last 8 revisions, 12 comments from internal review, and an embedded font catalog.

  • Author names and corporate identities
  • Last-modified timestamps and edit duration
  • Revision history (every prior version)
  • Tracked changes (every accepted and rejected edit)
  • Comment threads and their authors
  • Document properties and custom XML
  • Embedded font names and license metadata

Email metadata

What ships with an email upload?

A single forwarded message brings 5 quoted prior emails, 2 tracking pixels, the full recipient chain, and 14 X-headers naming internal mail servers.

  • Tracking pixels and read-receipt beacons
  • Prior thread quotes (every nested reply)
  • Full recipient chains (To, Cc, Bcc, Resent-To)
  • X-headers naming internal infrastructure
  • Message IDs and threading metadata
  • Reply-to chains and routing history

Spreadsheet hygiene

What does a spreadsheet bring that you didn't intend?

An Excel file with one visible tab carries 3 hidden sheets, 6 hidden columns, formulas referencing a fileshare path, and 47 cell comments left from review.

  • Hidden rows, columns, and entire worksheets
  • Formulas that leak source paths and shares
  • Cell comments and threaded discussions
  • Conditional-formatting rules and named ranges
  • Pivot-cache remnants from prior queries
  • Custom-defined names referencing other workbooks
  • External-data connection strings

PDF & image hygiene

What's hiding in a PDF or photo?

A scanned PDF carries device IDs, GPS coordinates from the phone that captured it, and "black-rectangle" redactions that aren't actually flattened.

  • EXIF blocks (camera, lens, software)
  • Geolocation (latitude, longitude, altitude)
  • Device identifiers and serial numbers
  • Embedded JavaScript and form scripts
  • PDF form metadata and field history
  • Redaction artifacts that aren't truly flattened

Source dedup

What overlaps when multiple integrations are in play?

A Drive folder, a Slack channel, and a wiki page all contain the same Q3 plan paragraph in slightly different drafts. The AI gets it three times.

  • Cross-integration paragraph similarity matching
  • Near-duplicate collapse (≥85% similarity threshold)
  • Canonical-instance selection by recency and source authority
  • Citation paths preserved on collapsed duplicates
  • Quote-of-quote detection across email and chat
  • Snippet-level dedup inside long documents

Stale-context pruning

What is no longer in scope for this prompt?

An email thread spans 8 months of unrelated topics; only the last 3 weeks of replies are about the customer issue the user is asking about.

  • Topic classifier scored against the user prompt
  • Per-integration recency windows (e.g., Slack 30d, wiki 12mo)
  • Time-decayed weighting on borderline paragraphs
  • Out-of-policy-window content dropped automatically
  • @ai-include directive overrides for canonical content
  • Drop reasoning logged for every paragraph
How it works

Strip. Dedupe. Prune. Minimize.

1. Strip

Each file is parsed by a format-aware parser (Word, Excel, PDF, MSG, source code) that knows where the metadata hides. Revision XML, hidden sheets, tracking pixels, EXIF blocks, and embedded scripts are removed; only the visible-content layer survives.
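To make the Word case concrete: in WordprocessingML, rejected deletions live in `w:del` elements and pending insertions in `w:ins` wrappers. The sketch below strips a simplified, namespace-free fragment with the standard library; PortEden's actual DOCX parser handles the full namespaced schema, but the shape of the operation is the same.

```python
import xml.etree.ElementTree as ET

def strip_tracked_changes(xml_text: str) -> str:
    """Drop pending/rejected edits from a (simplified) WordprocessingML
    fragment: <del> elements are removed outright, <ins> wrappers are
    unwrapped so accepted insertions keep their runs. Real DOCX XML is
    namespaced (w:del, w:ins); this sketch omits namespaces for clarity."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):          # snapshot before mutating
        for child in list(parent):
            if child.tag == "del":
                parent.remove(child)          # deleted text never ships
            elif child.tag == "ins":
                idx = list(parent).index(child)
                parent.remove(child)
                for grand in reversed(list(child)):
                    parent.insert(idx, grand) # keep the inserted runs
    return ET.tostring(root, encoding="unicode")
```

After this pass, "last quarter's rejected pricing" in a `<del>` block is simply absent from the extracted text.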

2. Dedupe

Paragraphs across every source in the bundle are compared with MinHash + small-embedding cosine similarity. Near-duplicates collapse to one canonical instance with citation paths preserved, so the model can still attribute without re-reading the same content.
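A minimal MinHash sketch shows how near-duplicate detection works without comparing every paragraph pair in full. This is illustrative, pure-stdlib code; the shingle size, 64 hash functions, and SHA-1-based seeded hashing are assumptions of this sketch, not PortEden's tuned parameters, and the production path pairs MinHash with embedding cosine similarity.

```python
import hashlib

def shingles(text: str, k: int = 3):
    """Word k-grams; the unit of overlap for similarity estimation."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64):
    """One minimum per seeded hash function; the fraction of equal
    positions between two signatures estimates Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_similarity(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Paragraph pairs scoring at or above the collapse threshold (the page cites ≥85%) reduce to one canonical instance.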

3. Prune

A topic classifier scores each surviving paragraph against the user's prompt; a recency filter applies the policy time window. Off-topic paragraphs and out-of-window content drop, with the rule name and score recorded for every decision.
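The two prune filters compose like this. The sketch uses naive term overlap as a stand-in for the real topic classifier, and the threshold value is an assumption; what matters is the shape: recency gate first, topic score second, and every drop returned with the rule name that caused it.

```python
from datetime import datetime, timedelta, timezone

def prune(paragraphs, prompt_terms, window_days, min_score=0.2, now=None):
    """Keep paragraphs inside the recency window that also score above
    the topic threshold. Scoring is naive term overlap, a stand-in for
    the real classifier; every drop carries its rule name for the log."""
    now = now or datetime.now(timezone.utc)
    kept, dropped = [], []
    for p in paragraphs:
        if now - p["modified_at"] > timedelta(days=window_days):
            dropped.append((p, "recency-window"))
            continue
        words = set(p["text"].lower().split())
        score = len(words & prompt_terms) / max(1, len(prompt_terms))
        if score < min_score:
            dropped.append((p, f"topic-threshold score={score:.2f}"))
        else:
            kept.append(p)
    return kept, dropped
```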

4. Minimize

A final pass trims the bundle to the configured token budget using priority weights (recency, relevance, @ai-include directives) so the most-relevant content survives first. The result is a tight, on-topic bundle the model can actually use.
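The budget pass can be sketched as a greedy selection over priority weights. The weight ordering (directive over relevance over recency) follows the description above; treating token counts as precomputed and keeping whole paragraphs are simplifications of this sketch.

```python
def minimize(paragraphs, budget):
    """Greedy budget pass: sort by priority weight (@ai-include beats
    relevance beats recency) and keep whole paragraphs until the token
    budget is spent. Token counts are assumed precomputed upstream."""
    def weight(p):
        # bool sorts as 0/1, so directive-tagged content always wins.
        return (p.get("ai_include", False), p["relevance"], p["recency"])
    kept, used = [], 0
    for p in sorted(paragraphs, key=weight, reverse=True):
        if used + p["tokens"] <= budget:
            kept.append(p)
            used += p["tokens"]
    return kept, used
```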

See it in action

A real prompt, a real bundle, a real diff.

A user asks Claude to summarize the Q3 deal pipeline with a Drive folder attached. Here's what would have shipped to the model — and here's what actually did, after hygiene ran.

The user prompt
With Drive folder attached

"Summarize Q3 deal pipeline."

Attached: /Drive/Sales/Q3-pipeline/ (12 files, last modified within 90 days)

Input bundle (raw)
Pre-hygiene
  • Files: 12
  • Total tokens: 47,200
  • Revision-history entries: 31
  • Comments: 84
  • Author identities: 22
  • Hidden columns: 6
After-hygiene bundle
Sent to model
  • Files: 12
  • Total tokens: 6,840
  • Revision-history entries: 0
  • @ai-include-tagged comments: 9
  • Author identities: 0
  • Hidden cells: 0
What was removed and why
  • Removed: 18 tracked changes from a drafted-and-rejected pricing tab — Reason: tracked-changes-strip rule
  • Removed: 5 prior revisions of "Q3-plan-FINAL-v3.docx" — Reason: revision-history-strip rule
  • Removed: 73 comments with no @ai-include tag — Reason: comment-prune rule (default)
  • Removed: EXIF + geolocation on 4 customer-supplied PDFs — Reason: pdf-metadata-strip rule
With and without context hygiene

The same prompt, two very different bundles.

Uploading a Word doc with tracked changes to Claude
Without
Every accepted and rejected edit, every comment, every prior author, and the full revision history flow into the model. The AI summary cites a rejected pricing draft as if it were current.
With
Hygiene strips revision XML and tracked changes; @ai-include comments survive, the rest don't. The model sees the current document only — and the audit log records what was withheld and why.
Sharing a 200-message Slack thread for AI summary
Without
All 200 messages, including off-topic asides, internal jokes, and DMs accidentally pasted into the channel, ship to OpenAI as one big blob.
With
Topic classifier keeps only the messages about the prompt subject. Off-topic and out-of-window messages drop. The summary is tighter, the cost is lower, and the audit log shows what was filtered.
Asking ChatGPT to summarize a quarter's emails on a customer
Without
Every quoted prior thread, every internal forward, every tracking pixel, every X-header, and every cc'd colleague's signature flows into the context window — half of it duplicate, half of it irrelevant.
With
Email parser strips quote chains and headers; cross-integration dedup collapses repeats; recency window keeps the relevant 90 days. The bundle goes from 60K tokens to 7K with no loss of meaning.
Sharing a financial spreadsheet with hidden columns
Without
Hidden columns and sheets, including the salary tab and last quarter's draft pricing, ingest silently. The AI cites them in its answer because to the model they're just data.
With
Spreadsheet parser drops hidden rows, columns, and sheets at parse time. Visible cells only. Comments tagged @ai-include survive; the rest are logged as withheld.
Sharing a customer-supplied PDF for AI review
Without
EXIF data including the customer's device ID and GPS coordinates ships to the AI vendor. "Black-rectangle" redactions that weren't flattened are visible to the model's text extractor.
With
PDF parser strips EXIF, geolocation, embedded JavaScript, and form metadata; flattens any unflattened redactions before extraction. Only the intended content reaches the model.
Summarizing a document set with overlapping versions
Without
v1, v2, v2-final, v2-final-2, and v3-FINAL of the same plan all upload. The AI reads contradictory drafts side by side and produces a confused summary that mixes them.
With
Cross-integration dedup collapses near-duplicates to one canonical (most-recent) version. Older drafts are logged as withheld; the model gets one coherent source per topic.
Auditor-readable

Citations, not vague reassurances.

Each hygiene control maps to a specific clause in the framework your auditor is reading. Evidence is exportable from the audit trail with the rule that fired and a hash of the dropped content for every decision.

Framework · Citation · PortEden control

GDPR · Art. 5(1)(c) — Data minimisation
Personal data "adequate, relevant and limited to what is necessary." Hygiene engine enforces minimisation at egress; every drop produces evidence.

GDPR · Art. 25 — Data protection by default
Minimisation is the default state of the system, not opt-in. Bundles are hygiene-processed before transmission to any AI client unless explicitly overridden.

HIPAA · §164.502(b) — Minimum Necessary
Uses, disclosures, and requests of PHI limited to the minimum necessary. Per-integration minimisation rules with per-request audit evidence.

HIPAA · §164.514(d) — Minimum Necessary requirements
Identification of persons or classes that need access, criteria limiting the disclosure, and a review process. PortEden's policy groups plus audit trail produce evidence for all three.

SOC 2 · CC6.7 — Confidential transmission
Confidential information identified and protected during transmission. Hygiene reduces the confidential surface area before any prompt leaves the perimeter.

CCPA / CPRA · §1798.140(ag) — Service-provider purpose limitation
Personal information used only for the contracted purpose. Hygiene strips out-of-scope context before transmission to the AI service provider.

ISO 27001 · A.5.34 — PII protection
Privacy and protection of PII processed by the organization. Format-aware metadata stripping applied to every bundle leaving the integration boundary.

NIST 800-53 · SC-28 — Information at rest
Protection of information at rest, applied here to context bundles staged at PortEden's perimeter before transmission to the AI client. Encrypted, minimized, and logged.
Where context hygiene runs

Every source the AI pulls context from.

Gmail
Outlook
Google Calendar
Google Drive
Entra ID
Slack
Microsoft Teams
Jira
Confluence
Notion
Asana
Linear
Use cases

One hygiene engine, six regulated workflows.

Architecture

Format-aware, composable, and auditable.

Generic regex misses revision XML, hidden sheets, and EXIF blocks. PortEden ships a format-aware parser for every major file type, composes cleanly with the redaction engine, and produces a per-decision audit record so an auditor never has to ask "what was sent and what wasn't?"

Format-aware parsers

Word, Excel, PDF, Outlook MSG, Markdown, source code each get a parser that knows where the metadata hides. Generic regex doesn't catch revision XML, hidden sheets, or PDF form scripts.

Composable with redaction

Context hygiene strips structural metadata; the redaction engine masks identifiers in the visible content. Combined, they cover both layers — the metadata you didn't see and the values you did.

Auditable minimization

Every dropped section is logged with the rule that fired and a hash of the original. Auditors get a precise, per-request answer to "what was sent and what wasn't?" without reconstructing it.
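A per-decision record might look like the sketch below: the dropped content itself is never retained, only a SHA-256 hash, the rule that fired, and a timestamp. The field names are illustrative, not PortEden's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_drop(content: str, rule: str, user: str, ai_client: str) -> dict:
    """Record a hygiene decision without retaining the dropped content:
    only its SHA-256 hash, the rule that fired, and a timestamp are
    kept, so an auditor can verify what was withheld."""
    return {
        "event": "hygiene.drop",
        "rule": rule,
        "content_sha256": hashlib.sha256(content.encode()).hexdigest(),
        "content_tokens": len(content.split()),  # rough size indicator
        "user": user,
        "ai_client": ai_client,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```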

Available in: Pro, Business, Enterprise tiers · See pricing

Context hygiene questions

What is AI context hygiene and how is it different from redaction?
Redaction masks sensitive values inside the visible content — names, account numbers, secrets — so the model can't read them. Context hygiene operates one layer up: even before redaction, it decides which parts of a file or thread should be in the context bundle at all. A Word document carries tracked changes, revision history, comments, author metadata, embedded fonts, and document properties. A spreadsheet hides columns and sheets. An email quotes every prior reply. Context hygiene strips that structural metadata, dedupes overlapping sources, and prunes off-topic and stale content. Redaction is for what's visible. Hygiene is for what's loaded.
What kinds of metadata get stripped from a context bundle?
PortEden's hygiene engine knows the metadata surfaces of every file format it parses. Word: revision history, tracked changes, comments, author names, last-modified timestamps, custom XML, embedded font names, document properties. Excel: hidden rows, hidden columns, hidden sheets, comment threads, conditional-formatting rules, named-range remnants, formulas pointing to source paths. PDF: EXIF, geolocation, device IDs, embedded JavaScript, form metadata, redaction artifacts that were never properly flattened. Email: tracking pixels, prior thread quotes, recipient chains, X-headers, message IDs, reply-to chains. Source code: .git directories, embedded credentials in tests, stale TODO comments. None of it is in the visible content; all of it flows into the model unless something strips it.
Will the AI lose useful context when this runs?
No — that's the design point. The hygiene engine only removes content that is structural metadata, an exact or near-duplicate of content already in the bundle, demonstrably outside the policy time window, or off-topic per a topic classifier scored against the user's prompt. Visible body text, the most-recent revision, comments tagged @ai-include, and on-topic paragraphs all survive. In our internal benchmarks, hygiene reduces bundle size 80–90% with zero loss of answer quality on a paired set of summarization, extraction, and Q&A tasks. If a hygiene rule ever drops something that was actually relevant, the audit log shows the rule and a hash of the dropped content so you can tune it.
How does it handle file formats I haven't listed?
PortEden ships format-aware parsers for Word (DOCX), Excel (XLSX), PowerPoint (PPTX), PDF, Outlook MSG, EML, plain text, Markdown, HTML, and the major source-code languages. For unknown formats, the engine falls back to a content-only extraction that drops everything that isn't text body. You can also register a custom parser via the management API if you have a proprietary format with structured metadata you want handled cleanly. The fallback never sends raw binary or unparsed metadata to the model — when in doubt, hygiene strips first and asks questions later.
Can I tag content as @ai-include to keep it through pruning?
Yes. PortEden honors a small set of inline directives that authors can place in comments, document properties, or headers: @ai-include keeps a section through pruning even if it scores low on the topic classifier; @ai-exclude drops a section even if it scores high; @ai-redact-only sends the content but with redaction applied; @ai-summary-only sends a one-line summary instead of the body. Tags are evaluated before the topic and recency filters run. This gives subject-matter experts a deterministic way to control what the AI sees without writing policy in YAML — useful for legal, medical, and engineering teams whose comment culture is already structured.
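The directive pass resolves before the topic and recency filters run, and can be pictured as a simple precedence function. The exact precedence shown here (exclude winning over include when both are present) is an assumption of this sketch, not documented PortEden behavior.

```python
def apply_directive(section):
    """Resolve inline directives ahead of topic/recency filtering.
    Returns 'drop', 'keep', 'redact', 'summarize', or None (no
    directive; fall through to the classifier). Exclude-over-include
    precedence is an assumption of this sketch."""
    tags = section.get("tags", set())
    if "@ai-exclude" in tags:
        return "drop"
    if "@ai-include" in tags:
        return "keep"
    if "@ai-redact-only" in tags:
        return "redact"
    if "@ai-summary-only" in tags:
        return "summarize"
    return None
```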
Does it dedupe across multiple integrations?
Yes — that's one of the highest-leverage features. When a user attaches a Drive folder and a Slack thread to the same prompt, the same content often appears in both: a doc shared in chat, an email forwarded to a channel, a meeting note copied into a wiki. PortEden's deduper runs paragraph-level similarity (MinHash + cosine on small embeddings) across all sources in the bundle and collapses near-duplicates to one canonical instance, with citation paths preserved so the model can still attribute. This alone saves 20–40% of tokens in a typical research-style prompt. Cross-integration dedup is on the Business and Enterprise tiers.
How is "stale" or "off-topic" content decided?
Two filters, both configurable. Recency: a per-integration time window (Slack: last 30 days; email: last 90 days; wiki: last 12 months) applied to last-modified-at, with overrides for content tagged as canonical. Topic: a small classifier scores each paragraph for relatedness to the user's prompt; paragraphs below a threshold are dropped, and the threshold is tuned per use-case (broad summarization runs more conservatively than narrow extraction). Both filters log every drop with the rule name, the score, and a content hash, so you can audit decisions and tune the policy if it's too aggressive.
Will my AI output be smaller, faster, and cheaper as a result?
Yes, all three. Smaller: bundles average 80–90% reduction on documents with revision history, hidden sheets, or quoted email threads, and 40–60% on cleaner sources with cross-integration dedup. Faster: prefill latency in most LLMs scales with input token count, so a 10K-token bundle responds noticeably faster than an 80K-token bundle. Cheaper: input tokens are billed; the bundle that doesn't ship doesn't bill. For a team running 10K AI prompts per month against typical knowledge work, hygiene typically pays for itself in token savings before any compliance benefit is counted. Smaller bundles also reduce hallucination by giving the model a tighter relevance signal.
What evidence does this produce for GDPR Art. 5(1)(c) data-minimisation auditors?
Article 5(1)(c) directs controllers to keep personal data "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." When you send a Word document to an AI vendor, the visible content is what's necessary; the revision history, comments, author metadata, and embedded fonts typically are not. Article 25 reinforces this with "data protection by default." PortEden enforces minimisation at the egress boundary by default: every bundle is hygiene-processed before transmission, every drop is logged with the rule that fired and a hash of the dropped content, and the audit trail produces per-request evidence that controllers can hand to a supervisory authority on request. Compliance with GDPR remains your responsibility — PortEden provides the technical control, you operate the program around it.
What evidence does this produce for HIPAA Minimum Necessary auditors?
The Privacy Rule's Minimum Necessary standard at §164.502(b), with implementation requirements in §164.514(d), directs covered entities to limit uses, disclosures, and requests of PHI to the minimum necessary for the intended purpose. "Send the whole chart to the AI scribe" rarely meets that bar; "send only the encounter note relevant to the current visit" usually does. PortEden's hygiene engine produces per-integration minimisation events — what was sent, what was withheld, and the rule that fired — so auditors get a per-request answer to "was the disclosure limited to the minimum necessary?" without reconstructing it from screenshots. Compliance with HIPAA remains your responsibility — PortEden provides the technical control, you operate the program around it.
Are hygiene actions logged?
Every action — every metadata field stripped, every duplicate collapsed, every paragraph pruned, every directive honored — is recorded in the audit trail with the user, the AI client, the integration, the rule that fired, the score (for topic and recency drops), a hash of the dropped content, and a timestamp. Logs export to SIEM (Splunk, Datadog, Elastic) or to a signed CSV for evidence collection. The audit-trail product surfaces hygiene events alongside redaction and access-control events, so a single timeline shows exactly what was sent to which AI client and why everything else was withheld.
What pricing tier includes context hygiene?
Format-aware metadata stripping for the major file formats (Word, Excel, PDF, email) and per-integration recency windows are included on the Pro tier. Cross-integration deduplication and topic-classifier pruning are on the Business tier. Custom inline directives at scale, SSO/SAML, SCIM, and SIEM export are on the Enterprise tier. See pricing for the full breakdown.

Send less to the model. Get faster, cheaper, safer answers.

Set up context hygiene in under 10 minutes. Free tier covers solo users; Enterprise adds SSO/SAML, SCIM, change-control workflows, and SIEM export of every hygiene decision.

Talk to sales