Product · AI Context Hygiene

The AI doesn't need everything in the file — we send only what it actually needs.

PortEden's context-hygiene engine strips hidden metadata, dedupes overlapping sources, and prunes off-topic content from every context bundle before it reaches Claude, ChatGPT, Copilot, or Gemini. Data minimization for AI — mapped to GDPR Art. 5(1)(c) and HIPAA Minimum Necessary, with per-request audit evidence.

See pricing

Free tier · No credit card · Works with any AI client

Mapped to the frameworks your auditor reads
GDPR Art. 5(1)(c) · HIPAA Minimum Necessary · GDPR Art. 25 · SOC 2 CC6.7 · ISO 27001 A.5.34 · CCPA / CPRA · NIST 800-53 SC-28
The problem

Files carry far more than what's visible on screen.

Even when access control allows the file and redaction masks the identifiers, the context bundle the AI receives is loaded with hidden metadata, duplicate sources, and off-topic content. The visible body is a small fraction of what actually ships to Claude, ChatGPT, Copilot, or Gemini. Every byte is a billing line, a latency cost, and a leak surface.

Word, Excel, and PDF files carry their history

Track changes, revision authors, embedded comments, and previous edits all flow into the model alongside the visible content. The AI doesn't know the difference between today's draft and last quarter's rejected pricing.

Email threads quote everything

Every reply quotes the entire prior thread; uploading one message uploads months of conversation, including off-topic asides, internal disagreements, and the customer complaint from three quarters ago that's no longer relevant.
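As a rough illustration of what quote-chain stripping looks like, here is a simplified heuristic in Python. This is a sketch only, not PortEden's actual parser; a production engine works on the MIME structure and header metadata rather than line patterns, and the `On ... wrote:` marker is just one common attribution format.

```python
import re

def strip_quoted_thread(body: str) -> str:
    """Keep only the immediate message: drop '>'-prefixed quoted lines
    and everything below a reply-attribution marker such as
    'On ... wrote:'. Illustrative heuristic only."""
    kept = []
    for line in body.splitlines():
        # Stop at the attribution line that introduces the quoted thread.
        if re.match(r"^On .+ wrote:\s*$", line):
            break
        # Skip individual quoted lines.
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Run on a reply that quotes two levels of prior thread, only the new sentence survives.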

Spreadsheets have hidden columns and rows

The visible sheet looks fine; the data the AI ingests includes the hidden "salary" column and last quarter's draft pricing. Nothing in the file viewer warns you, and nothing in the AI client warns you either.
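The core of hidden-cell hygiene is simple once the sheet is parsed: filter on the visibility flags the file format already carries. The sketch below assumes a simplified parsed form (a `cells` dict plus `hidden_rows`/`hidden_cols` sets); a real XLSX parser would read these flags from the sheet XML.

```python
def visible_cells(sheet):
    """Return only the cells a human would see in the viewer.
    `sheet` is a simplified parsed form: {'cells': {(row, col): value},
    'hidden_rows': set, 'hidden_cols': set}, a stand-in for what a
    real XLSX parser would produce from the sheet XML."""
    return {
        (r, c): v
        for (r, c), v in sheet["cells"].items()
        if r not in sheet["hidden_rows"] and c not in sheet["hidden_cols"]
    }
```

The hidden "salary" column simply never enters the bundle, because the filter runs before any text extraction.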

The firewall in action

One file, one prompt — two very different bundles.

The same Word document attached to the same prompt. On the left, what you'd send if nothing minimized it. On the right, what PortEden actually transmits to the model.

What you'd send
Bloated
  • Original document body: 1 file
  • Tracked changes: +14 edits
  • Revision history: +8 prior versions
  • Comments thread: +12 comments
  • Author metadata (names, emails, locations): +6 identities
  • EXIF (geolocation, device IDs): +4 fields
  • Hidden columns and sheets: +3 hidden
  • Email-thread quotes (prior replies): +5 quoted
  • Tracking pixels: +2 beacons
Total: ~12,400 tokens
What PortEden sends
Minimized
  • Document body, current revision: 1 file
  • Comments tagged @ai-include: +2 kept
  • No tracked changes: 0
  • No prior revisions: 0
  • No author metadata: 0
  • No EXIF or geolocation: 0
  • No hidden cells: 0
  • Immediate email message only: 1
  • No tracking pixels: 0
Total: ~1,800 tokens

85% smaller, 100% relevant. Faster, cheaper, less to leak.

Coverage

Six categories of context, all cleaned at the boundary.

Format-aware parsers know exactly where the metadata hides. Cross-source deduplication and a topic classifier handle what the parsers can't see. Every category applies uniformly across every integration and every AI client.

Document metadata

What hides in a Word, PowerPoint, or PDF file?

A 4-page proposal carries the author's full name, last 8 revisions, 12 comments from internal review, and an embedded font catalog.

  • Author names and corporate identities
  • Last-modified timestamps and edit duration
  • Revision history (every prior version)
  • Tracked changes (every accepted and rejected edit)
  • Comment threads and their authors
  • Document properties and custom XML
  • Embedded font names and license metadata

Email metadata

What ships with an email upload?

A single forwarded message brings 5 quoted prior emails, 2 tracking pixels, the full recipient chain, and 14 X-headers naming internal mail servers.

  • Tracking pixels and read-receipt beacons
  • Prior thread quotes (every nested reply)
  • Full recipient chains (To, Cc, Bcc, Resent-To)
  • X-headers naming internal infrastructure
  • Message IDs and threading metadata
  • Reply-to chains and routing history

Spreadsheet hygiene

What does a spreadsheet bring that you didn't intend?

An Excel file with one visible tab carries 3 hidden sheets, 6 hidden columns, formulas referencing a fileshare path, and 47 cell comments left from review.

  • Hidden rows, columns, and entire worksheets
  • Formulas that leak source paths and shares
  • Cell comments and threaded discussions
  • Conditional-formatting rules and named ranges
  • Pivot-cache remnants from prior queries
  • Custom-defined names referencing other workbooks
  • External-data connection strings

PDF & image hygiene

What's hiding in a PDF or photo?

A scanned PDF carries device IDs, GPS coordinates from the phone that captured it, and "black-rectangle" redactions that aren't actually flattened.

  • EXIF blocks (camera, lens, software)
  • Geolocation (latitude, longitude, altitude)
  • Device identifiers and serial numbers
  • Embedded JavaScript and form scripts
  • PDF form metadata and field history
  • Redaction artifacts that aren't truly flattened

Source dedup

What overlaps when multiple integrations are in play?

A Drive folder, a Slack channel, and a wiki page all contain the same Q3 plan paragraph in slightly different drafts. The AI gets it three times.

  • Cross-integration paragraph similarity matching
  • Near-duplicate collapse (≥85% similarity threshold)
  • Canonical-instance selection by recency and source authority
  • Citation paths preserved on collapsed duplicates
  • Quote-of-quote detection across email and chat
  • Snippet-level dedup inside long documents

Stale-context pruning

What is no longer in scope for this prompt?

An email thread spans 8 months of unrelated topics; only the last 3 weeks of replies are about the customer issue the user is asking about.

  • Topic classifier scored against the user prompt
  • Per-integration recency windows (e.g., Slack 30d, wiki 12mo)
  • Time-decayed weighting on borderline paragraphs
  • Out-of-policy-window content dropped automatically
  • @ai-include directive overrides for canonical content
  • Drop reasoning logged for every paragraph
How it works

Strip. Dedupe. Prune. Minimize.

1. Strip

Each file is parsed by a format-aware parser (Word, Excel, PDF, MSG, source code) that knows where the metadata hides. Revision XML, hidden sheets, tracking pixels, EXIF blocks, and embedded scripts are removed; only the visible-content layer survives.
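To make the Word case concrete: in WordprocessingML, rejected deletions live in `w:del` elements and pending insertions in `w:ins` wrappers. The sketch below strips a simplified, namespace-free fragment with the standard library; PortEden's actual DOCX parser handles the full namespaced schema, but the shape of the operation is the same.

```python
import xml.etree.ElementTree as ET

def strip_tracked_changes(xml_text: str) -> str:
    """Drop pending/rejected edits from a (simplified) WordprocessingML
    fragment: <del> elements are removed outright, <ins> wrappers are
    unwrapped so accepted insertions keep their runs. Real DOCX XML is
    namespaced (w:del, w:ins); this sketch omits namespaces for clarity."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):          # snapshot before mutating
        for child in list(parent):
            if child.tag == "del":
                parent.remove(child)          # deleted text never ships
            elif child.tag == "ins":
                idx = list(parent).index(child)
                parent.remove(child)
                for grand in reversed(list(child)):
                    parent.insert(idx, grand) # keep the inserted runs
    return ET.tostring(root, encoding="unicode")
```

After this pass, "last quarter's rejected pricing" in a `<del>` block is simply absent from the extracted text.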

2. Dedupe

Paragraphs across every source in the bundle are compared with MinHash + small-embedding cosine similarity. Near-duplicates collapse to one canonical instance with citation paths preserved, so the model can still attribute without re-reading the same content.
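A minimal MinHash sketch shows how near-duplicate detection works without comparing every paragraph pair in full. This is illustrative, pure-stdlib code; the shingle size, 64 hash functions, and SHA-1-based seeded hashing are assumptions of this sketch, not PortEden's tuned parameters, and the production path pairs MinHash with embedding cosine similarity.

```python
import hashlib

def shingles(text: str, k: int = 3):
    """Word k-grams; the unit of overlap for similarity estimation."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64):
    """One minimum per seeded hash function; the fraction of equal
    positions between two signatures estimates Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_similarity(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Paragraph pairs scoring at or above the collapse threshold (the page cites ≥85%) reduce to one canonical instance.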

3. Prune

A topic classifier scores each surviving paragraph against the user's prompt; a recency filter applies the policy time window. Off-topic paragraphs and out-of-window content drop, with the rule name and score recorded for every decision.
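The two prune filters compose like this. The sketch uses naive term overlap as a stand-in for the real topic classifier, and the threshold value is an assumption; what matters is the shape: recency gate first, topic score second, and every drop returned with the rule name that caused it.

```python
from datetime import datetime, timedelta, timezone

def prune(paragraphs, prompt_terms, window_days, min_score=0.2, now=None):
    """Keep paragraphs inside the recency window that also score above
    the topic threshold. Scoring is naive term overlap, a stand-in for
    the real classifier; every drop carries its rule name for the log."""
    now = now or datetime.now(timezone.utc)
    kept, dropped = [], []
    for p in paragraphs:
        if now - p["modified_at"] > timedelta(days=window_days):
            dropped.append((p, "recency-window"))
            continue
        words = set(p["text"].lower().split())
        score = len(words & prompt_terms) / max(1, len(prompt_terms))
        if score < min_score:
            dropped.append((p, f"topic-threshold score={score:.2f}"))
        else:
            kept.append(p)
    return kept, dropped
```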

4. Minimize

A final pass trims the bundle to the configured token budget using priority weights (recency, relevance, @ai-include directives) so the most-relevant content survives first. The result is a tight, on-topic bundle the model can actually use.
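The budget pass can be sketched as a greedy selection over priority weights. The weight ordering (directive over relevance over recency) follows the description above; treating token counts as precomputed and keeping whole paragraphs are simplifications of this sketch.

```python
def minimize(paragraphs, budget):
    """Greedy budget pass: sort by priority weight (@ai-include beats
    relevance beats recency) and keep whole paragraphs until the token
    budget is spent. Token counts are assumed precomputed upstream."""
    def weight(p):
        # bool sorts as 0/1, so directive-tagged content always wins.
        return (p.get("ai_include", False), p["relevance"], p["recency"])
    kept, used = [], 0
    for p in sorted(paragraphs, key=weight, reverse=True):
        if used + p["tokens"] <= budget:
            kept.append(p)
            used += p["tokens"]
    return kept, used
```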

See it in action

A real prompt, a real bundle, a real diff.

A user asks Claude to summarize the Q3 deal pipeline with a Drive folder attached. Here's what would have shipped to the model — and here's what actually did, after hygiene ran.

The user prompt
With Drive folder attached

"Summarize Q3 deal pipeline."

Attached: /Drive/Sales/Q3-pipeline/ (12 files, last modified within 90 days)

Input bundle (raw)
Pre-hygiene
  • Files: 12
  • Total tokens: 47,200
  • Revision-history entries: 31
  • Comments: 84
  • Author identities: 22
  • Hidden columns: 6
After-hygiene bundle
Sent to model
  • Files: 12
  • Total tokens: 6,840
  • Revision-history entries: 0
  • @ai-include-tagged comments: 9
  • Author identities: 0
  • Hidden cells: 0
What was removed and why
  • Removed: 18 tracked changes from a drafted-and-rejected pricing tab — Reason: tracked-changes-strip rule
  • Removed: 5 prior revisions of "Q3-plan-FINAL-v3.docx" — Reason: revision-history-strip rule
  • Removed: 73 comments with no @ai-include tag — Reason: comment-prune rule (default)
  • Removed: EXIF + geolocation on 4 customer-supplied PDFs — Reason: pdf-metadata-strip rule
With and without context hygiene

The same prompt, two very different bundles.

Uploading a Word doc with tracked changes to Claude
Without
Every accepted and rejected edit, every comment, every prior author, and the full revision history flow into the model. The AI summary cites a rejected pricing draft as if it were current.
With
Hygiene strips revision XML and tracked changes; @ai-include comments survive, the rest don't. The model sees the current document only — and the audit log records what was withheld and why.
Sharing a 200-message Slack thread for AI summary
Without
All 200 messages, including off-topic asides, internal jokes, and DMs accidentally pasted into the channel, ship to OpenAI as one big blob.
With
Topic classifier keeps only the messages about the prompt subject. Off-topic and out-of-window messages drop. The summary is tighter, the cost is lower, and the audit log shows what was filtered.
Asking ChatGPT to summarize a quarter's emails on a customer
Without
Every quoted prior thread, every internal forward, every tracking pixel, every X-header, and every cc'd colleague's signature flows into the context window — half of it duplicate, half of it irrelevant.
With
Email parser strips quote chains and headers; cross-integration dedup collapses repeats; recency window keeps the relevant 90 days. The bundle goes from 60K tokens to 7K with no loss of meaning.
Sharing a financial spreadsheet with hidden columns
Without
Hidden columns and sheets, including the salary tab and last quarter's draft pricing, ingest silently. The AI cites them in its answer because to the model they're just data.
With
Spreadsheet parser drops hidden rows, columns, and sheets at parse time. Visible cells only. Comments tagged @ai-include survive; the rest are logged as withheld.
Sharing a customer-supplied PDF for AI review
Without
EXIF data including the customer's device ID and GPS coordinates ships to the AI vendor. "Black-rectangle" redactions that weren't flattened are visible to the model's text extractor.
With
PDF parser strips EXIF, geolocation, embedded JavaScript, and form metadata; flattens any unflattened redactions before extraction. Only the intended content reaches the model.
Summarizing a document set with overlapping versions
Without
v1, v2, v2-final, v2-final-2, and v3-FINAL of the same plan all upload. The AI reads contradictory drafts side by side and produces a confused summary that mixes them.
With
Cross-integration dedup collapses near-duplicates to one canonical (most-recent) version. Older drafts are logged as withheld; the model gets one coherent source per topic.
Auditor-readable

Citations, not vague reassurances.

Each hygiene control maps to a specific clause in the framework your auditor is reading. Evidence is exportable from the audit trail with the rule that fired and a hash of the dropped content for every decision.

Framework · Citation · PortEden control

GDPR · Art. 5(1)(c) — Data minimisation
Personal data "adequate, relevant and limited to what is necessary." Hygiene engine enforces minimisation at egress; every drop produces evidence.

GDPR · Art. 25 — Data protection by default
Minimisation is the default state of the system, not opt-in. Bundles are hygiene-processed before transmission to any AI client unless explicitly overridden.

HIPAA · §164.502(b) — Minimum Necessary
Uses, disclosures, and requests of PHI limited to the minimum necessary. Per-integration minimisation rules with per-request audit evidence.

HIPAA · §164.514(d) — Minimum Necessary requirements
Identification of persons or classes that need access, criteria limiting the disclosure, and a review process. PortEden's policy groups plus audit trail produce evidence for all three.

SOC 2 · CC6.7 — Confidential transmission
Confidential information identified and protected during transmission. Hygiene reduces the confidential surface area before any prompt leaves the perimeter.

CCPA / CPRA · §1798.140(ag) — Service-provider purpose limitation
Personal information used only for the contracted purpose. Hygiene strips out-of-scope context before transmission to the AI service provider.

ISO 27001 · A.5.34 — PII protection
Privacy and protection of PII processed by the organization. Format-aware metadata stripping applied to every bundle leaving the integration boundary.

NIST 800-53 · SC-28 — Information at rest
Protection of information at rest, applied here to context bundles staged at PortEden's perimeter before transmission to the AI client. Encrypted, minimized, and logged.
Where context hygiene runs

Every source the AI pulls context from.

Gmail
Outlook
Google Calendar
Google Drive
Entra ID
Slack
Microsoft Teams
Jira
Confluence
Notion
Asana
Linear
Use cases

One hygiene engine, six regulated workflows.

Architecture

Format-aware, composable, and auditable.

Generic regex misses revision XML, hidden sheets, and EXIF blocks. PortEden ships a format-aware parser for every major file type, composes cleanly with the redaction engine, and produces a per-decision audit record so an auditor never has to ask "what was sent and what wasn't?"

Format-aware parsers

Word, Excel, PDF, Outlook MSG, Markdown, source code each get a parser that knows where the metadata hides. Generic regex doesn't catch revision XML, hidden sheets, or PDF form scripts.

Composable with redaction

Context hygiene strips structural metadata; the redaction engine masks identifiers in the visible content. Combined, they cover both layers — the metadata you didn't see and the values you did.

Auditable minimization

Every dropped section is logged with the rule that fired and a hash of the original. Auditors get a precise, per-request answer to "what was sent and what wasn't?" without reconstructing it.
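A per-decision record might look like the sketch below: the dropped content itself is never retained, only a SHA-256 hash, the rule that fired, and a timestamp. The field names are illustrative, not PortEden's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_drop(content: str, rule: str, user: str, ai_client: str) -> dict:
    """Record a hygiene decision without retaining the dropped content:
    only its SHA-256 hash, the rule that fired, and a timestamp are
    kept, so an auditor can verify what was withheld."""
    return {
        "event": "hygiene.drop",
        "rule": rule,
        "content_sha256": hashlib.sha256(content.encode()).hexdigest(),
        "content_tokens": len(content.split()),  # rough size indicator
        "user": user,
        "ai_client": ai_client,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```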

Available in: Pro, Business, Enterprise tiers · See pricing

Context hygiene questions

What is AI context hygiene and how is it different from redaction?
Redaction masks sensitive values inside the visible content — names, account numbers, secrets — so the model can't read them. Context hygiene operates one layer up: even before redaction, it decides which parts of a file or thread should be in the context bundle at all. A Word document carries tracked changes, revision history, comments, author metadata, embedded fonts, and document properties. A spreadsheet hides columns and sheets. An email quotes every prior reply. Context hygiene strips that structural metadata, dedupes overlapping sources, and prunes off-topic and stale content. Redaction is for what's visible. Hygiene is for what's loaded.
What kinds of metadata get stripped from a context bundle?
PortEden's hygiene engine knows the metadata surfaces of every file format it parses. Word: revision history, tracked changes, comments, author names, last-modified timestamps, custom XML, embedded font names, document properties. Excel: hidden rows, hidden columns, hidden sheets, comment threads, conditional-formatting rules, named-range remnants, formulas pointing to source paths. PDF: EXIF, geolocation, device IDs, embedded JavaScript, form metadata, redaction artifacts that were never properly flattened. Email: tracking pixels, prior thread quotes, recipient chains, X-headers, message IDs, reply-to chains. Source code: .git directories, embedded credentials in tests, stale TODO comments. None of it is in the visible content; all of it flows into the model unless something strips it.
Will the AI lose useful context when this runs?
No — that's the design point. The hygiene engine only removes content that is structural metadata, an exact or near-duplicate of content already in the bundle, demonstrably outside the policy time window, or off-topic per a topic classifier scored against the user's prompt. Visible body text, the most-recent revision, comments tagged @ai-include, and on-topic paragraphs all survive. In our internal benchmarks, hygiene reduces bundle size 80–90% with zero loss of answer quality on a paired set of summarization, extraction, and Q&A tasks. If a hygiene rule ever drops something that was actually relevant, the audit log shows the rule and a hash of the dropped content so you can tune it.
How does it handle file formats I haven't listed?
PortEden ships format-aware parsers for Word (DOCX), Excel (XLSX), PowerPoint (PPTX), PDF, Outlook MSG, EML, plain text, Markdown, HTML, and the major source-code languages. For unknown formats, the engine falls back to a content-only extraction that drops everything that isn't text body. You can also register a custom parser via the management API if you have a proprietary format with structured metadata you want handled cleanly. The fallback never sends raw binary or unparsed metadata to the model — when in doubt, hygiene strips first and asks questions later.
Can I tag content as @ai-include to keep it through pruning?
Yes. PortEden honors a small set of inline directives that authors can place in comments, document properties, or headers: @ai-include keeps a section through pruning even if it scores low on the topic classifier; @ai-exclude drops a section even if it scores high; @ai-redact-only sends the content but with redaction applied; @ai-summary-only sends a one-line summary instead of the body. Tags are evaluated before the topic and recency filters run. This gives subject-matter experts a deterministic way to control what the AI sees without writing policy in YAML — useful for legal, medical, and engineering teams whose comment culture is already structured.
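The directive pass resolves before the topic and recency filters run, and can be pictured as a simple precedence function. The exact precedence shown here (exclude winning over include when both are present) is an assumption of this sketch, not documented PortEden behavior.

```python
def apply_directive(section):
    """Resolve inline directives ahead of topic/recency filtering.
    Returns 'drop', 'keep', 'redact', 'summarize', or None (no
    directive; fall through to the classifier). Exclude-over-include
    precedence is an assumption of this sketch."""
    tags = section.get("tags", set())
    if "@ai-exclude" in tags:
        return "drop"
    if "@ai-include" in tags:
        return "keep"
    if "@ai-redact-only" in tags:
        return "redact"
    if "@ai-summary-only" in tags:
        return "summarize"
    return None
```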
Does it dedupe across multiple integrations?
Yes — that's one of the highest-leverage features. When a user attaches a Drive folder and a Slack thread to the same prompt, the same content often appears in both: a doc shared in chat, an email forwarded to a channel, a meeting note copied into a wiki. PortEden's deduper runs paragraph-level similarity (MinHash + cosine on small embeddings) across all sources in the bundle and collapses near-duplicates to one canonical instance, with citation paths preserved so the model can still attribute. This alone saves 20–40% of tokens in a typical research-style prompt. Cross-integration dedup is on the Business and Enterprise tiers.
How is "stale" or "off-topic" content decided?
Two filters, both configurable. Recency: a per-integration time window (Slack: last 30 days; email: last 90 days; wiki: last 12 months) applied to last-modified-at, with overrides for content tagged as canonical. Topic: a small classifier scores each paragraph for relatedness to the user's prompt; paragraphs below a threshold are dropped, and the threshold is tuned per use-case (broad summarization runs more conservatively than narrow extraction). Both filters log every drop with the rule name, the score, and a content hash, so you can audit decisions and tune the policy if it's too aggressive.
Will my AI output be smaller, faster, and cheaper as a result?
Yes, all three. Smaller: bundles average 80–90% reduction on documents with revision history, hidden sheets, or quoted email threads, and 40–60% on cleaner sources with cross-integration dedup. Faster: prefill latency in most LLMs scales with input token count, so a 10K-token bundle responds noticeably faster than an 80K-token bundle. Cheaper: input tokens are billed; the bundle that doesn't ship doesn't bill. For a team running 10K AI prompts per month against typical knowledge work, hygiene typically pays for itself in token savings before any compliance benefit is counted. Smaller bundles also reduce hallucination by giving the model a tighter relevance signal.
What evidence does this produce for GDPR Art. 5(1)(c) data-minimisation auditors?
Article 5(1)(c) directs controllers to keep personal data "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." When you send a Word document to an AI vendor, the visible content is what's necessary; the revision history, comments, author metadata, and embedded fonts typically are not. Article 25 reinforces this with "data protection by default." PortEden enforces minimisation at the egress boundary by default: every bundle is hygiene-processed before transmission, every drop is logged with the rule that fired and a hash of the dropped content, and the audit trail produces per-request evidence that controllers can hand to a supervisory authority on request. Compliance with GDPR remains your responsibility — PortEden provides the technical control, you operate the program around it.
What evidence does this produce for HIPAA Minimum Necessary auditors?
The Privacy Rule's Minimum Necessary standard at §164.502(b), with implementation requirements in §164.514(d), directs covered entities to limit uses, disclosures, and requests of PHI to the minimum necessary for the intended purpose. "Send the whole chart to the AI scribe" rarely meets that bar; "send only the encounter note relevant to the current visit" usually does. PortEden's hygiene engine produces per-integration minimisation events — what was sent, what was withheld, and the rule that fired — so auditors get a per-request answer to "was the disclosure limited to the minimum necessary?" without reconstructing it from screenshots. Compliance with HIPAA remains your responsibility — PortEden provides the technical control, you operate the program around it.
Are hygiene actions logged?
Every action — every metadata field stripped, every duplicate collapsed, every paragraph pruned, every directive honored — is recorded in the audit trail with the user, the AI client, the integration, the rule that fired, the score (for topic and recency drops), a hash of the dropped content, and a timestamp. Logs export to SIEM (Splunk, Datadog, Elastic) or to a signed CSV for evidence collection. The audit-trail product surfaces hygiene events alongside redaction and access-control events, so a single timeline shows exactly what was sent to which AI client and why everything else was withheld.
What pricing tier includes context hygiene?
Format-aware metadata stripping for the major file formats (Word, Excel, PDF, email) and per-integration recency windows are included on the Pro tier. Cross-integration deduplication and topic-classifier pruning are on the Business tier. Custom inline directives at scale, SSO/SAML, SCIM, and SIEM export are on the Enterprise tier. See pricing for the full breakdown.

Send less to the model. Get faster, cheaper, safer answers.

Set up context hygiene in under 10 minutes. Free tier covers solo users; Enterprise adds SSO/SAML, SCIM, change-control workflows, and SIEM export of every hygiene decision.

Talk to sales