Mostlylucid.StyloExtract.Core 2.0.1

Install

.NET CLI:
  dotnet add package Mostlylucid.StyloExtract.Core --version 2.0.1

Package Manager Console (Visual Studio; uses the NuGet module's version of Install-Package):
  NuGet\Install-Package Mostlylucid.StyloExtract.Core -Version 2.0.1

PackageReference (for projects that support it, copy this XML node into the project file):
  <PackageReference Include="Mostlylucid.StyloExtract.Core" Version="2.0.1" />

Central Package Management (CPM):
  In the solution's Directory.Packages.props file, to version the package:
    <PackageVersion Include="Mostlylucid.StyloExtract.Core" Version="2.0.1" />
  In the project file:
    <PackageReference Include="Mostlylucid.StyloExtract.Core" />

Paket:
  paket add Mostlylucid.StyloExtract.Core --version 2.0.1

F# Interactive and Polyglot Notebooks (copy into the interactive tool or the script source):
  #r "nuget: Mostlylucid.StyloExtract.Core, 2.0.1"

C# file-based apps (.NET 10 preview 4 or later; place in a .cs file before any lines of code):
  #:package Mostlylucid.StyloExtract.Core@2.0.1

Cake:
  Install as a Cake Addin:
    #addin nuget:?package=Mostlylucid.StyloExtract.Core&version=2.0.1
  Install as a Cake Tool:
    #tool nuget:?package=Mostlylucid.StyloExtract.Core&version=2.0.1

Mostlylucid.StyloExtract.Core

Extraction orchestration: wires parse, fingerprint, match, induce, apply, and render into ILayoutExtractor.ExtractAsync.

What this package is

LayoutExtractor (implements ILayoutExtractor) is the single entry point for extraction. It coordinates the full pipeline:

  1. Parse HTML via IHtmlDomParser
  2. Clean DOM via IDomCleaner
  3. Fingerprint via IStructuralFingerprinter (MinHash + LSH + anchor-path + pq-grams)
  4. Fast-path LSH match against ITemplateIndex (< 1 ms for known templates)
  5. If miss: slow-path pq-gram cosine match
  6. If novel: segment + classify + induce extractor via IExtractorInducer
  7. Apply extractor via IExtractorApplicator (or fall back to heuristic classification for novel pages)
  8. Render to Markdown via IMarkdownRenderer
  9. Record observation; trigger refit if drift threshold exceeded
  10. Emit StyloExtractSignal events via TypedSignalSink
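
Step 5's slow path compares pq-gram profiles by cosine similarity. A minimal sketch of that comparison, assuming a profile is a bag of pq-gram labels with counts. The `PqGramCosineSketch` class, the string labels, and the 0.8 threshold are illustrative assumptions, not this package's API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Mostlylucid.StyloExtract.Core.
public static class PqGramCosineSketch
{
    // Cosine similarity between two bags of pq-gram labels (label -> count).
    public static double Cosine(
        IReadOnlyDictionary<string, int> a,
        IReadOnlyDictionary<string, int> b)
    {
        double dot = 0, normA = 0, normB = 0;
        foreach (var (label, count) in a)
        {
            normA += (double)count * count;
            if (b.TryGetValue(label, out var other))
                dot += (double)count * other;
        }
        foreach (var count in b.Values)
            normB += (double)count * count;

        if (normA == 0 || normB == 0) return 0; // an empty profile matches nothing
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Assumed cut-off; the real matcher's threshold is not documented here.
    public static bool IsSlowPathHit(
        IReadOnlyDictionary<string, int> page,
        IReadOnlyDictionary<string, int> template,
        double threshold = 0.8) => Cosine(page, template) >= threshold;
}
```

Identical profiles score close to 1 and profiles with no shared pq-grams score 0, so a page that misses the LSH fast path can still be matched to a structurally similar template.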

When to depend on this directly

Consumed transitively by Mostlylucid.StyloExtract.AspNetCore. Take a direct dependency only if you are wiring the DI registrations manually (e.g. in a non-ASP.NET host) or adding the LayoutExtractor to a custom container.

Usage

// Standard wiring via AddStyloExtract (preferred)
builder.Services.AddStyloExtract(o => { o.StorePath = "store.db"; });

// Resolve from DI and call. Here sp is the built IServiceProvider
// and html holds the HTML of the page to extract.
var extractor = sp.GetRequiredService<ILayoutExtractor>();
var result = await extractor.ExtractAsync(
    html,
    new Uri("https://example.com/article"),
    new ExtractionOptions { Profile = ExtractionProfile.RagFull });

Console.WriteLine(result.Markdown);            // the rendered Markdown
Console.WriteLine(result.Match.Status);        // FastPathHit on repeat visits
Console.WriteLine(result.Match.TemplateVersion);

AOT

This package is marked IsAotCompatible=true.



Target frameworks

Compatible: net10.0.
Computed: net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

NuGet packages (1)

One NuGet package depends on Mostlylucid.StyloExtract.Core:

Mostlylucid.StyloExtract.AspNetCore

AddStyloExtract() DI extensions for ASP.NET Core. The response-policy framework (IResponsePolicy) is the canonical response-transformation primitive: Markdown content negotiation and cache-hint emission are the first two built-in instances. Brings in the full StyloExtract stack wired through Microsoft.Extensions.DependencyInjection. Opt-in middleware, per-action attributes, and Minimal API extensions transparently convert HTML responses to Markdown when clients send Accept: text/markdown. Browser-friendly query-string Accept override and opt-in IDistributedCache support included.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version           Downloads  Last Updated
2.0.1             43         6/30/2026
2.0.0             63         6/28/2026
1.8.0             72         6/27/2026
1.8.0-alpha.23    57         6/27/2026
1.8.0-alpha.22    54         6/27/2026
1.8.0-alpha.21    51         6/27/2026
1.8.0-alpha.20    66         6/27/2026
1.8.0-alpha.19    58         6/26/2026
1.8.0-alpha.18    77         6/26/2026
1.8.0-alpha.17    84         6/26/2026
1.8.0-alpha.16    82         6/26/2026
1.8.0-alpha.15    68         6/26/2026
1.8.0-alpha.14    61         6/26/2026
1.8.0-alpha.13    53         6/26/2026
1.8.0-alpha.12    55         6/26/2026
1.8.0-alpha.11    57         6/26/2026
1.8.0-alpha.10    58         6/26/2026
1.8.0-alpha.9     62         6/25/2026
1.8.0-alpha.8     59         6/25/2026
1.8.0-alpha.4     59         6/25/2026

StyloExtract 2.0.0 - 2026-06-28
================================

First stable release. Closes Phase 1 + Phase 2 of the identity-claim
rework that ran across alpha.22, alpha.23, and in-flight code that was
never tagged. Stable means the v2 API contracts (IdentityClaim, the
streaming options, the operator-template shape with Claims, the
apply-time quality gate) are now things consumers can build on.

What's new since 1.8.0-alpha.21
-------------------------------

Identity-claim primitive (Phase 1)

- New `IdentityClaim` type — outermost-first ancestor chain of
 (tag, id, classes, data-* / aria-* / role) entries, anchoring every
 selector by stable identity rather than by CSS string.
- `DefaultClassStabilityFilter` rejects hash-shaped class tokens
 (Tailwind JIT names, CSS-module hashes, build-time churn) so that
 emitted claims survive across visits.
- Inducer is identity-aware end-to-end: cardinality-aware uniqueness
 for repeated roles, narrow tripwires for the streaming side, no
 CSS-string emission anywhere on the apply path.
- Layout extractor's apply path runs on `IdentityClaimApplicator`;
 the old CSS-string applicator is gone.
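
To illustrate what "hash-shaped" means for a class token, here is a rough sketch of a stability check. The two rules it uses (bracketed arbitrary values, and a trailing segment of five or more characters that mixes letters and digits) are assumptions made for this example; `DefaultClassStabilityFilter`'s actual rules are not reproduced here, and `ClassStabilitySketch` is a hypothetical name.

```csharp
using System;
using System.Linq;

// Hypothetical illustration; not the DefaultClassStabilityFilter implementation.
public static class ClassStabilitySketch
{
    public static bool IsStable(string token)
    {
        if (string.IsNullOrWhiteSpace(token)) return false;

        // Assumed rule 1: bracketed arbitrary values such as w-[137px] churn with the build.
        if (token.IndexOfAny(new[] { '[', ']' }) >= 0) return false;

        var segments = token.Split(new[] { '-', '_' }, StringSplitOptions.RemoveEmptyEntries);
        if (segments.Length == 0) return false;

        // Assumed rule 2: a long trailing segment mixing letters and digits looks like a hash
        // (for example css-1q2w3e or Button_root__3xK9a).
        var last = segments[^1];
        bool looksHashed = last.Length >= 5 && last.Any(char.IsDigit) && last.Any(char.IsLetter);
        return !looksHashed;
    }
}
```

Under these assumed rules `article-body` and `col-md-6` pass, while `css-1q2w3e` and `w-[137px]` are rejected.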

Streaming gateway: exact tripwire matching + bounded memory

- The streaming scanner shifted from MinHash + LSH bands to exact
 `IdentityClaim` matching against the per-event hash data the
 tokenizer carries on each `TagEvent` (tag-name + id + per-class
 + per-data-attr + per-aria + role hashes). The matcher walks the
 claim's required hashes linearly against the event's hash arrays;
 no per-tick MinHash recompute, no sliding window.
- `StreamingTokenizerOptions` replaces the hard `MaxBufferSize`
 consts on `IncrementalHtmlTokenizer` and
 `IncrementalBytePatternScanner`. Both buffers are now rented from
 `ArrayPool<byte>.Shared` and grow on demand up to the configurable
 ceiling (default 1 MiB per buffer). Both classes are `IDisposable`.
- `TagAttrLimits` replaces the per-event `TagEvent.MaxClassesPerEvent`
 (was 8) and `TagEvent.MaxAttrPairsPerEvent` (was 3). Defaults
 bumped to 32 / 16, validated up to 256 / 128 ceilings. Real pages
 no longer silently lose the tail.
- Streaming-template inducer rewrite (Task 4 of Phase 1) — emits
 `IdentityClaim`-based tripwires shared with the layout side.
- Incremental byte-pattern scanner (Task 13) replaces the alpha.21
 tripwire scanner with a faster exact-match path; tag-hash prefilter
 cuts per-scan allocs ~25-30x.
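
The exact-match walk described in the first bullet can be pictured as follows. This is a simplified sketch with hypothetical types (`TagEventSketch`, `ClaimSketch`) that carry only tag, id, and class hashes; the real `TagEvent` and `IdentityClaim` also carry data-*, aria-*, and role hashes, and their member names are not taken from the package.

```csharp
using System;

// Hypothetical, reduced shapes for illustration only.
public readonly record struct TagEventSketch(ulong TagHash, ulong IdHash, ulong[] ClassHashes);
public readonly record struct ClaimSketch(ulong TagHash, ulong IdHash, ulong[] RequiredClassHashes);

public static class ExactClaimMatchSketch
{
    // True when every hash the claim requires is present on the event.
    public static bool Matches(in ClaimSketch claim, in TagEventSketch ev)
    {
        if (claim.TagHash != ev.TagHash) return false;

        // Assumption: an IdHash of 0 on the claim means "no id required".
        if (claim.IdHash != 0 && claim.IdHash != ev.IdHash) return false;

        // Linear walk: the arrays are small (bounded by the per-event limits),
        // so a scan per required hash is cheap and allocation-free.
        foreach (var required in claim.RequiredClassHashes)
        {
            if (Array.IndexOf(ev.ClassHashes, required) < 0) return false;
        }
        return true;
    }
}
```

Because the check is a handful of integer comparisons per tag, there is no signature to recompute and no window to slide.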

Apply-time quality gate + auto-repair loop

- New `ApplicatorBrokenCheck` lifts the apply-time bug-out signal
 out of LayoutExtractor's local function into a unit-testable gate.
 Three new failure modes: noisy-MainContent (link-density >= 0.5
 inside a content block, catches the Wikipedia / mostlylucid
 language-picker leak), image-anchor picker (many short-text
 anchors, catches the route-variant strip), metadata-shape
 rejection (key:value-dominated blocks, catches the MS Learn YAML
 frontmatter leak).
- LayoutExtractor Move 3 widens the repair-enqueue gate: drops the
 "hand-authored template must exist" requirement, triggers on
 applicatorBugOut OR thin-markdown, adds Refit to the qualifying
 match-status set.
- `IsDeterministic` flag on `OperatorTemplate` distinguishes the
 heuristic inducer's deterministic YAML audit snapshots from
 hand-authored / LLM-induced templates. Deterministic snapshots no
 longer block LLM induction.
- `OperatorTemplateRule.Claims` carries the identity-claim ancestor
 chain on operator templates so the operator-template path runs on
 the identity-claim applicator instead of the CSS-string fallback.

Heuristic block classifier improvements

- Tighten-on-anchor (Move 1) — after a `<main>`/`<article>` qualifies
 as MainContent, look down one level for a div/section descendant
 with a stable identity anchor (stable id OR >= 1 stable class) that
 carries >= 80% of the wrapper's prose text and has link density
 < 0.5. When exactly one descendant qualifies, prefer it. Catches
 Wikipedia + mostlylucid leaks where the picker rides inside the
 outer semantic element.
- `<article>` semantic-tag exception in repeated-item link-density
 gate — news-listing pattern where each card is a single clickable
 `<a>` (density ~1.0) now survives the gate. The Register, Verge,
 Ars, BBC News listings render again.
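
The tighten-on-anchor rule can be sketched as a filter over the wrapper's one-level descendants. The sketch assumes link density is anchor-text characters divided by all text characters in the block; `CandidateSketch` and `TightenOnAnchorSketch` are hypothetical names, and only the thresholds (80% of the wrapper's text, density below 0.5, exactly one qualifier) come from the note above.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical summary of one div/section descendant of <main>/<article>.
public sealed record CandidateSketch(string Name, bool HasStableAnchor, int TextChars, int AnchorChars);

public static class TightenOnAnchorSketch
{
    // Returns the single qualifying descendant, or null to keep the wrapper itself.
    public static CandidateSketch? Pick(int wrapperTextChars, IEnumerable<CandidateSketch> descendants)
    {
        var qualifying = descendants.Where(d =>
            d.HasStableAnchor                          // stable id OR at least one stable class
            && wrapperTextChars > 0
            && d.TextChars >= 0.8 * wrapperTextChars   // carries >= 80% of the wrapper's text
            && LinkDensity(d) < 0.5).ToList();

        // Prefer the descendant only when the choice is unambiguous.
        return qualifying.Count == 1 ? qualifying[0] : null;
    }

    // Assumed definition: share of the block's text that sits inside <a> elements.
    private static double LinkDensity(CandidateSketch d)
        => d.TextChars <= 0 ? 1.0 : (double)d.AnchorChars / d.TextChars;
}
```

With a 1000-character wrapper, a 900-character body block with few links is preferred over the wrapper, while a 100-character language picker never qualifies.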

Template enrichment coordinator

- `InMemoryTemplateEnrichmentQueue` cooldown key changed from string
 Host to (Host, EnrichmentJobKind) tuple. A first-visit Induce no
 longer blocks a follow-up Repair on the same host.
- New `ILlmActivityObserver` interface brackets each LLM call with
 LlmCallStarted / LlmCallEnded(success). Wired through the DI
 builder so consumers can show "llm <host>..." while CPU inference
 is running (the lucidVIEW FULL status bar uses this).
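
A small sketch of why the tuple key matters. It is not the `InMemoryTemplateEnrichmentQueue` code; `CooldownSketch` and `JobKindSketch` are hypothetical stand-ins, and the caller supplies the clock so the behaviour is easy to follow.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for EnrichmentJobKind.
public enum JobKindSketch { Induce, Repair }

public sealed class CooldownSketch
{
    // Keyed by (host, kind) so each job kind cools down independently per host.
    private readonly Dictionary<(string Host, JobKindSketch Kind), DateTimeOffset> _last = new();
    private readonly TimeSpan _cooldown;

    public CooldownSketch(TimeSpan cooldown) => _cooldown = cooldown;

    // Returns false when the same (host, kind) pair was enqueued within the cooldown.
    public bool TryEnqueue(string host, JobKindSketch kind, DateTimeOffset now)
    {
        var key = (host, kind);
        if (_last.TryGetValue(key, out var previous) && now - previous < _cooldown)
            return false;

        _last[key] = now;
        return true;
    }
}
```

With a host-only key, the Repair call in the sequence Induce then Repair would be rejected; with the tuple key only a second Induce inside the window is.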

Corpus mining (Phase 2)

- `SelectorDistance` metric quantifies how similar two emitted
 selectors are for evolved-candidate ranking (Task 6).
- `CorpusMiner` query primitives (Task 7) and evolved-selector
 emission (Task 8) — proposes alternate selectors from the
 template_observations table.
- Passive evaluation of evolved candidates at apply time (Task 9):
 evolved selectors run alongside the chosen one and contribute
 observations for the next mining cycle.
- Background `CorpusMiningCoordinator` (Task 10) drains the
 template_observations table on a cadence and writes evolved
 candidates.
- `template_observations` SQLite table (Task 5 of Phase 1) feeds
 the mining and evolved-candidate paths.

Cold-path arbitrary caps (now configurable or bumped)

- `NextDataRehydrationExtractor` walker bumped from 500 strings /
 depth 12 to 5000 / depth 32 — real Next.js __NEXT_DATA__ blobs
 exceeded the old guards.
- LayoutExtractor's LLM-repair sample bumped 400 -> 2000 chars
 (more context for the LLM to see what's wrong).
- Skeleton renderer's attr-value truncation bumped 40 -> 160 chars
 (covers accessibility-conscious aria-label values).
- Streaming `IncrementalHtmlTokenizer.MaxBufferSize` (was 16 KiB
 const that threw on JSON-LD blobs) replaced by
 `StreamingTokenizerOptions.MaxPartialTagBytes` (1 MiB default).
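
For readers who want to picture the "rented buffer that grows up to a ceiling" behaviour, here is a minimal sketch built on `ArrayPool<byte>.Shared`. The class name, constructor parameters, and the choice to throw `InvalidOperationException` at the ceiling are assumptions for this example, not the tokenizer's actual code.

```csharp
using System;
using System.Buffers;

// Hypothetical illustration of a pooled, bounded, grow-on-demand byte buffer.
public sealed class PooledTailBufferSketch : IDisposable
{
    private readonly int _maxBytes;
    private byte[] _buffer;
    private int _length;

    public PooledTailBufferSketch(int initialBytes = 4096, int maxBytes = 1024 * 1024)
    {
        if (initialBytes <= 0 || maxBytes < initialBytes)
            throw new ArgumentOutOfRangeException(nameof(maxBytes));
        _maxBytes = maxBytes;
        _buffer = ArrayPool<byte>.Shared.Rent(initialBytes);
    }

    public int Length => _length;

    public void Append(ReadOnlySpan<byte> bytes)
    {
        long needed = (long)_length + bytes.Length;
        if (needed > _maxBytes)
            throw new InvalidOperationException($"Buffered bytes would exceed the {_maxBytes}-byte ceiling.");

        if (needed > _buffer.Length)
        {
            // Grow by doubling, but never request more than the ceiling.
            long target = Math.Max(needed, (long)_buffer.Length * 2);
            var bigger = ArrayPool<byte>.Shared.Rent((int)Math.Min(target, _maxBytes));
            _buffer.AsSpan(0, _length).CopyTo(bigger);
            ArrayPool<byte>.Shared.Return(_buffer);
            _buffer = bigger;
        }

        bytes.CopyTo(_buffer.AsSpan(_length));
        _length = (int)needed;
    }

    public void Clear() => _length = 0;

    // Hands the rented array back to the pool; safe to call more than once.
    public void Dispose()
    {
        if (_buffer.Length == 0) return;
        ArrayPool<byte>.Shared.Return(_buffer);
        _buffer = Array.Empty<byte>();
    }
}
```

Wrapping an instance in `using` returns the rented array to the pool, which is the reason a type holding such a buffer needs to be `IDisposable`.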

Breaking changes you need to know about
---------------------------------------

- `IncrementalHtmlTokenizer.MaxBufferSize` and
 `IncrementalBytePatternScanner.MaxBufferSize` public consts removed.
 Replaced by per-instance configuration via
 `StreamingTokenizerOptions`. The instances are now `IDisposable`;
 long-lived consumers should wrap in `using`.
- `TagEvent.MaxClassesPerEvent` and `TagEvent.MaxAttrPairsPerEvent`
 internal consts removed. Caps thread through `TagAttrLimits`,
 configured from `StreamingTokenizerOptions`. Defaults bumped
 (8 -> 32 / 3 -> 16) so existing code that didn't override the cap
 sees the same or wider coverage.
- `TagAttributeParser.ExtractIdentityHashes` now takes a
 `TagAttrLimits` parameter before the `out` arguments. Update
 callers; pass `TagAttrLimits.Default` to keep the new defaults.
- `MinimalHtmlTokenizer` has a new `(input, filter, attrLimits)`
 constructor; the existing two-arg constructor delegates to
 `TagAttrLimits.Default`.
- `OperatorTemplate` gained `IsDeterministic` (bool) and
 `OperatorTemplateRule` gained `Claims`
 (`IReadOnlyList<IdentityClaim>?`). Both are init-only; existing
 call sites compile, but the YAML round-trip writes the new fields
 and the loader sets `IsDeterministic` from the file name.
- Layout extractor's CSS-string applicator path is gone. Templates
 emitted before alpha.22 that depend on string-based selectors
 rebuild through the identity-claim path on first visit.
- `StreamingTemplate` lost its MinHash signature shape — templates
 persisted from alpha.16-alpha.20 re-induce on first visit (the
 store's PRAGMA user_version gate drops stale rows).
- Streaming `RollingSketch` / `TagAllowlistBloom` types removed
 (alpha.21 deprecated the latter; alpha.24 dropped both with the
 byte-pattern matcher).
- `InMemoryTemplateEnrichmentQueue._lastEnqueuedByHost` (private)
 changed shape; only matters if you reflected against it.

Tests: 850 across 12 projects, all green.

Migration: most consumers don't need to change anything. The two
patterns that DO need a change are (a) anyone who passed
`IncrementalHtmlTokenizer.MaxBufferSize` to size their own buffer
(use `tok.MaxPartialTagBytes` instead) and (b) anyone who called
`TagAttributeParser.ExtractIdentityHashes` directly (add
`TagAttrLimits.Default` as the second argument).


StyloExtract 1.8.0-alpha.21 - 2026-06-27
=========================================

Streaming: scope fixes (no algorithm replacement)
--------------------------------------------------

Tightens the alpha.19 streaming scanner without replacing the MinHash
matcher. The algorithm shape (MinHash + LSH bands + three fences per
template) is unchanged; what changes is its scope:

1. IncrementalHtmlTokenizer.Feed no longer copies the whole chunk into
  _buffer. Chunks are parsed inline; only the partial-tag tail (if a
  tag straddles a chunk boundary) is retained for stitching with the
  next chunk. PeakBufferedBytes is now bounded by O(longest tag), not
  O(chunk size). Measured: peak = 0 B for a 200 KB body in 16 KB
  chunks, 19 B in 1 KB chunks. MaxBufferSize lowered from 64 KiB to
  4 KiB.

2. RollingSketch shingles upgraded to Markov bigrams: each shingle is
  (prevTagHash, currentTagHash, currentClassHash). Order-sensitive:
  [A, B] and [B, A] now produce different signatures. The leftmost
  shingle in any window uses prevTag = 0 so sliding-window scanners
  match fences built from contiguous event sequences regardless of
  what came before the window.

3. Static StructuralTagAllowlist replaces per-fence TagAllowlistBloom.
  Only structural tags (html/body/header/nav/main/article/section/
  div/p/h1-h6/ul/ol/li/table/...) push into the sketch. meta/link/
  script-chrome/img/span/a bypass the recompute entirely. The
  TagAllowlistBloom JSON property is retained as a back-compat sink
  (read-and-discarded) so persisted templates from alpha.16-alpha.20
  round-trip cleanly.

4. Depth-aware capture-end: while in Capturing, ContentEnd only matches
  when DOM depth has returned to (or below) the depth at ContentStart.
  Nested matches mid-content can no longer terminate capture early.

5. Dead StreamingTemplate.MinContentDepth field removed (never read by
  any scanner).

6. FenceScanner and IncrementalFenceScanner now share a single static
  StreamingTick.Step. Both scanners build a StreamingTickState from
  their respective storage (span-backed vs heap-backed) and execute
  literally the same code. Cross-validation tests retained as insurance.

7. IStreamingTemplateStore gains version-chain APIs:
  - GetByHostAtVersionAsync(host, version) — retrieve a specific version.
  - ListVersionsByHostAsync(host) — enumerate all known versions.
  UpsertAsync now APPENDS per (host, version) rather than replacing.
  SQLite store schema migrated to PK (host, version); existing rows
  auto-migrate to version 1 on first open.
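
To make item 2 concrete, here is a sketch of an order-sensitive bigram shingle. The mixing function is an FNV-1a-style combine over whole 64-bit words, chosen for this example only; the hash RollingSketch actually uses is not stated in these notes, and `BigramShingleSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of (prevTagHash, currentTagHash, currentClassHash) shingles.
public static class BigramShingleSketch
{
    private const ulong OffsetBasis = 14695981039346656037UL; // FNV-1a 64-bit offset basis
    private const ulong Prime = 1099511628211UL;              // FNV-1a 64-bit prime

    // Mixes the three hashes into one shingle; the predecessor changes the result.
    public static ulong Shingle(ulong prevTagHash, ulong tagHash, ulong classHash)
    {
        unchecked
        {
            ulong h = OffsetBasis;
            h = (h ^ prevTagHash) * Prime;
            h = (h ^ tagHash) * Prime;
            h = (h ^ classHash) * Prime;
            return h;
        }
    }

    // Builds the shingles for one window of (tag, class) events.
    public static ulong[] ShinglesForWindow(IReadOnlyList<(ulong Tag, ulong Class)> window)
    {
        var result = new ulong[window.Count];
        ulong prev = 0; // leftmost shingle ignores whatever preceded the window
        for (int i = 0; i < window.Count; i++)
        {
            result[i] = Shingle(prev, window[i].Tag, window[i].Class);
            prev = window[i].Tag;
        }
        return result;
    }
}
```

An event B produces one shingle when it follows A and a different one when it opens the window, which is what makes the sequence [A, B] distinguishable from [B, A].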

Migration notes:
- Persisted SQLite templates from alpha.16-alpha.20 auto-migrate to
 version 1 on first open; existing rows are preserved.
- TemplateFence(uint[], ulong[], ulong, int) constructor removed; the
 new shape is TemplateFence(uint[], ulong[], int). TagAllowlistBloom
 is still readable as a property (returns 0).
- StreamingTemplate.MinContentDepth removed — drop from any code that
 set it in `with` expressions.
- RollingSketch.Push signature changed to Push(prevTagHash, tagHash,
 classHash) — direct users must track prev tag.

StyloExtract 1.8.0-alpha.19 - 2026-06-26
=========================================

Streaming: sliding-window design (no full-buffer retention)
------------------------------------------------------------

Refactors alpha.18's IncrementalHtmlTokenizer + IncrementalFenceScanner
to a TRUE sliding-window streaming design:

1. Bytes: only partial-tag bytes are retained. Once a tag is emitted,
  the bytes are dropped immediately (compact-on-emit, not compact-on-
  next-Feed). New PeakBufferedBytes property exposes the high-watermark
  for telemetry. Worst-case in-flight buffer is O(longest tag), not
  O(megabytes). MaxBufferSize lowered from 1 MiB to 64 KiB and
  repositioned as a hard safety stop that should never be hit under
  correct input — exceeding it now means a single tag (or unclosed
  script/style body) genuinely exceeds 64 KiB and the scan must bail.

2. Events: fixed-size sliding window of the last WindowSize tag events
  (unchanged from alpha.18). Push new, pop oldest. The window is the
  only event-level state.

3. RollingSketch: documented (in IncrementalFenceScanner XML doc) that
  MinHash with min-pooling is NOT reversibly rollable: once an element
  leaves the window, its contribution to min(...) can't be subtracted.
  The sketch therefore rebuilds the signature from the current event
  window after each accepted tag (O(WindowSize × SignatureSize) per
  tick; the Bloom allowlist filter skips the vast majority of inbound
  tags before they reach the sketch). The bounded-buffer property, the
  headline concern, is satisfied by the tokenizer; the sketch's
  per-tick recompute is the price MinHash charges for the LSH-band
  locality the matcher relies on. Event-level memory remains
  O(WindowSize) regardless.

4. IncrementalFenceScanner now exposes PeakBufferedBytes and BytesConsumed
  passthroughs from the tokenizer so callers can prove the bounded-memory
  property to telemetry without reaching into the tokenizer directly.
  The duplicated tick logic (mirroring FenceScanner.Tick over heap-backed
  sketch state) is retained — it's hard-pinned to the ref-struct path by
  the existing cross-validation tests, which give us higher confidence
  than refactoring to delegate would.

Memory-cap proof: tests/StreamingMemoryBoundTests.cs feeds 5 MiB of
synthetic HTML in 4 KiB chunks and asserts PeakBufferedBytes stays
under 16 KiB. The streaming gateway can now scan multi-megabyte
responses while holding bounded memory.
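
The retain-only-the-partial-tag idea can be sketched in a few lines. This is a deliberately naive tokenizer, assuming tags are delimited by plain '<' and '>' bytes (no handling of quoted '>', comments, or script/style bodies); `PartialTagTokenizerSketch` is a hypothetical class, not IncrementalHtmlTokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration: only bytes of an unfinished tag survive between Feed calls.
public sealed class PartialTagTokenizerSketch
{
    private readonly List<byte> _tail = new();   // bytes of a tag still waiting for '>'

    public int PeakBufferedBytes { get; private set; }
    public List<string> Tags { get; } = new();

    public void Feed(ReadOnlySpan<byte> chunk)
    {
        int i = 0;

        // Finish a tag that straddled the previous chunk boundary.
        if (_tail.Count > 0)
        {
            int close = chunk.IndexOf((byte)'>');
            if (close < 0) { Retain(chunk); return; }
            Retain(chunk[..(close + 1)]);
            Tags.Add(Encoding.UTF8.GetString(_tail.ToArray()));
            _tail.Clear();                        // compact on emit
            i = close + 1;
        }

        while (i < chunk.Length)
        {
            int open = chunk[i..].IndexOf((byte)'<');
            if (open < 0) return;                 // only text left, nothing to keep
            open += i;

            int close = chunk[open..].IndexOf((byte)'>');
            if (close < 0) { Retain(chunk[open..]); return; }
            close += open;

            // Complete tags are parsed inline from the chunk and never buffered.
            Tags.Add(Encoding.UTF8.GetString(chunk[open..(close + 1)]));
            i = close + 1;
        }
    }

    private void Retain(ReadOnlySpan<byte> bytes)
    {
        foreach (var b in bytes) _tail.Add(b);
        PeakBufferedBytes = Math.Max(PeakBufferedBytes, _tail.Count);
    }
}
```

When every tag fits inside a chunk the peak stays at zero; when a tag straddles a boundary the peak is the length of that tag, not of the chunk.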

Migration: API is unchanged from alpha.18 — refactor is internal. The
new PeakBufferedBytes and BytesConsumed diagnostic properties on
IncrementalFenceScanner are additive. MaxBufferSize is still public but
the new value is 64 KiB (was 1 MiB); only relevant if you were catching
the InvalidOperationException for pathological input.

StyloExtract 1.8.0-alpha.18 - 2026-06-26
=========================================

Streaming: true chunked tokenization + refit/versioning + bench update
-----------------------------------------------------------------------

1. IncrementalHtmlTokenizer + IncrementalFenceScanner
  Stateful tokenizer that survives chunk boundaries. A partial tag at
  the end of one chunk is held in an internal buffer and completed when
  the next Feed call arrives. Pairs with IncrementalFenceScanner —
  callers Feed chunks as they arrive from the network, get a verdict
  per chunk, bail early on Captured / Bailout.

  Trade-off vs MinimalHtmlTokenizer's span path: one buffer allocation
  per request (not per chunk). Use the span path for whole-buffer
  scans, the incremental path for streaming gateways where bytes
  arrive in chunks. Hard cap of 1 MiB on the internal buffer — feed
  throws InvalidOperationException on pathological input that never
  closes a tag, surfacing the failure rather than silently dropping
  bytes.

  Architectural note: FenceScanner stays a ref struct (zero-alloc hot
  path); IncrementalFenceScanner is a heap-backed class that ports the
  same tick logic. The two are kept in lockstep — any drift between
  them is a correctness bug surface and is covered by cross-validation
  tests that feed the same bytes both ways.

2. Streaming-template refit + versioning
  StreamingTemplate gains a Version field (defaults to 1; persists
  across alpha.17 templates without migration). New
  StreamingRefitOrchestrator observes captured-scan output per host
  and schedules a refit off the hot path when either:
    - capture-region EWMA drift exceeds 30% on N consecutive scans, OR
    - the re-induction run on every 10th captured scan finds different
      fences
  On refit: version bumps, store is upserted, the new
  IStreamingTemplateVersionSink fires a StreamingTemplateRefitEvent
  (Host, Old/New TemplateId, Old/New Version, Reason, DetectedAt).
  Default sink is a no-op; consumers wire UI telemetry to it.

3. Bench update
  ExtractionComparisonBench gains a New_StreamingScanByHost variant so
  the host-keyed hot-path is benchmarked alongside the original
  GUID-keyed scan. Pre-populates the in-memory store with the
  host="www.mostlylucid.net" template that lucidview FULL hits in
  production.
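
The EWMA drift trigger from item 2 could look roughly like the sketch below. It assumes drift is the relative distance between the latest capture-region length and the running EWMA, with a smoothing factor of 0.2 and N = 3; those two values, the use of capture length as the tracked quantity, and the `EwmaDriftSketch` name are assumptions. Only the 30% threshold and the "N consecutive scans" rule come from the notes.

```csharp
using System;

// Hypothetical illustration of an EWMA drift detector; not StreamingRefitOrchestrator.
public sealed class EwmaDriftSketch
{
    private readonly double _alpha;
    private readonly double _threshold;
    private readonly int _requiredConsecutive;
    private double? _ewma;
    private int _consecutive;

    public EwmaDriftSketch(double alpha = 0.2, double threshold = 0.30, int requiredConsecutive = 3)
    {
        _alpha = alpha;
        _threshold = threshold;
        _requiredConsecutive = requiredConsecutive;
    }

    // Returns true when a refit should be scheduled.
    public bool Observe(double captureLength)
    {
        if (_ewma is null) { _ewma = captureLength; return false; } // first scan seeds the baseline

        double baseline = _ewma.Value;
        double drift = baseline == 0
            ? (captureLength == 0 ? 0 : 1)
            : Math.Abs(captureLength - baseline) / baseline;

        // A single in-range scan resets the streak.
        _consecutive = drift > _threshold ? _consecutive + 1 : 0;
        _ewma = _alpha * captureLength + (1 - _alpha) * baseline;

        if (_consecutive < _requiredConsecutive) return false;
        _consecutive = 0;
        return true;
    }
}
```

Requiring several consecutive out-of-range scans keeps one unusually long or short page from forcing a refit.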

Migration: additive APIs. Alpha.17 consumers using ScanByHost continue
to work; the incremental tokenizer and the refit orchestrator are
opt-in (use them when feeding chunks / when wiring drift telemetry).

StyloExtract 1.8.0-alpha.17 - 2026-06-26
=========================================

Streaming: host-keyed templates + naive auto-induction
-------------------------------------------------------

Three changes to close the alpha.16 streaming integration loop:

1. Host-keyed lookup
  IStreamingTemplateStore gains GetByHostAsync / TryGetHotByHost /
  UpsertAsync. StreamingTemplate gains a Host field (required). One
  template per host (latest wins). The existing GUID-keyed methods
  remain — Host is the lookup key for consumers; TemplateId stays for
  stable identity / versioning.

2. StreamingPathSelector.ScanByHost(host, bytes)
  Synchronous hot-cache-only host scan. Returns NoTemplate on miss
  so the caller can WarmByHostAsync + retry or induce.
  WarmByHostAsync brings a host's template into the hot cache via
  the durable tier.

3. StreamingTemplateInducer
  Naive first-pass inducer: walks HTML via MinimalHtmlTokenizer,
  finds semantic-marker tag-sequence-pairs (<header>...</header>,
  <p>...</p>...<p>...</p>, <footer>/</main>/</body>) and produces a
  StreamingTemplate ready to upsert. Returns null on pages with no
  identifiable structural fences (plain text, image-only, etc.).
  Describe() returns a human-readable summary of the chosen markers
  for logging.

Storage migrations:
- InMemoryStreamingTemplateStore: adds an in-memory host index.
- SqliteStreamingTemplateStore: adds a 'host' TEXT column + index;
 on-open ALTER TABLE migration handles pre-alpha.17 schemas
 (existing rows get Host="" — reachable only by GUID).

Migration: additive APIs; alpha.16 consumers using only the
GUID-keyed surface continue to work unchanged. The new Host field
on StreamingTemplate IS required — existing construction sites must
set Host="" if they have no host context.

StyloExtract 1.8.0-alpha.16 - 2026-06-26
=========================================

Mostlylucid.StyloExtract.Streaming — zero-allocation byte-stream scanner
------------------------------------------------------------------------

New package on NuGet. Hot-path streaming fence scanner: skips page chrome
and captures the content region as response bytes flow past, using
MinHash-derived structural fences. Zero per-request GC-tracked
allocations in steady state.

Designed for the gateway position — drop into a response pipeline
(HttpClient, Stylobot's edge, ASP.NET output filters) alongside the byte
stream and emit a verdict without buffering the full page.

Public hot-path API:
 StreamingPathSelector.Scan(Guid templateId, ReadOnlySpan<byte> html)
   → ScanVerdict { Continue | Captured | Bailout | NoTemplate }

 // Warm a template into the hot cache:
 await selector.WarmAsync(templateId);

Storage:
 - InMemoryStreamingTemplateStore — single-process LRU.
 - SqliteStreamingTemplateStore — durable; same SQLite file pattern as
   the existing ITemplateIndex but a separate table.

Pairs with the existing StyloExtract.Fingerprint learn path and
ITemplateIndex template store. The streaming template format is its own
shape (TemplateFence with MinHash bloom, content-start/content-end
fences) — not an LLM template or operator template.

Bench results vs LayoutExtractor on mostlylucid fixtures: see
bench/StyloExtract.Streaming.Benchmarks/ (zero-alloc scan competitive
with the full extractor's path-match cost while never building a DOM).

Migration: additive package; consumers add a PackageReference to
Mostlylucid.StyloExtract.Streaming if they want gateway-position
scanning.

StyloExtract 1.8.0-alpha.15 - 2026-06-26
=========================================

RenderOptions.WaitUntil — opt out of NetworkIdle for SPA routing
-----------------------------------------------------------------

PlaywrightHtmlFetcher previously hardcoded WaitUntilState.NetworkIdle
for the primary GotoAsync. On sites with aggressive client-side
routing (BBC News auto-navigates /news → /articles/<id> in the
post-load JS phase), this meant the fetcher returned the post-routing
DOM, not the page the user requested.

RenderOptions now exposes a WaitUntil property (PlaywrightWaitUntil
enum: Load / DOMContentLoaded / NetworkIdle / Commit). Default stays
NetworkIdle for backwards compatibility. Consumers fetching SPA-heavy
sites should set Load to capture the initial DOM before the router
fires.

The secondary WaitForLoadStateAsync(NetworkIdle, ...) drain remains —
it's independently bounded by WaitForNetworkIdleTimeout and serves as
a best-effort late-XHR catch-up; safe even with the primary returning
on Load.

PlaywrightWaitUntil is a small enum of its own (rather than a direct
use of Microsoft.Playwright.WaitUntilState) so consumers don't take a
transitive dependency on Microsoft.Playwright just to pick a strategy.

StyloExtract 1.8.0-alpha.14 - 2026-06-26
=========================================

Sitemap CLI end-to-end regression + LLM nav few-shot
-----------------------------------------------------

1. Sitemap CLI test suite

The alpha.11 stylo-extract sitemap verb has been working on real sites
since alpha.13 (heuristic nav-classification tightening), but nothing
caught regressions. Added 5 end-to-end tests in
StyloExtract.Core.Tests/SitemapCommandTests.cs that invoke the
SitemapCommand.CrawlAsync handler against the mostlylucid-home.html.gz
fixture (real captured homepage, shared with the heuristics suite) plus
a stub HttpMessageHandler and assert: real nav links emitted under
# www.mostlylucid.net, --max-depth 0 emits only the seed Title row,
off-host links are not followed, --max-pages cap honoured exactly, and
--delay-ms enforced with a stopwatch floor. No network access required.

2. LLM induction prompt — nav-classification few-shot

LlmInducerPrompts.System and SystemRepair now include a second worked
example: a blog homepage with header <nav>, breadcrumb,
MainContent + RepeatedItem post cards, and footer <nav>. Mirrors the
patterns the alpha.13 NavPreDetector heuristic correctly classifies.
Rule 6 (RepeatedItem usage) tightened with explicit guidance that
header/footer nav lists are PrimaryNavigation / SecondaryNavigation at
the parent <ul>/<nav> level, NOT RepeatedItem at the <li> level —
closes a known LLM confusion mode.

Tests: snapshot tests in StyloExtract.Core.Tests/LlmInducerPromptsTests.cs
verify the prompt extensions land verbatim so future prompt edits don't
accidentally regress.

StyloExtract 1.8.0-alpha.13 - 2026-06-26
=========================================

Heuristic nav-classification tightening
----------------------------------------

HeuristicBlockClassifier was under-classifying real-world nav patterns
on server-rendered sites — header <nav> strips, header <ul>-of-links,
breadcrumb lists, role="navigation" attributes, footer nav — all landed
as Boilerplate (or weren't extracted at all). Result: the alpha.11
Sitemap profile and stylo-extract sitemap CLI verb produced a one-line
tree even on sites with rich nav, because the classifier didn't surface
PrimaryNavigation / SecondaryNavigation / Breadcrumb roles for them.

Tightened patterns now produce definite role classifications:
 1. <header> <nav> -> PrimaryNavigation (0.9)
 2. Top-of-document <nav> -> PrimaryNavigation (0.85)
 3. <footer> <nav> -> SecondaryNavigation (0.9)
 4. <nav aria-label="breadcrumb"> / class~="breadcrumb" -> Breadcrumb (0.95)
 5. <* role="navigation"> -> PrimaryNavigation (0.95)
 6. Header <ul> of mostly-link <li>s -> PrimaryNavigation (0.85) at
    the <ul> level, suppress descent (was emitting deep Boilerplate)
 7. Footer <ul> of mostly-link <li>s -> SecondaryNavigation (0.85)

Implementation: a new NavPreDetector runs after per-element classification
and injects each detected nav container as a high-score (50000) candidate
at the parent level, then demotes any descendant candidates so greedy
selection picks the nav parent and stops descending into its noise.
Containers nested inside <main>/<article> are skipped — IntraBlockCleaner
already strips them as intra-block contaminants; hoisting would steal
the article's selection win.

Regression fixtures captured from mostlylucid.net + wikipedia.org under
tests/StyloExtract.Heuristics.Tests/Fixtures so the next time a
classifier change regresses real-world nav detection, the bench catches
it before it ships.

Downstream impact: the Sitemap ExtractionProfile and stylo-extract
sitemap CLI verb now produce real nav trees on these sites - see the
lucidview FULL dogfood smoke for evidence.

StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================

DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------

alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.

Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.

StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================

Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.

Title BlockRole
---------------

New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, picking the H1 in/closest-to
<main>/<article> and falling back to earliest-in-document with multiple
H1s. ExtractorApplicator surfaces Title on the fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts list Title in
the allowed-roles set with a one-line distinction from Heading.

MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer quality gate (drop short
text) is bypassed for Title and Heading so intentionally-terse page
titles ("Home", "About") still surface.

Sitemap ExtractionProfile
-------------------------

New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. For sitemap /
outline / crawler use cases that want page titles and the site's nav
structure without pulling body content. The CLI's --profile flag
recognises `sitemap` automatically (enum binding).

Deterministic YAML persistence
------------------------------

New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.

stylo-extract sitemap CLI verb
------------------------------

New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.
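
The crawl limits can be pictured as a bounded breadth-first walk. The sketch below takes link discovery as a delegate and leaves out fetching, extraction, and the --delay-ms pause; `BoundedCrawlSketch` and its parameters are hypothetical and only mirror the defaults quoted above (depth 3, 50 pages, same host only).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the depth, page, and host caps; not SitemapCommand.
public static class BoundedCrawlSketch
{
    public static List<(Uri Url, int Depth)> Crawl(
        Uri seed,
        Func<Uri, IEnumerable<Uri>> getNavLinks,
        int maxDepth = 3,
        int maxPages = 50)
    {
        var visited = new HashSet<Uri> { seed };
        var queue = new Queue<(Uri Url, int Depth)>();
        var pages = new List<(Uri Url, int Depth)>();
        queue.Enqueue((seed, 0));

        while (queue.Count > 0 && pages.Count < maxPages)
        {
            var (url, depth) = queue.Dequeue();
            pages.Add((url, depth));
            if (depth >= maxDepth) continue;      // do not follow links past the depth cap

            foreach (var link in getNavLinks(url))
            {
                // No off-host follow: only links on the seed's host are queued.
                bool sameHost = string.Equals(link.Host, seed.Host, StringComparison.OrdinalIgnoreCase);
                if (sameHost && visited.Add(link))
                    queue.Enqueue((link, depth + 1));
            }
        }
        return pages;
    }
}
```

With maxDepth set to 0 the result contains only the seed, which matches the behaviour the sitemap regression tests assert for --max-depth 0.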

Migration
---------

No source change required for consumers. The new Title role is
additive (existing switches that handled BlockRole pattern-match
defaults will continue to compile and behave identically; switches
that exhaustively listed roles were updated). Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.


StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================

LLM classification accuracy for chrome patterns
------------------------------------------------

Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.

Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / fac

[truncated — see RELEASE_NOTES.txt packaged at root for full history]