PageProbe 2.0.0

PageProbe

PageProbe is a .NET-based web crawling library designed to monitor and extract content from statically generated websites. It enables developers, IT students, and enthusiasts to gather and export data such as links, media, metadata, multimedia, text, or even price information. The library supports both real-time and scheduled crawling, with the ability to store results and compare differences over time.


Features

  • Modular and extensible architecture using interfaces and models
  • Extract links, images, videos, metadata, and plain text from webpages
  • Specialized price crawler with regex-based rule support
  • Specialized article crawler for extracting headings, publish dates, authors, and more
  • Multipage crawling with depth limiter and robots.txt awareness
  • Export to multiple formats (JSON, CSV, XML, TXT, Markdown)
  • Snapshot comparisons for detecting changes
  • Console and file-based output
  • Built-in logging support for debugging and monitoring
  • Asynchronous support for scheduled, long-running crawls

📦 Installation Methods

1. NuGet Installation (via Visual Studio or command line)

Install the package with the Package Manager Console:

Install-Package PageProbe

2. .NET CLI Installation (Cross platform, works in any terminal)

Install the package with the .NET CLI:

dotnet add package PageProbe

3. Manual install in project file (.csproj)

Add the following to your project file, replacing the version with the one you want:

<ItemGroup>
  <PackageReference Include="PageProbe" Version="2.0.0" />
</ItemGroup>

4. Clone the Repository

Clone the repo and add it to your solution if you want to customize or extend the library:

git clone https://dev.azure.com/emilberglund/_git/rammeverk_gruppe2
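After cloning, you can wire the library project into your own solution with the .NET CLI. The project path below (`PageProbe/PageProbe.csproj`) and the consuming project name (`YourApp`) are assumptions for illustration; adjust them to match the actual repository layout:

```shell
# Add the cloned library project to your solution
dotnet sln add rammeverk_gruppe2/PageProbe/PageProbe.csproj

# Reference it from your own project
dotnet add YourApp/YourApp.csproj reference rammeverk_gruppe2/PageProbe/PageProbe.csproj
```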

🚀 Start using PageProbe

Reference PageProbe in your project with the namespace:

using PageProbe;

Project Structure

Main Components

🕷 Crawlers
  • BaseCrawler: Core logic for HTML parsing and extraction
  • ArticleCrawler: Extends BaseCrawler, extracts headlines, subheadings, publish date, and author from article pages
  • PriceCrawler: Extends BaseCrawler, specialized for extracting price data using defined regex rules
  • AdvancedCrawler: Extends BaseCrawler, supports depth-limited, robots.txt-aware, multi-page crawling with logging
🧩 Interfaces
  • ILinkCrawler: Link extraction
  • IMediaCrawler: Image/media extraction
  • IMetadataCrawler: Meta tags (title, description, keywords)
  • IMultimediaCrawler: Video/audio/iframe extraction
  • ITextCrawler: Plain text extraction
  • IArticleCrawler: Article-specific extraction (headline, subheadings, etc.)
  • IExporter, IExportDifferences: Export results and diffs
📦 Models
  • CrawlResult: Snapshot of crawled content (URL, title, description, keywords, links, images, multimedia, text)
  • CrawlDifferences: Differences between crawl snapshots
  • PriceExtractionRule: Regex-based rule for price detection
📁 File Handlers (Exporters)
  • JsonExporter, CsvExporter, MarkdownExporter, TextExporter, XmlExporter: Export to corresponding file formats
🛠 Utilities & Exceptions
  • RobotsTxtHandler: Fetches and parses robots.txt, enforces crawling rules
  • ContentNotFoundException, DynamicContentException, InvalidUrlException: Custom exceptions for robust error handling
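The custom exceptions above allow a crawl to fail gracefully when a target is missing, dynamic, or malformed. A minimal sketch, assuming `GetText` can throw the listed exceptions (which methods throw which exception is not documented here, so treat this as illustrative):

```csharp
using System;
using PageProbe;

public static class SafeCrawl
{
    // Returns the page text, or an empty string if the crawl fails in a known way.
    public static string TryGetText(BaseCrawler crawler, string url)
    {
        try
        {
            return crawler.GetText(url);
        }
        catch (InvalidUrlException ex)
        {
            Console.Error.WriteLine($"Bad URL '{url}': {ex.Message}");
        }
        catch (ContentNotFoundException ex)
        {
            Console.Error.WriteLine($"No content found at '{url}': {ex.Message}");
        }
        catch (DynamicContentException ex)
        {
            // PageProbe targets statically generated sites; dynamic pages are out of scope.
            Console.Error.WriteLine($"Dynamic content at '{url}': {ex.Message}");
        }
        return string.Empty;
    }
}
```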

💡 Usage Examples

🌐 Example 1: BaseCrawler with export

public static void Main(string[] args)
{
    var crawler = new BaseCrawler();                        // Initialize the crawler with your desired settings
    var url = "https://example.com";                        // Replace with the URL you want to crawl
    var exporter = new JsonExporter();                      // Replace with your desired exporter
    var crawlResult = new CrawlResult()
    {
        Url = url,
        Title = crawler.GetMetaTitle(url),
        Description = crawler.GetMetaDescription(url),
        Keywords = crawler.GetMetaKeywords(url).Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList(),
        Links = crawler.GetLinks(url),
        Images = crawler.GetImages(url),
        Multimedia = crawler.GetMultimedia(url),
        Text = crawler.GetText(url),
    };
    exporter.Export(data: crawlResult, fileName: "crawl_example");
}

📰 Example 2: ArticleCrawler

public static void Main(string[] args)
{
    var articleCrawler = new ArticleCrawler();
    var url = "https://example.com/article";

    var headlines = articleCrawler.ExtractHeadLine(url);
    var subHeadings = articleCrawler.ExtractSubheadings(url);
    var publishDate = articleCrawler.ExtractPublishDate(url);
    var author = articleCrawler.ExtractAuthor(url);
}

💰 Example 3: PriceCrawler

public static void Main(string[] args)
{
    var priceCrawler = new PriceCrawler();

    var url = "https://example.com/product-page";

    priceCrawler.ExtractProductTitle(url: url).ToList().ForEach(Console.WriteLine);
    priceCrawler.ExtractMainPrice(url: url).ToList().ForEach(Console.WriteLine);
}

🌐 Example 4: AdvancedCrawler with Export

public static void Main(string[] args)
{
    var advancedCrawler = new AdvancedCrawler();    // Initialize the crawler with your desired settings
    var exporter = new XmlExporter();               // Replace with desired exporter
    var url = "https://example.com";                // Replace with the URL you want to crawl
    
    var results = advancedCrawler.CrawlSite(startUrl: url, maxDepth: 1, respectRobotsTxt: true, maxUrls: 10);
    results.ForEach(result => exporter.Export(result, $"crawl_results_{results.IndexOf(result)}"));
}
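The snapshot-comparison feature can build on the examples above: take two `CrawlResult` snapshots of the same URL at different times, then produce a `CrawlDifferences` and export it via an `IExportDifferences` implementation. The exact method names for computing and exporting differences are not shown on this page, so the sketch below only demonstrates building the two snapshots; consult the full documentation for the comparison API:

```csharp
using System;
using PageProbe;

public static class SnapshotExample
{
    public static void Main(string[] args)
    {
        var crawler = new BaseCrawler();
        var url = "https://example.com";    // Replace with the URL you want to monitor

        // First snapshot (or load a previously exported one instead)
        CrawlResult before = TakeSnapshot(crawler, url);

        // ... later, after the site may have changed ...
        CrawlResult after = TakeSnapshot(crawler, url);

        // Feed 'before' and 'after' into the library's CrawlDifferences /
        // IExportDifferences machinery to detect and export the changes.
    }

    private static CrawlResult TakeSnapshot(BaseCrawler crawler, string url) =>
        new CrawlResult
        {
            Url = url,
            Title = crawler.GetMetaTitle(url),
            Links = crawler.GetLinks(url),
            Text = crawler.GetText(url),
        };
}
```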

Full Documentation

See the full documentation for architecture, the API reference, and advanced usage.


Build & Test

Build the Solution

  1. Open in Visual Studio
  2. Press Ctrl + Shift + B

Run Tests

dotnet test

Contributing

As this is an internal project:

  • Use feature/* branches
  • Follow .NET conventions
  • Write meaningful commit messages

License

This project is published on NuGet and distributed publicly.

Target frameworks

.NET net8.0 is compatible. net9.0 and net10.0, along with the platform-specific variants of net8.0 through net10.0 (android, browser, ios, maccatalyst, macos, tvos, windows), were computed as compatible.


| Version | Downloads | Last updated |
|---------|-----------|--------------|
| 2.0.0   | 47        | 5/27/2025    |
| 1.0.9   | 50        | 5/25/2025    |
| 1.0.8   | 54        | 5/25/2025    |
| 1.0.7   | 53        | 5/25/2025    |
| 1.0.1   | 61        | 5/23/2025    |
| 1.0.0 (deprecated) | 137 | 5/23/2025 |