PageProbe 2.0.0

PageProbe

PageProbe is a .NET-based web crawling library designed to monitor and extract content from statically generated websites. It enables developers, IT students, and enthusiasts to gather and export data such as links, media, metadata, multimedia, text, or even price information. The library supports both real-time and scheduled crawling, with the ability to store results and compare differences over time.


Features

  • Modular and extensible architecture using interfaces and models
  • Extract links, images, videos, metadata, and plain text from webpages
  • Specialized price crawler with regex-based rule support
  • Specialized article crawler for extracting headings, publish dates, authors, and more
  • Multipage crawling with depth limiter and robots.txt awareness
  • Export to multiple formats (JSON, CSV, XML, TXT, Markdown)
  • Snapshot comparisons for detecting changes
  • Console and file-based output
  • Built-in logging support for debugging and monitoring
  • Asynchronous support for scheduled, long-running crawls

📦 Installation Methods

1. NuGet Installation (via Visual Studio or command line)

Install the package with the Package Manager Console:

Install-Package PageProbe

2. .NET CLI Installation (Cross platform, works in any terminal)

Install the package with the .NET CLI:

dotnet add package PageProbe

3. Manual install in project file (.csproj)

Add the following to your project file, replacing the version with the one you want:

<ItemGroup>
  <PackageReference Include="PageProbe" Version="2.0.0" />
</ItemGroup>

4. Clone the Repository

Clone the repo and add it to your solution if you want to customize or extend the library:

git clone https://dev.azure.com/emilberglund/_git/rammeverk_gruppe2
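After cloning, you can wire the library project into your own solution with the .NET CLI. The project path below (`PageProbe/PageProbe.csproj`) and the consuming project name (`YourApp`) are assumptions for illustration; adjust them to match the actual repository layout:

```shell
# Add the cloned library project to your solution
dotnet sln add rammeverk_gruppe2/PageProbe/PageProbe.csproj

# Reference it from your own project
dotnet add YourApp/YourApp.csproj reference rammeverk_gruppe2/PageProbe/PageProbe.csproj
```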

🚀 Start using PageProbe

Reference PageProbe in your project with the namespace:

using PageProbe;

Project Structure

Main Components

🕷 Crawlers
  • BaseCrawler: Core logic for HTML parsing and extraction
  • ArticleCrawler: Extends BaseCrawler, extracts headlines, subheadings, publish date, and author from article pages
  • PriceCrawler: Extends BaseCrawler, specialized for extracting price data using defined regex rules
  • AdvancedCrawler: Extends BaseCrawler, supports depth-limited, robots.txt-aware, multi-page crawling with logging
🧩 Interfaces
  • ILinkCrawler: Link extraction
  • IMediaCrawler: Image/media extraction
  • IMetadataCrawler: Meta tags (title, description, keywords)
  • IMultimediaCrawler: Video/audio/iframe extraction
  • ITextCrawler: Plain text extraction
  • IArticleCrawler: Article-specific extraction (headline, subheadings, etc.)
  • IExporter, IExportDifferences: Export results and diffs
📦 Models
  • CrawlResult: Snapshot of crawled content (URL, title, description, keywords, links, images, multimedia, text)
  • CrawlDifferences: Differences between crawl snapshots
  • PriceExtractionRule: Regex-based rule for price detection
📁 File Handlers (Exporters)
  • JsonExporter, CsvExporter, MarkdownExporter, TextExporter, XmlExporter: Export to corresponding file formats
🛠 Utilities & Exceptions
  • RobotsTxtHandler: Fetches and parses robots.txt, enforces crawling rules
  • ContentNotFoundException, DynamicContentException, InvalidUrlException: Custom exceptions for robust error handling
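The custom exceptions above allow a crawl to fail gracefully when a target is missing, dynamic, or malformed. A minimal sketch, assuming `GetText` can throw the listed exceptions (which methods throw which exception is not documented here, so treat this as illustrative):

```csharp
using System;
using PageProbe;

public static class SafeCrawl
{
    // Returns the page text, or an empty string if the crawl fails in a known way.
    public static string TryGetText(BaseCrawler crawler, string url)
    {
        try
        {
            return crawler.GetText(url);
        }
        catch (InvalidUrlException ex)
        {
            Console.Error.WriteLine($"Bad URL '{url}': {ex.Message}");
        }
        catch (ContentNotFoundException ex)
        {
            Console.Error.WriteLine($"No content found at '{url}': {ex.Message}");
        }
        catch (DynamicContentException ex)
        {
            // PageProbe targets statically generated sites; dynamic pages are out of scope.
            Console.Error.WriteLine($"Dynamic content at '{url}': {ex.Message}");
        }
        return string.Empty;
    }
}
```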

💡 Usage Examples

🌐 Example 1: BaseCrawler with export

public static void Main(string[] args)
{
    var crawler = new BaseCrawler();                        // Initialize the crawler with your desired settings
    var url = "https://example.com";                        // Replace with the URL you want to crawl
    var exporter = new JsonExporter();                      // Replace with your desired exporter
    var crawlResult = new CrawlResult()
    {
        Url = url,
        Title = crawler.GetMetaTitle(url),
        Description = crawler.GetMetaDescription(url),
        Keywords = crawler.GetMetaKeywords(url).Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList(),
        Links = crawler.GetLinks(url),
        Images = crawler.GetImages(url),
        Multimedia = crawler.GetMultimedia(url),
        Text = crawler.GetText(url),
    };
    exporter.Export(data: crawlResult, fileName: "crawl_example");
}

📰 Example 2: ArticleCrawler

public static void Main(string[] args)
{
    var articleCrawler = new ArticleCrawler();
    var url = "https://example.com/article";

    var headlines = articleCrawler.ExtractHeadLine(url);
    var subHeadings = articleCrawler.ExtractSubheadings(url);
    var publishDate = articleCrawler.ExtractPublishDate(url);
    var author = articleCrawler.ExtractAuthor(url);
}

💰 Example 3: PriceCrawler

public static void Main(string[] args)
{
    var priceCrawler = new PriceCrawler();

    var url = "https://example.com/product-page";

    priceCrawler.ExtractProductTitle(url: url).ToList().ForEach(Console.WriteLine);
    priceCrawler.ExtractMainPrice(url: url).ToList().ForEach(Console.WriteLine);
}

🌐 Example 4: AdvancedCrawler with Export

public static void Main(string[] args)
{
    var advancedCrawler = new AdvancedCrawler();    // Initialize the crawler with your desired settings
    var exporter = new XmlExporter();               // Replace with desired exporter
    var url = "https://example.com";                // Replace with the URL you want to crawl
    
    var results = advancedCrawler.CrawlSite(startUrl: url, maxDepth: 1, respectRobotsTxt: true, maxUrls: 10);
    results.ForEach(result => exporter.Export(result, $"crawl_results_{results.IndexOf(result)}"));
}
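The snapshot-comparison feature can build on the examples above: take two `CrawlResult` snapshots of the same URL at different times, then produce a `CrawlDifferences` and export it via an `IExportDifferences` implementation. The exact method names for computing and exporting differences are not shown on this page, so the sketch below only demonstrates building the two snapshots; consult the full documentation for the comparison API:

```csharp
using System;
using PageProbe;

public static class SnapshotExample
{
    public static void Main(string[] args)
    {
        var crawler = new BaseCrawler();
        var url = "https://example.com";    // Replace with the URL you want to monitor

        // First snapshot (or load a previously exported one instead)
        CrawlResult before = TakeSnapshot(crawler, url);

        // ... later, after the site may have changed ...
        CrawlResult after = TakeSnapshot(crawler, url);

        // Feed 'before' and 'after' into the library's CrawlDifferences /
        // IExportDifferences machinery to detect and export the changes.
    }

    private static CrawlResult TakeSnapshot(BaseCrawler crawler, string url) =>
        new CrawlResult
        {
            Url = url,
            Title = crawler.GetMetaTitle(url),
            Links = crawler.GetLinks(url),
            Text = crawler.GetText(url),
        };
}
```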

Full Documentation

See the full documentation for architecture, the API reference, and advanced usage.


Build & Test

Build the Solution

  1. Open in Visual Studio
  2. Press Ctrl + Shift + B

Run Tests

dotnet test

Contributing

As this is an internal project:

  • Use feature/* branches
  • Follow .NET conventions
  • Write meaningful commit messages

License

This project is published on NuGet and distributed publicly.

Target frameworks

.NET net8.0 is compatible. net9.0 and net10.0, along with the platform-specific variants of net8.0 through net10.0 (android, browser, ios, maccatalyst, macos, tvos, windows), were computed as compatible.


| Version | Downloads | Last updated |
|---------|-----------|--------------|
| 2.0.0   | 47        | 5/27/2025    |
| 1.0.9   | 50        | 5/25/2025    |
| 1.0.8   | 54        | 5/25/2025    |
| 1.0.7   | 53        | 5/25/2025    |
| 1.0.1   | 61        | 5/23/2025    |
| 1.0.0 (deprecated) | 137 | 5/23/2025 |