# PageProbe 2.0.0
PageProbe is a .NET-based web crawling library designed to monitor and extract content from statically generated websites. It enables developers, IT students, and enthusiasts to gather and export data such as links, media, metadata, multimedia, text, or even price information. The library supports both real-time and scheduled crawling, with the ability to store results and compare differences over time.
## Features
- Modular and extensible architecture using interfaces and models
- Extract links, images, videos, metadata, and plain text from webpages
- Specialized price crawler with regex-based rule support
- Specialized article crawler for extracting headings, publish dates, authors, etc.
- Multipage crawling with depth limiter and robots.txt awareness
- Export to multiple formats (JSON, CSV, XML, TXT, Markdown)
- Snapshot comparisons for detecting changes
- Console and file-based output
- Built-in logging support for debugging and monitoring
- Asynchronous support for crawling over time
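
The snapshot-comparison feature listed above pairs two `CrawlResult` snapshots and exports what changed. The sketch below is illustrative only: the `CompareWith` method, the `ExportDifferences` overload, and the `LoadSavedSnapshot` helper are assumptions about the API, not confirmed signatures — consult the actual documentation for the real comparison workflow.

```csharp
using PageProbe;

var crawler = new BaseCrawler();
var url = "https://example.com";

// Hypothetical helper: load a previously exported snapshot from disk.
CrawlResult yesterday = LoadSavedSnapshot("crawl_yesterday");

var today = new CrawlResult
{
    Url = url,
    Links = crawler.GetLinks(url),
    Text = crawler.GetText(url),
};

// CrawlDifferences is a model from the library; how it is produced
// is assumed here (CompareWith is a hypothetical method name).
CrawlDifferences diff = today.CompareWith(yesterday);
new JsonExporter().ExportDifferences(diff, "changes"); // hypothetical overload
```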
## 📦 Installation Methods
### 1. NuGet Installation (via Visual Studio or command line)
Install the package with the Package Manager Console:
```powershell
Install-Package PageProbe
```
### 2. .NET CLI Installation (cross-platform, works in any terminal)
Install the package with the .NET CLI:
```shell
dotnet add package PageProbe
```
### 3. Manual install in project file (`.csproj`)
Add the following to your project file, substituting the version you want:
```xml
<ItemGroup>
  <PackageReference Include="PageProbe" Version="2.0.0" />
</ItemGroup>
```
### 4. Clone the Repository
Clone the repo and add it to your solution if you want to customize or extend the library:
```shell
git clone https://dev.azure.com/emilberglund/_git/rammeverk_gruppe2
```
## 🚀 Start using PageProbe
Reference PageProbe in your project with the namespace:
```csharp
using PageProbe;
```
## Project Structure
### Main Components
#### 🕷 Crawlers
- `BaseCrawler`: Core logic for HTML parsing and extraction
- `ArticleCrawler`: Extends `BaseCrawler`; extracts headlines, subheadings, publish date, and author from article pages
- `PriceCrawler`: Extends `BaseCrawler`; specialized for extracting price data using defined regex rules
- `AdvancedCrawler`: Extends `BaseCrawler`; supports depth-limited, robots.txt-aware, multi-page crawling with logging
#### 🧩 Interfaces
- `ILinkCrawler`: Link extraction
- `IMediaCrawler`: Image/media extraction
- `IMetadataCrawler`: Meta tags (title, description, keywords)
- `IMultimediaCrawler`: Video/audio/iframe extraction
- `ITextCrawler`: Plain text extraction
- `IArticleCrawler`: Article-specific extraction (headline, subheadings, etc.)
- `IExporter`, `IExportDifferences`: Export results and diffs
#### 📦 Models
- `CrawlResult`: Snapshot of crawled content (URL, title, description, keywords, links, images, multimedia, text)
- `CrawlDifferences`: Differences between crawl snapshots
- `PriceExtractionRule`: Regex-based rule for price detection
#### 📁 File Handlers (Exporters)
- `JsonExporter`, `CsvExporter`, `MarkdownExporter`, `TextExporter`, `XmlExporter`: Export to the corresponding file formats
#### 🛠 Utilities & Exceptions
- `RobotsTxtHandler`: Fetches and parses robots.txt, enforces crawling rules
- `ContentNotFoundException`, `DynamicContentException`, `InvalidUrlException`: Custom exceptions for robust error handling
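
The custom exceptions let callers distinguish failure modes instead of catching a generic `Exception`. A minimal sketch, assuming the extraction methods throw these exceptions on failure (the exception names are from the library; when each is thrown is an assumption):

```csharp
using PageProbe;

var crawler = new BaseCrawler();
try
{
    var text = crawler.GetText("https://example.com");
    Console.WriteLine(text);
}
catch (InvalidUrlException ex)
{
    Console.Error.WriteLine($"Bad URL: {ex.Message}");
}
catch (ContentNotFoundException ex)
{
    Console.Error.WriteLine($"Nothing to extract: {ex.Message}");
}
catch (DynamicContentException ex)
{
    // PageProbe targets statically generated sites; JS-rendered content is out of scope.
    Console.Error.WriteLine($"Dynamic content detected: {ex.Message}");
}
```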
## 💡 Usage Examples
### 🌐 Example 1: BaseCrawler with export
```csharp
public static void Main(string[] args)
{
    var crawler = new BaseCrawler();   // Initialize the crawler with your desired settings
    var url = "https://example.com";   // Replace with the URL you want to crawl
    var exporter = new JsonExporter(); // Replace with your desired exporter
    var crawlResult = new CrawlResult()
    {
        Url = url,
        Title = crawler.GetMetaTitle(url),
        Description = crawler.GetMetaDescription(url),
        Keywords = crawler.GetMetaKeywords(url)
            .Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList(),
        Links = crawler.GetLinks(url),
        Images = crawler.GetImages(url),
        Multimedia = crawler.GetMultimedia(url),
        Text = crawler.GetText(url),
    };
    exporter.Export(data: crawlResult, fileName: "crawl_example");
}
```
### 📰 Example 2: ArticleCrawler
```csharp
public static void Main(string[] args)
{
    var articleCrawler = new ArticleCrawler();
    var url = "https://example.com/article";
    var headlines = articleCrawler.ExtractHeadLine(url);
    var subHeadings = articleCrawler.ExtractSubheadings(url);
    var publishDate = articleCrawler.ExtractPublishDate(url);
    var author = articleCrawler.ExtractAuthor(url);
}
```
### 💰 Example 3: PriceCrawler
```csharp
public static void Main(string[] args)
{
    var priceCrawler = new PriceCrawler();
    var url = "https://example.com/product-page";
    priceCrawler.ExtractProductTitle(url: url).ToList().ForEach(Console.WriteLine);
    priceCrawler.ExtractMainPrice(url: url).ToList().ForEach(Console.WriteLine);
}
```
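
The price crawler's regex rules can presumably be customized via `PriceExtractionRule`. In the sketch below, the property names (`Name`, `Pattern`) and the `AddRule` method are assumptions used for illustration — check the model's actual members before relying on them:

```csharp
using PageProbe;

var priceCrawler = new PriceCrawler();

// Hypothetical property names on the PriceExtractionRule model.
var rule = new PriceExtractionRule
{
    Name = "Euro",
    Pattern = @"€\s*\d+(?:[.,]\d{2})?", // matches e.g. "€ 19,99"
};
priceCrawler.AddRule(rule); // hypothetical method

priceCrawler.ExtractMainPrice("https://example.com/product-page")
    .ToList()
    .ForEach(Console.WriteLine);
```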
### 🌐 Example 4: AdvancedCrawler with Export
```csharp
public static void Main(string[] args)
{
    var advancedCrawler = new AdvancedCrawler(); // Initialize the crawler with your desired settings
    var exporter = new XmlExporter();            // Replace with desired exporter
    var url = "https://example.com";             // Replace with the URL you want to crawl
    var results = advancedCrawler.CrawlSite(startUrl: url, maxDepth: 1, respectRobotsTxt: true, maxUrls: 10);
    // Use an index loop rather than IndexOf inside ForEach, which is O(n²)
    // and picks the wrong index when duplicate results occur.
    for (var i = 0; i < results.Count; i++)
    {
        exporter.Export(results[i], $"crawl_results_{i}");
    }
}
```
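
When `respectRobotsTxt` is enabled, the crawler consults `RobotsTxtHandler` internally; the handler can presumably also be used directly to pre-check URLs. The `IsAllowed` method name below is an assumption, shown only as a sketch:

```csharp
using PageProbe;

var robots = new RobotsTxtHandler();

// Hypothetical method name -- check the actual RobotsTxtHandler API.
if (robots.IsAllowed("https://example.com/private/page"))
{
    // Safe to crawl according to the site's robots.txt rules.
}
```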
## Full Documentation
See the full documentation here for architecture, API reference, and advanced usage.
## Build & Test
### Build the Solution
- Open in Visual Studio
- Press `Ctrl + Shift + B`
### Run Tests
```shell
dotnet test
```
## Contributing
As this is an internal project:
- Use `feature/*` branches
- Follow .NET conventions
- Write meaningful commit messages
## License
This project is published on NuGet and distributed publicly.
## Compatibility
| Product | Target frameworks |
|---|---|
| .NET | net8.0 is compatible. net9.0, net10.0, and the platform targets (android, browser, ios, maccatalyst, macos, tvos, windows) of net8.0, net9.0, and net10.0 were computed. |
## Dependencies
net8.0:
- Microsoft.Extensions.Logging (>= 9.0.4)
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.4)
- Microsoft.Extensions.Logging.Console (>= 9.0.4)
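
Since PageProbe depends on `Microsoft.Extensions.Logging`, its built-in logging can likely be wired to a standard logger factory. The factory setup below is the stock `Microsoft.Extensions.Logging.Console` pattern; whether crawlers accept an `ILogger` in their constructor is an assumption:

```csharp
using Microsoft.Extensions.Logging;
using PageProbe;

using var loggerFactory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Information));

var logger = loggerFactory.CreateLogger<AdvancedCrawler>();

// Passing the logger to the crawler is an assumption about its constructor.
var crawler = new AdvancedCrawler(logger); // hypothetical overload
```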