CrawlSharp 1.0.13

Using the .NET CLI:

dotnet add package CrawlSharp --version 1.0.13

Using the Package Manager Console in Visual Studio (this command uses the NuGet module's version of Install-Package):

NuGet\Install-Package CrawlSharp -Version 1.0.13

For projects that support PackageReference, copy this XML node into the project file to reference the package:

<PackageReference Include="CrawlSharp" Version="1.0.13" />

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package:

<PackageVersion Include="CrawlSharp" Version="1.0.13" />

Then reference the package, without a version, in the project file:

<PackageReference Include="CrawlSharp" />

Using Paket:

paket add CrawlSharp --version 1.0.13

The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or the source code of the script to reference the package:

#r "nuget: CrawlSharp, 1.0.13"

The #:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package:

#:package CrawlSharp@1.0.13

To install as a Cake Addin:

#addin nuget:?package=CrawlSharp&version=1.0.13

To install as a Cake Tool:

#tool nuget:?package=CrawlSharp&version=1.0.13

CrawlSharp

CrawlSharp is a library and integrated webserver for crawling basic web content.

New in v1.0.x

  • Initial release
  • Added support for headless browser crawling (using Microsoft.Playwright)

Bugs, Feedback, or Enhancement Requests

Please feel free to start an issue or a discussion!

Simple Example, Embedded

Embedding CrawlSharp into your application is simple and requires minimal configuration. Refer to the Test project for a full example.

using CrawlSharp;

Settings settings = new Settings();
settings.Crawl.StartUrl = "http://www.mywebpage.com";
settings.Crawl.UseHeadlessBrowser = true; // slow but useful for sites that block bots or where content must be rendered

using (WebCrawler crawler = new WebCrawler(settings))
{
  await foreach (WebResource resource in crawler.CrawlAsync()) 
    Console.WriteLine(resource.Status + ": " + resource.Url);
}

WebCrawler.CrawlAsync returns an IAsyncEnumerable<WebResource> and is consumed with await foreach, whereas WebCrawler.Crawl returns an IEnumerable<WebResource> and is consumed with an ordinary foreach.
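
For callers that cannot use await foreach, the synchronous path might look like the sketch below. It assumes that Crawl(), like CrawlAsync() in the example above, can be called without arguments. PrintCrawl is a hypothetical helper name, and the start URL is read from the first command-line argument.

```csharp
using System;
using CrawlSharp;

// Hypothetical helper: crawl synchronously from startUrl and print each result.
// Assumes Crawl() can be called with no arguments, mirroring CrawlAsync().
void PrintCrawl(string startUrl)
{
    Settings settings = new Settings();
    settings.Crawl.StartUrl = startUrl;

    using (WebCrawler crawler = new WebCrawler(settings))
    {
        // Crawl returns IEnumerable<WebResource>, so a plain foreach is enough.
        foreach (WebResource resource in crawler.Crawl())
            Console.WriteLine(resource.Status + ": " + resource.Url);
    }
}

// Only crawl when a start URL is supplied on the command line.
if (args.Length > 0) PrintCrawl(args[0]);
```

The synchronous form blocks the calling thread while it runs, so the await foreach form is likely the better fit inside an application that is already asynchronous.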

Web Resources

Objects crawled using CrawlSharp have the following properties:

  • Url - the URL from which the resource was retrieved
  • ParentUrl - the URL from which the Url was identified
  • Filename - the filename component from the URL, if any
  • Depth - the depth level at which the Url was identified
  • Status - the HTTP status code returned when retrieving the Url
  • ContentLength - the content length of the body returned when retrieving Url
  • ContentType - the content type returned while retrieving Url
  • MD5Hash - the MD5 hash of the Data
  • SHA1Hash - the SHA1 hash of the Data
  • SHA256Hash - the SHA256 hash of the Data
  • LastModified - the DateTime at which, according to the response headers, the object was last modified
  • Headers - a NameValueCollection with the headers returned while retrieving Url
  • Data - a byte[] containing the data returned while retrieving Url
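
As an illustration of how the hash properties relate to Data, the sketch below recomputes a SHA-256 digest over a byte array and compares it with a reported value. It assumes the hash properties are hexadecimal strings, which the list above does not state. Sha256Hex and MatchesSha256 are hypothetical helper names, and the sample bytes stand in for resource.Data.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: SHA-256 of a payload as a lowercase hex string.
string Sha256Hex(byte[] data) =>
    Convert.ToHexString(SHA256.HashData(data)).ToLowerInvariant();

// Hypothetical helper: compare a payload against a reported hash.
// Assumes the reported hash is hex text; letter case is ignored.
bool MatchesSha256(byte[] data, string reportedHash) =>
    !string.IsNullOrEmpty(reportedHash) &&
    string.Equals(Sha256Hex(data), reportedHash, StringComparison.OrdinalIgnoreCase);

// Stand-in for resource.Data so the sketch runs without a crawl.
byte[] sample = Encoding.ASCII.GetBytes("abc");

// prints ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
Console.WriteLine(Sha256Hex(sample));

// A reported hash in upper case still matches.
Console.WriteLine(MatchesSha256(sample,
    "BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD"));
```

In a crawl loop, the same check would presumably be MatchesSha256(resource.Data, resource.SHA256Hash), with a null check on Data for resources that returned no body.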

REST API

CrawlSharp includes a project called CrawlSharp.Server which allows you to deploy a RESTful front-end for CrawlSharp. Refer to REST_API.md and also the Postman collection in the root of this repository for details.

CrawlSharp.Server will by default listen on host localhost and port 8000, meaning it will not accept requests from outside of the machine.

To change this, specify the hostname as the first argument and the port as the second, e.g. dotnet CrawlSharp.Server myhostname.com 8888.

$ dotnet CrawlSharp.Server 

                          _     _  _
   ___ _ __ __ ___      _| |  _| || |_
  / __| '__/ _` \ \ /\ / / | |_  ..  _|
 | (__| | | (_| |\ V  V /| | |_      _|
  \___|_|  \__,_| \_/\_/ |_|   |_||_|

(c)2025 Joel Christner


Usage:
  crawlsharp [hostname] [port]

Where:
  [hostname] is the hostname or IP address on which to listen
  [port] is the port number, greater than or equal to zero, and less than 65536

NOTICE
------
Configured to listen on local address 'localhost'
Service will not receive requests from outside of localhost

Webserver started on http://localhost:8000/

2025-03-01 20:39:17 joel-laptop Info [CrawlSharpServer] server started

Refer to REST_API.md for more information about using the RESTful API.

Running in Docker

A Docker image is available on Docker Hub under jchristn/crawlsharp. To run within Docker Compose, use the start (compose-up.sh and compose-up.bat) and stop (compose-down.sh and compose-down.bat) scripts in the Docker directory.

Using Headless Browser

CrawlSharp can use Microsoft.Playwright to crawl challenging websites that detect and block bots or that require content to be rendered by JavaScript. If you run this code on an Ubuntu machine, use the following script to install the required dependencies. Also note that the $HOME directory must be owned by the user running the code.

#!/bin/bash

# Detect Ubuntu version
VERSION=$(lsb_release -rs)

if [[ "$VERSION" == "24.04" ]]; then
    # Ubuntu 24.04 packages
    PACKAGES="libasound2t64 libatk-bridge2.0-0t64 libatk1.0-0t64 libcups2t64 libgtk-3-0t64"
else
    # Ubuntu 22.04 and earlier
    PACKAGES="libasound2 libatk-bridge2.0-0 libatk1.0-0 libcups2 libgtk-3-0"
fi

# Install common packages plus version-specific ones
sudo apt-get update
sudo apt-get install -y \
    $PACKAGES \
    libnspr4 \
    libnss3 \
    libdrm2 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    libgbm1 \
    libxss1 \
    fonts-liberation \
    ca-certificates

Third-Party Data

CrawlSharp is licensed under MIT and uses the Nager.PublicSuffix library (MIT license) for domain matching coupled with third-party public suffix data (Mozilla Public License v2.0). Please be aware of the license for this information.

Version History

Please refer to CHANGELOG.md for version history.

Target Frameworks

net8.0 is compatible. The platform-specific net8.0 targets (android, browser, ios, maccatalyst, macos, tvos, windows), and net9.0 and net10.0 with the same platform-specific targets, were computed.

Version   Downloads   Last Updated
1.0.13    110         7 days ago
1.0.12    107         7 days ago
1.0.11    124         9 days ago
1.0.10    146         2 months ago
1.0.9     104         2 months ago
1.0.8     67          2 months ago
1.0.7     70          2 months ago
1.0.6     174         4 months ago
1.0.5     152         4 months ago
1.0.4     157         4 months ago
1.0.3     155         4 months ago
1.0.2     220         5 months ago
1.0.1     195         5 months ago
1.0.0     102         5 months ago