CRMScraper.Library 1.1.95
.NET CLI:
dotnet add package CRMScraper.Library --version 1.1.95
Package Manager:
NuGet\Install-Package CRMScraper.Library -Version 1.1.95
PackageReference:
<PackageReference Include="CRMScraper.Library" Version="1.1.95" />
Paket CLI:
paket add CRMScraper.Library --version 1.1.95
Script & Interactive:
#r "nuget: CRMScraper.Library, 1.1.95"
Cake:
// Install CRMScraper.Library as a Cake Addin
#addin nuget:?package=CRMScraper.Library&version=1.1.95
// Install CRMScraper.Library as a Cake Tool
#tool nuget:?package=CRMScraper.Library&version=1.1.95
CRM Scraper
Features
- Static HTML Parsing: Scrape static websites using HtmlAgilityPack.
- Dynamic Content Scraping: Use Playwright to scrape JavaScript-heavy websites.
- Extensible API: Flexible and easily extendable for custom requirements.
- Retry Mechanism: Built-in retry logic with exponential backoff (see the sketch below).
- Concurrent Scraping: Supports scraping multiple pages simultaneously.
- Unit Tested: Extensive test coverage using xUnit.
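The retry behaviour itself is not shown in the usage examples further down. As an illustration only, an exponential backoff loop around a scrape call might look like the following; RetryWithBackoffAsync is a hypothetical helper for this sketch, not part of the library's API:
using System;
using System.Threading.Tasks;
static class RetryExample
{
    // Hypothetical helper: retries an arbitrary async operation with exponential backoff.
    // This illustrates the pattern; it is not the library's actual implementation.
    public static async Task<T> RetryWithBackoffAsync<T>(Func<Task<T>> operation, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Wait 1s, 2s, 4s, ... between attempts; the last failure propagates.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}
With a ScraperClient instance this could be invoked as: await RetryExample.RetryWithBackoffAsync(() => scraperClient.ScrapePageAsync("https://example.com"));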
NuGet Package
You can install the CRMScraper.Library package via NuGet:
| Platform | Version |
|---|---|
| .NET 8.0 | 1.1.58 |
Installation
To install the package via .NET CLI:
dotnet add package CRMScraper.Library --version 1.1.58
To install via the NuGet Package Manager:
Install-Package CRMScraper.Library -Version 1.1.58
Dependencies
- HtmlAgilityPack (>= 1.11.65)
- Microsoft.Playwright (>= 1.47.0)
Project Structure
.
├── .github # GitHub Actions for CI/CD workflows
├── .gitignore # Git ignore rules
├── README.md # Project documentation
├── samples # Sample applications for testing
│   └── ScraperConsoleApp # Console application for manual testing
├── scraping_service_library_net.sln # Solution file
├── scripts # Scripts for building and publishing
│   ├── build_and_test.sh # Script for building and running tests
│   └── publish_nuget.sh # Script for packing and publishing NuGet packages
└── src
    ├── CRMScraper.Library # Main library containing the scraping logic
    │   ├── Core # Core components for scraping logic
    │   └── CRMScraper.Library.csproj # Library project file
    └── CRMScraper.Tests # Unit tests for the library
Getting Started
Prerequisites
- .NET 8 SDK or later
- Playwright (for dynamic content scraping)
Installation
Clone the repository:
git clone https://github.com/yourusername/scraping_service_library_net.git
cd scraping_service_library_net
Restore dependencies:
dotnet restore
Build the project:
dotnet build --configuration Release
Run the console application:
cd samples/ScraperConsoleApp
dotnet run
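Note: dynamic scraping depends on Playwright's browser binaries. If they are not already installed, they can be installed with the playwright.ps1 script that the Microsoft.Playwright package places in the build output; the exact path depends on which project you built and its configuration, for example:
pwsh samples/ScraperConsoleApp/bin/Release/net8.0/playwright.ps1 install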
Running Tests
The project uses xUnit for unit tests and coverlet for code coverage. To run the tests and generate coverage reports:
dotnet test --configuration Release --collect:"XPlat Code Coverage" --results-directory TestResults/ --logger "trx;LogFileName=TestResults.trx"
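The command above writes a Cobertura coverage file under TestResults/. If you also want a browsable HTML report, one option (not part of this repository) is the ReportGenerator global tool:
dotnet tool install -g dotnet-reportgenerator-globaltool
reportgenerator -reports:"TestResults/**/coverage.cobertura.xml" -targetdir:"TestResults/CoverageReport" -reporttypes:Html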
CI/CD Pipeline
This project uses GitHub Actions for continuous integration and deployment. The pipeline automatically:
- Builds the project
- Runs unit tests with code coverage
- Generates a NuGet package and uploads it as an artifact
See .github/workflows/dotnet-ci.yml for the pipeline configuration.
Creating a NuGet Package
To create a NuGet package, run the following command:
dotnet pack --configuration Release --output ./nupkgs
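Publishing the resulting package to NuGet.org (as scripts/publish_nuget.sh presumably automates) can then be done with dotnet nuget push; supply your own API key in place of the placeholder:
dotnet nuget push ./nupkgs/*.nupkg --api-key <YOUR_NUGET_API_KEY> --source https://api.nuget.org/v3/index.json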
Usage
This section explains how to use CRMScraper.Library for both static and dynamic content scraping.
1. Scraping Static Pages
Use the ScraperClient class to scrape static web pages and extract HTML content, JavaScript, and API requests.
Example: Scraping a Static Page
using CRMScraper.Library;
using CRMScraper.Library.Core;
using System;
using System.Net.Http;
using System.Threading.Tasks;
class Program
{
static async Task Main(string[] args)
{
var httpClient = new HttpClient();
var pageElementsExtractor = new PageElementsExtractor(); // Implement to extract JavaScript and API requests
var scraperClient = new ScraperClient(httpClient, pageElementsExtractor);
var result = await scraperClient.ScrapePageAsync("https://example.com");
Console.WriteLine($"URL: {result.Url}");
Console.WriteLine($"HTML Content: {result.HtmlContent}");
Console.WriteLine($"JavaScript Data: {string.Join(", ", result.JavaScriptData)}");
Console.WriteLine($"API Requests: {string.Join(", ", result.ApiRequests)}");
}
}
2. Scraping Dynamic Pages
For JavaScript-heavy websites, ScraperClient uses Playwright to fully render the page before scraping.
Example: Scraping a Dynamic Page
using CRMScraper.Library;
using CRMScraper.Library.Core;
using System;
using System.Net.Http;
using System.Threading.Tasks;
class Program
{
static async Task Main(string[] args)
{
var httpClient = new HttpClient();
var pageElementsExtractor = new PageElementsExtractor(); // Implement to extract JavaScript and API requests
var scraperClient = new ScraperClient(httpClient, pageElementsExtractor);
var result = await scraperClient.ScrapeDynamicPageAsync("https://example.com");
Console.WriteLine($"URL: {result.Url}");
Console.WriteLine($"HTML Content: {result.HtmlContent}");
Console.WriteLine($"API Requests: {string.Join(", ", result.ApiRequests)}");
}
}
3. Concurrent Scraping
For large-scale scraping, use ScraperTaskExecutor to scrape multiple pages concurrently.
Example: Concurrent Scraping Task
using CRMScraper.Library;
using CRMScraper.Library.Core;
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
class Program
{
static async Task Main(string[] args)
{
var httpClient = new HttpClient();
var pageElementsExtractor = new PageElementsExtractor();
var scraperClient = new ScraperClient(httpClient, pageElementsExtractor);
var scraperTaskExecutor = new ScraperTaskExecutor(scraperClient);
var scrapingTask = new ScrapingTask
{
TargetUrl = "https://example.com",
MaxPages = 10,
TimeLimit = TimeSpan.FromMinutes(1),
MaxConcurrentPages = 3,
UseDynamicScraping = true
};
var cancellationTokenSource = new CancellationTokenSource();
var results = await scraperTaskExecutor.ExecuteScrapingTaskAsync(scrapingTask, cancellationTokenSource.Token);
foreach (var result in results)
{
Console.WriteLine($"Scraped URL: {result.Url}");
Console.WriteLine($"HTML Content: {result.HtmlContent}");
}
}
}
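The example above creates a CancellationTokenSource but never cancels it. Since ExecuteScrapingTaskAsync accepts a CancellationToken, you could, assuming the executor observes it, also enforce a hard upper bound from the caller's side in addition to the task's TimeLimit:
var cancellationTokenSource = new CancellationTokenSource();
cancellationTokenSource.CancelAfter(TimeSpan.FromMinutes(2)); // cancel the whole run after 2 minutes
var results = await scraperTaskExecutor.ExecuteScrapingTaskAsync(scrapingTask, cancellationTokenSource.Token);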
Core Classes
- ScraperClient: Core logic for static and dynamic page scraping.
- ScraperTaskExecutor: Manages concurrent scraping tasks and retries.
- ScrapedPageResult: Represents the result of a scraping operation.
- ScrapingTask: Defines a scraping task with limits on pages and time.
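For reference, the task and result types expose the members used in the examples above. The property names below come from those examples; the property types are inferred from usage and may differ from the actual library definitions:
public class ScrapedPageResult
{
    public string Url { get; set; }                  // page that was scraped
    public string HtmlContent { get; set; }          // raw HTML of the page
    public List<string> JavaScriptData { get; set; } // extracted JavaScript data
    public List<string> ApiRequests { get; set; }    // API requests discovered on the page
}
public class ScrapingTask
{
    public string TargetUrl { get; set; }            // starting URL
    public int MaxPages { get; set; }                // upper bound on pages to scrape
    public TimeSpan TimeLimit { get; set; }          // upper bound on total run time
    public int MaxConcurrentPages { get; set; }      // degree of parallelism
    public bool UseDynamicScraping { get; set; }     // render pages with Playwright first
}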
Contributing
Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Compatible and additional computed target framework versions
| Product | Versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, and net8.0-windows were computed. |
Dependencies (net8.0)
- HtmlAgilityPack (>= 1.11.65)
- Microsoft.Playwright (>= 1.47.0)