ScrapeAAS.MessagePipe
1.0.2
See the version list below for details.
dotnet add package ScrapeAAS.MessagePipe --version 1.0.2
NuGet\Install-Package ScrapeAAS.MessagePipe -Version 1.0.2
<PackageReference Include="ScrapeAAS.MessagePipe" Version="1.0.2" />
paket add ScrapeAAS.MessagePipe --version 1.0.2
#r "nuget: ScrapeAAS.MessagePipe, 1.0.2"
// Install ScrapeAAS.MessagePipe as a Cake Addin
#addin nuget:?package=ScrapeAAS.MessagePipe&version=1.0.2

// Install ScrapeAAS.MessagePipe as a Cake Tool
#tool nuget:?package=ScrapeAAS.MessagePipe&version=1.0.2
Scrape as a service
ScrapeAAS integrates existing packages and ASP.NET features into a toolstack that enables you, the developer, to design your scraping service in a familiar environment.
Quickstart
Add ASP.NET Hosting, ScrapeAAS, a validator of your choice (here Dawn.Guard, RIP), an object mapper of your choice (here AutoMapper), and the database/message queue you feel most comfortable with (here EF Core with SQLite).
dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection
Full example of scraping the r/dotnet subreddit.
Create a crawler, a service that periodically triggers scraping.
var builder = Host.CreateApplicationBuilder(args);
builder.Services
    .AddAutoMapper()
    .AddScrapeAAS()
    .AddHostedService<RedditSubredditCrawler>()
    .AddDataflow<RedditPostSpider>()
    .AddDataflow<RedditSqliteSink>();
await builder.Build().RunAsync();
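The messages flowing between the stages are plain types. The sample never shows their definitions, so the following is only a plausible sketch inferred from the constructor calls below; the real project may shape these records differently (here `Uri` stands in for AngleSharp's `Url`, and `RedditUserId` is a hypothetical name for the author wrapper):

```csharp
using System;

// Hypothetical message records matching the constructor calls in the sample.
var subreddit = new RedditSubreddit("dotnet", new("https://old.reddit.com/r/dotnet"));
Console.WriteLine(subreddit.Name); // dotnet

sealed record RedditUserId(string Id);
sealed record RedditSubreddit(string Name, Uri Url);
sealed record RedditPost(Uri Url, string Title, long Upvotes, long Comments,
    Uri CommentsUrl, DateTimeOffset PostedAt, RedditUserId PostedBy);
```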
sealed class RedditSubredditCrawler : BackgroundService {
private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
private readonly IDataflowPublisher<RedditSubreddit> _publisher;
...
protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
... execute service scope periodically
}
private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
{
_logger.LogInformation("Crawling /r/dotnet");
await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
_logger.LogInformation("Crawling complete");
}
}
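The elided "execute service scope periodically" step is typically a timer loop inside ExecuteAsync. A minimal, self-contained sketch of just the scheduling part, assuming PeriodicTimer from .NET 6+ (the tick body is a counter here; the real service would create a service scope and call CrawlAsync):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stop after 5 seconds at the latest, mirroring the host's stoppingToken.
var stoppingToken = new CancellationTokenSource(TimeSpan.FromSeconds(5)).Token;
using var timer = new PeriodicTimer(TimeSpan.FromMilliseconds(10));
int crawls = 0;
while (crawls < 3 && await timer.WaitForNextTickAsync(stoppingToken))
{
    crawls++; // placeholder for: await CrawlAsync(publisher, stoppingToken);
}
Console.WriteLine(crawls); // 3
```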
Implement your spiders: services that collect and normalize data.
sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
private readonly IDataflowPublisher<RedditPost> _publisher;
...
private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
{
Url root = new("https://old.reddit.com/");
_logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
_logger.LogInformation("Request complete");
var queriedContent = document
.QuerySelectorAll("div.thing")
.AsParallel()
.Select(div => new
{
PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
Title = div.QuerySelector("a.title")?.TextContent,
Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
Comments = div.QuerySelector("a.comments")?.TextContent,
CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
PostedBy = div.QuerySelector("a.author")?.TextContent,
})
.Select(queried => new RedditPost(
new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
Guard.Argument(queried.Title).NotEmpty(),
long.Parse(queried.Upvotes.AsSpan()),
Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
new(queried.CommentsUrl),
DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
new(Guard.Argument(queried.PostedBy).NotEmpty())
), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
foreach (var item in queriedContent)
{
await _publisher.PublishAsync(item, stoppingToken);
}
_logger.LogInformation("Parsing complete");
}
}
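The comment-count parsing used above, where old.reddit renders strings like "128 comments" (or just "comment" with no number), can be exercised in isolation. This is the same leading-digits rule as in the spider, extracted into a hypothetical helper:

```csharp
using System;
using System.Text.RegularExpressions;

// Same rule as the spider above: take the leading digits of the
// "a.comments" text, defaulting to 0 when no number is present.
static long ParseCommentCount(string? text) =>
    Regex.Match(text ?? "", @"^\d+") is { Success: true } m ? long.Parse(m.Value) : 0;

Console.WriteLine(ParseCommentCount("128 comments")); // 128
Console.WriteLine(ParseCommentCount("comment"));      // 0
Console.WriteLine(ParseCommentCount(null));           // 0
```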
Add a sink, a service that commits the scraped data to disk or the network.
sealed class RedditSqliteSink : IAsyncDisposable, IDataflowHandler<RedditSubreddit>, IDataflowHandler<RedditPost>
{
private readonly RedditPostSqliteContext _context;
private readonly IMapper _mapper;
...
public async ValueTask DisposeAsync()
{
await _context.Database.EnsureCreatedAsync();
await _context.SaveChangesAsync();
}
public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
{
var messageDto = _mapper.Map<RedditSubredditDto>(message);
await _context.Database.EnsureCreatedAsync(cancellationToken);
await _context.Subreddits.AddAsync(messageDto, cancellationToken);
}
public async ValueTask HandleAsync(RedditPost message, CancellationToken cancellationToken = default)
{
var messageDto = _mapper.Map<RedditPostDto>(message);
if (await _context.Users.FindAsync(new object[] { message.PostedBy.Id }, cancellationToken) is { } existingUser)
{
messageDto.PostedById = existingUser.Id;
messageDto.PostedBy = existingUser;
}
await _context.Database.EnsureCreatedAsync(cancellationToken);
await _context.Posts.AddAsync(messageDto, cancellationToken);
}
}
Why not WebReaper or DotnetSpider?
I have tried both toolstacks and found them wanting, so I tried to do better by delegating as much work as reasonable to existing projects.
In addition to my own goals, from evaluating both libraries I wish to keep all their pros and discard all their cons. The verbosity of this library sits comfortably between WebReaper and DotnetSpider, but closer to the DotnetSpider end of things.
- Integration into ASP.NET Hosting.
- No dependencies at the core of the project. Instead package a reasonable set of addons by default.
- Use and expose integrated NuGet packages in addons when possible to allow developers to benefit from existing ecosystems.
Evaluation of DotnetSpider
The overall data flow in ScrapeAAS is adopted from DotnetSpider: Crawler --> Spider --> Sink.
- Pro: Pub/Sub event handling for decoupled data flow.
- Pro: Easy extendibility by tapping events.
- Con: Terrible debugging experience using model annotations.
- Con: Smelly dynamic-riddled design when storing to a database.
- Con: Missing retry policies.
- Con: Much boilerplate necessary.
Evaluation of WebReaper
The Puppeteer browser handling is a mixture of the lifetime tracking http handler and the WebReaper Puppeteer integration.
- Pro: Simple declarative builder API. No boilerplate needed.
- Pro: Easy extendibility by implementing interfaces.
- Pro: Puppeteer browser.
- Con: Unable to control data flow.
- Con: Unable to parse data.
- Con: No ASP.NET or any DI integration possible.
- Con: Dependencies for optional extensions, such as Redis, MySql, and RabbitMq, are always included in the package.
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
- net6.0
  - MessagePipe (>= 1.7.4)
  - Microsoft.Extensions.Hosting (>= 7.0.1)
  - ScrapeAAS.Contracts (>= 1.0.2)
- net7.0
  - MessagePipe (>= 1.7.4)
  - Microsoft.Extensions.Hosting (>= 7.0.1)
  - ScrapeAAS.Contracts (>= 1.0.2)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated |
---|---|---|
1.0.3 | 126 | 12/31/2023 |
1.0.2 | 62 | 12/31/2023 |
1.0.1 | 66 | 12/31/2023 |
1.0.0 | 60 | 12/21/2023 |
0.1.2 | 120 | 11/5/2023 |
0.1.1 | 75 | 10/15/2023 |
0.1.0 | 73 | 10/14/2023 |
0.1.0-hotfix.1 | 59 | 10/15/2023 |
0.1.0-alpha.3 | 60 | 10/14/2023 |
0.0.0-preview.0.71 | 62 | 10/14/2023 |