WebTools.HtmlScrapper
1.0.0
dotnet add package WebTools.HtmlScrapper --version 1.0.0
NuGet\Install-Package WebTools.HtmlScrapper -Version 1.0.0
<PackageReference Include="WebTools.HtmlScrapper" Version="1.0.0" />
paket add WebTools.HtmlScrapper --version 1.0.0
#r "nuget: WebTools.HtmlScrapper, 1.0.0"
// Install WebTools.HtmlScrapper as a Cake Addin
#addin nuget:?package=WebTools.HtmlScrapper&version=1.0.0
// Install WebTools.HtmlScrapper as a Cake Tool
#tool nuget:?package=WebTools.HtmlScrapper&version=1.0.0
HTML Scrapper for .NET
Introduction
There are excellent options for web scraping, such as Beautiful Soup 4 in Python, but I needed a scraper for the .NET family that is easy to use and understand. So I started this simple, extensible scraper. Using a declarative syntax, you can scrape information from HTML documents (extensible to other markup documents) loaded from a file, a URL, and more.
How to use
The first thing you need to do is load the content. The HtmlDocument class provides several static methods for this:
LoadFromText(string content);
LoadFromPath(string path);
LoadFromStream(Stream stream, Encoding encoding = null);
LoadFromStreamAsync(Stream stream, Encoding encoding = null);
LoadFromUrl(string url);
These methods return a Document instance, whose Scrap property is the starting point for scraping.
var rootNode = HtmlDocument.LoadFromUrl("https://www.foo.com").Scrap;
At this point, the content has already been loaded and parsed, so you can start moving through the document tree. The Scrap property returns the root node/tag (usually the html tag).
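As a rough sketch of the loading step, the snippet below parses the same markup once from a string and once from a stream, using only the LoadFromText and LoadFromStream signatures listed above. The `WebTools.HtmlScrapper` namespace is an assumption based on the package name, and the sample markup is made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using WebTools.HtmlScrapper; // assumed namespace, guessed from the package name

class LoadExample
{
    static void Main()
    {
        // Hypothetical markup, used only for this illustration.
        const string html = "<html><body><p>Hello</p></body></html>";

        // Parse from a string and take the root tag through Scrap.
        var rootFromText = HtmlDocument.LoadFromText(html).Scrap;
        Console.WriteLine(rootFromText.Name);

        // Parse the same content from a stream with an explicit encoding.
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            var rootFromStream = HtmlDocument.LoadFromStream(stream, Encoding.UTF8).Scrap;
            Console.WriteLine(rootFromStream.Children.Count);
        }
    }
}
```

Passing the encoding explicitly avoids relying on whatever default the library picks when the parameter is left as null.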
TagNode class
The TagNode class represents a single tag in the document. It exposes the following properties and methods, which you can use to extract information and move through the document tree starting from a single tag:
//Properties
string Name
Dictionary<string,string> Attributes
List<TagNode> Children
TagNode Parent
IEnumerable<TagNode> Siblings
TagNode NextSibling // The following Tag in the same level
TagNode PrevSibling // The previous Tag in the same level
IEnumerable<TagNode> NextFullSiblings // The following Tags in the same level
IEnumerable<TagNode> PrevFullSiblings // The previous Tags in the same level
IEnumerable<TagNode> Descendants // uses BFS
IEnumerable<TagNode> Ancestors
string Text // The text in the current level (directly in the current Tag)
string GetFullText // The full text inside the current tag and its descendants
//Methods
IEnumerable<TagNode> FindAll() // returns all the tags starting in the current one
IEnumerable<TagNode> FindAll(Func<TagNode, bool> filter) // filters the tags with a given condition
IEnumerable<TagNode> Find(Func<TagNode, IEnumerable<TagNode>> iter, Func<TagNode, bool> filter) // filter the tags following a specific iterator and condition
IEnumerable<TagNode> FindAncestors(Func<TagNode, bool> filter) // the same as FindAll but going up in the tree
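The members above could be combined along these lines. This is only a sketch that relies on the signatures as listed; the `WebTools.HtmlScrapper` namespace and the sample markup are assumptions, and the tag-name comparison ignores case because the casing used by the parser is not documented here.

```csharp
using System;
using System.Linq;
using WebTools.HtmlScrapper; // assumed namespace, guessed from the package name

class TagNodeExample
{
    static void Main()
    {
        // Hypothetical markup, used only for this illustration.
        const string html =
            "<html><body><div id=\"menu\"><a href=\"/home\">Home</a></div></body></html>";
        var root = HtmlDocument.LoadFromText(html).Scrap;

        // FindAll with a filter: keep only the anchor tags.
        var links = root.FindAll(
            t => string.Equals(t.Name, "a", StringComparison.OrdinalIgnoreCase));

        foreach (var link in links)
        {
            // Text holds the text placed directly inside the tag.
            Console.WriteLine(link.Text);

            // Attributes is a Dictionary, so TryGetValue avoids an exception
            // when the attribute is missing.
            if (link.Attributes.TryGetValue("href", out var href))
                Console.WriteLine(href);

            // FindAncestors applies the same kind of filter while going up the tree.
            var enclosingDivs = link.FindAncestors(
                t => string.Equals(t.Name, "div", StringComparison.OrdinalIgnoreCase));
            Console.WriteLine(enclosingDivs.Count());
        }
    }
}
```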
You can also use the provided IEnumerable extension methods to filter and move through the tree:
// Filters elements by a Tag name
IEnumerable<TagNode> WithTag(this IEnumerable<TagNode> obj, string tagName)
// Filters current elements by a list of Tag names
IEnumerable<TagNode> WithTag(this IEnumerable<TagNode> obj, params string[] tagsNames)
// Filters elements, keeping those that contain a specific attribute
IEnumerable<TagNode> WithAttribute(this IEnumerable<TagNode> obj, string attrName)
// Filters elements by an attribute and a value
IEnumerable<TagNode> WithAttribute(this IEnumerable<TagNode> obj, string attrName, string attrValue)
// Filters elements by a specific class
IEnumerable<TagNode> WithClass(this IEnumerable<TagNode> obj, string className)
// Filters elements using a condition that applies to the classes
IEnumerable<TagNode> WithClass(this IEnumerable<TagNode> obj, Func<string, bool> func)
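A short sketch of these filters in use follows. It assumes the extension methods live in the same (assumed) `WebTools.HtmlScrapper` namespace as the rest of the library, and the markup is invented for the example.

```csharp
using System;
using System.Linq;
using WebTools.HtmlScrapper; // assumed namespace, guessed from the package name

class FilterExample
{
    static void Main()
    {
        // Hypothetical markup, used only for this illustration.
        const string html =
            "<html><body>" +
            "<h1 class=\"title\">Title</h1>" +
            "<a class=\"btn-primary\" href=\"/go\" title=\"Go\">Go</a>" +
            "<a href=\"/plain\">Plain</a>" +
            "</body></html>";
        var root = HtmlDocument.LoadFromText(html).Scrap;

        // Several tag names at once (params overload).
        var headingsAndLinks = root.Descendants.WithTag("h1", "a");

        // Tags that carry a "title" attribute, whatever its value.
        var withTitle = root.Descendants.WithAttribute("title");

        // Tags whose attribute has one specific value.
        var goLinks = root.Descendants.WithAttribute("href", "/go");

        // Tags with a class that satisfies a predicate.
        var buttons = root.Descendants.WithClass(c => c.StartsWith("btn-"));

        Console.WriteLine(headingsAndLinks.Count());
        Console.WriteLine(withTitle.Count());
        Console.WriteLine(goLinks.Count());
        Console.WriteLine(buttons.Count());
    }
}
```

Because every filter takes and returns IEnumerable<TagNode>, each result can be fed into another filter or into a LINQ operator.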
Remember that you can also use any of the usual LINQ extension methods, such as Where, Any, All, and more.
You can also chain all these methods where possible:
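var links = node.Descendants
    .WithTag("a")
    .WithClass(c => c.StartsWith("link-"))
    // Skip anchors without an href, since the dictionary indexer throws on a missing key
    .WithAttribute("href")
    .Select(t => t.Attributes["href"]);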
ToDo
There is a lot of pending functionality, and there are improvements I want to make so that the library is more fluent to use. If you have an idea or a request, please contact me.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.2 is compatible. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
- .NETCoreApp 2.2
  - No dependencies.
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 1,977 | 9/10/2019 |