Lexbor.jl
This package provides a Julia interface to the lexbor HTML parsing library. Lexbor.jl integrates with AbstractTrees.jl to provide an interface for traversing the HTML tree.
Currently the only exposed parts of the library are HTML parsing and DOM querying.
Usage
The package exports its public interface, but qualified identifiers via import are preferred over using.
julia> import Lexbor
Parsing HTML
Create a new DOM object using the Document constructor.
julia> doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")
Document(source = nothing)
Or you can parse a file with Base.open:
julia> doc = open(Lexbor.Document, "file.html")
Document(source = "file.html")
Querying documents
Use query to search for nodes within the document that match the provided CSS selector.
julia> links = Lexbor.query(doc, "a")
1-element Vector{Node}:
 <a>

julia> callouts = Lexbor.query(doc, "div.callout")
1-element Vector{Node}:
 <div>
query also supports passing a function as the first argument that will be called on each matching Node that is found. Using this method avoids allocating a vector and iterating over the results twice if they don't need to be stored.
julia> Lexbor.query(doc, "a") do link
           @show Lexbor.attributes(link)
       end
Lexbor.attributes(link) = Dict{String, Union{Nothing, String}}("href" => "#")
Iteration
You can use any AbstractTrees iterators to traverse the document contents.
julia> import AbstractTrees

julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if Lexbor.is_text(node)
               @show Lexbor.text(node)
           end
       end
Lexbor.text(node) = "\n "
Lexbor.text(node) = "Link"
Lexbor.text(node) = "\n"
Lexbor.text(node) = "\n"
This uses is_text to check whether the current node is a plain text Node and then displays its text content using text. Note that newlines and other whitespace are preserved by lexbor's parsing.
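Because whitespace-only text nodes are kept, code that collects text often wants to filter them out. The following is a minimal sketch rather than part of Lexbor.jl: visible_text is a hypothetical helper built only from is_text, text and AbstractTrees.PreOrderDFS, and the sample HTML is made up for illustration.

```julia
import AbstractTrees
import Lexbor

# Hypothetical helper (not provided by Lexbor.jl): gather the non-blank
# text of every text node below `root`, joined with single spaces.
function visible_text(root)
    parts = String[]
    for node in AbstractTrees.PreOrderDFS(root)
        Lexbor.is_text(node) || continue
        content = strip(Lexbor.text(node))
        # Skip nodes that contain only newlines or other whitespace.
        isempty(content) || push!(parts, content)
    end
    return join(parts, " ")
end

doc = Lexbor.Document("<p>Hello <b>world</b></p>")
result = visible_text(Lexbor.Node(doc))
```

Stripping each node individually discards the original spacing between nodes, so this suits quick inspection more than faithful text extraction.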
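Other predicates and accessors available are is_element (with tag and attributes) and is_comment (with comment); see the API section for details.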
Matching Nodes
Matcher allows for testing a Node to determine whether it matches the given CSS selector.
julia> matcher = Lexbor.Matcher("div.callout")
Matcher("div.callout")

julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if matcher(node)
               @show node
           end
       end
node = <div>
As with query, you can pass a function as the first argument to a Matcher object. In that case the function is called when the Node matches, and the call returns nothing instead of true/false.
julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           matcher(node) do matched
               @show matched
           end
       end
matched = <div>
API
Lexbor.Document — Type
Document(html_str)
open(Document, html_file_path)
Parse HTML into an in-memory tree representing the DOM. To parse an HTML file use Base.open.
Lexbor.Matcher — Type
Matcher(selector; first = false)
Create a new Matcher object that can be used to test whether a Node matches the selector or not. Matcher objects are callable and can be used as follows:
function find_first_node(root, selector)
    # Create the `Matcher` object once, then reuse in the loop.
    matcher = Matcher(selector)
    for node in AbstractTrees.PreOrderDFS(root)
        if matcher(node)
            return node
        end
    end
    return nothing
end
The matcher object has two callable methods. The first, shown above, returns true or false depending on whether the selector matches. The other method takes a function ::Node -> Nothing as its first argument, which is called when the selector matches.
The keyword argument first::Bool has the same behaviour as the first keyword provided by query. See that function's documentation for details.
Lexbor.Node — Type
Node(document::Document)
An iterable object representing a particular node within an HTML Document.
Lexbor.Tree — Type
Tree(document)
Tree(node)
A display type to help visualize DOM structure.
Lexbor.attributes — Method
attributes(node::Node) -> Dict{String,Union{String,Nothing}} | Nothing
Return a Dict of all the attributes of a node, or nothing when it is not a valid element node.
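Since attributes may return nothing and attribute values may themselves be nothing, callers usually need two checks. The sketch below collects href values from links using the function form of query. The sample HTML and the hrefs variable are illustrative assumptions, and an attribute that is absent is handled with Base's get.

```julia
import Lexbor

doc = Lexbor.Document("<p><a href='/docs'>Docs</a> <a>No target</a></p>")

hrefs = String[]
Lexbor.query(doc, "a") do link
    attrs = Lexbor.attributes(link)
    # `attributes` returns `nothing` for non-element nodes.
    attrs === nothing && return nothing
    # A missing attribute, or one without a value, yields `nothing` here.
    href = get(attrs, "href", nothing)
    href === nothing || push!(hrefs, href)
    return nothing
end
```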
Lexbor.comment — Method
comment(node) -> String | Nothing
Return the comment content of a node, or nothing when the node is not a valid comment node.
Lexbor.is_comment — Method
is_comment(node::Node) -> Bool
Is node a comment Node? E.g. <!-- ... --> syntax.
Use comment to access the String contents of the comment.
Lexbor.is_element — Method
is_element(node::Node) -> Bool
Is the node an HTML element Node? E.g. <a>, <div>, etc.
Use tag to access the name of the element as a Symbol and use attributes to access the element attributes.
Lexbor.is_text — Method
is_text(node::Node) -> Bool
Is the node a plain text string?
Use text to access the String contents of the node.
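The three predicates pair naturally with their accessors. As a rough sketch, the hypothetical describe helper below (not part of Lexbor.jl) dispatches on node kind and falls back to a generic label for anything else, such as whatever node type Node(doc) itself turns out to be. The sample HTML is an assumption made for illustration.

```julia
import AbstractTrees
import Lexbor

# Hypothetical helper: summarise a node using each predicate/accessor pair.
function describe(node)
    if Lexbor.is_element(node)
        attrs = Lexbor.attributes(node)
        # Guard against `nothing` in case the node has no attribute table.
        count = attrs === nothing ? 0 : length(attrs)
        return "element $(Lexbor.tag(node)) with $(count) attribute(s)"
    elseif Lexbor.is_text(node)
        return "text $(repr(Lexbor.text(node)))"
    elseif Lexbor.is_comment(node)
        return "comment $(repr(Lexbor.comment(node)))"
    else
        return "other node"
    end
end

doc = Lexbor.Document("<div class='callout'><!-- note --><a href='#'>Link</a></div>")
for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
    println(describe(node))
end
```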
Lexbor.query — Function
query(document | node, selector; first = false, root = false) -> Node[]
query(f, document | node, selector; first = false, root = false) -> nothing
Query the document or node for the given CSS selector. When f is provided then call f on each match that is found and return nothing from query. When no f is provided then just return a Vector{Node} containing all matches.
The first::Bool keyword controls whether to only match the first of a selector list. To quote the upstream documentation:
Stop searching after the first match with any of the selectors in the list.
By default, the callback will be triggered for each selector list. That is, if your node matches different selector lists, it will be returned multiple times in the callback.
For example:
HTML: <div id="ok"><span>test</span></div>
Selectors: div, div[id="ok"], div:has(:not(a))
The default behavior will cause three callbacks with the same node (div). Because it will be found by every selector in the list.
This option allows you to end the element check after the first match on any of the selectors. That is, the callback will be called only once for example above. This way we get rid of duplicates in the search.
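The quoted example can be tried directly from Julia. This is a sketch under the assumption that query forwards the selector list to lexbor unchanged; the variable names are illustrative, and no specific counts are claimed beyond what the upstream text describes.

```julia
import Lexbor

doc = Lexbor.Document("<div id=\"ok\"><span>test</span></div>")
selectors = "div, div[id=\"ok\"], div:has(:not(a))"

# Default: a node may be reported once per selector in the list that matches it.
all_hits = Lexbor.query(doc, selectors)

# With `first = true` the check for a node stops at its first matching selector.
deduplicated = Lexbor.query(doc, selectors; first = true)

@show length(all_hits) length(deduplicated)
```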
The root::Bool keyword controls whether to include the root node in the search. To quote the upstream documentation:
Includes the passed (root) node in the search.
By default, the root node does not participate in selector searches, only its children.
This behavior is logical, if you have found a node and then you want to search for other nodes in it, you don't need to check it again.
But there are cases when it is necessary for root node to participate in the search. That's what this option is for.
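A small sketch of the difference, assuming nested div elements made up for illustration: the search starts from an already-found node, first without and then with root = true.

```julia
import Lexbor

doc = Lexbor.Document("<div class='outer'><div class='inner'></div></div>")
outer = first(Lexbor.query(doc, "div.outer"))

# Default: only nodes below `outer` are tested against the selector.
without_root = Lexbor.query(outer, "div")

# With `root = true`, `outer` itself also takes part in the search.
with_root = Lexbor.query(outer, "div"; root = true)

@show length(without_root) length(with_root)
```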
Lexbor.tag — Method
tag(node) -> Symbol | Nothing
Return the element tag name, or nothing when it is not an element.
Lexbor.text — Method
text(node) -> String | Nothing
Return the text content of a node, or nothing when the node is not a valid text node.