Lexbor.jl
This package provides a Julia interface to the lexbor HTML parsing library. Lexbor.jl integrates with AbstractTrees.jl to provide an interface for traversing the HTML tree.
Currently the only exposed parts of the library are HTML parsing and DOM querying.
Usage
The package exports its public interface, but prefer qualified identifiers to a blanket using Lexbor.
julia> import Lexbor
Parsing HTML
Create a new DOM object using the Document constructor.
julia> doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")
Document(source = nothing)
Or you can parse a file with Base.open:
julia> doc = open(Lexbor.Document, "file.html")
Document(source = "file.html")
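As a minimal sketch of the file-based path, the following writes a small HTML snippet to a temporary file and parses it with open. The temporary file and its contents are illustrative assumptions; only Document, open(Document, path) and query come from this package's documented interface.

```julia
import Lexbor

# Write a throwaway HTML file (illustrative content, not part of the package).
path, io = mktemp()
write(io, "<p>Hello</p>")
close(io)

# Parse the file from disk, then query it like any other Document.
doc = open(Lexbor.Document, path)
paragraphs = Lexbor.query(doc, "p")
@show length(paragraphs)
```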
Querying documents
Use query to search for nodes within the document that match the provided CSS selector.
julia> links = Lexbor.query(doc, "a")
1-element Vector{Node}: <a>
julia> callouts = Lexbor.query(doc, "div.callout")
1-element Vector{Node}: <div>
query also accepts a function as its first argument, which is called on each matching Node as it is found. When the results do not need to be stored, this form avoids allocating a vector and then iterating over it a second time.
julia> Lexbor.query(doc, "a") do link
           @show Lexbor.attributes(link)
       end
Lexbor.attributes(link) = Dict{String, Union{Nothing, String}}("href" => "#")
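The callback form can also feed a collection of your own. The sketch below gathers href values from matching links; the hrefs vector is a hypothetical name, and the nothing checks follow the documented return type of attributes, where both the Dict itself and its values may be nothing.

```julia
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")

# Hypothetical accumulator for the link targets we find.
hrefs = String[]

Lexbor.query(doc, "a") do link
    attrs = Lexbor.attributes(link)
    # `attributes` is documented to return `nothing` for non-element nodes.
    attrs === nothing && return nothing
    # Attribute values may themselves be `nothing`, so check before storing.
    href = get(attrs, "href", nothing)
    href === nothing || push!(hrefs, href)
    return nothing
end

@show hrefs
```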
Iteration
You can use any AbstractTrees iterators to traverse the document contents.
julia> import AbstractTrees
julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if Lexbor.is_text(node)
               @show Lexbor.text(node)
           end
       end
Lexbor.text(node) = "\n "
Lexbor.text(node) = "Link"
Lexbor.text(node) = "\n"
Lexbor.text(node) = "\n"
This uses is_text to check whether the current node is a plain text Node and then displays its text content using text. Note that newlines and other whitespace are preserved by lexbor's parsing.
Other predicates and accessors available are is_element (with tag and attributes) and is_comment (with comment).
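As a rough sketch of how the predicates and accessors pair up, the helper below labels each node by kind during a traversal. The describe function and the sample HTML are hypothetical; the predicates and accessors are the ones listed in the API section.

```julia
import AbstractTrees
import Lexbor

doc = Lexbor.Document("<div class='callout'><!-- note --><a href='#'>Link</a></div>")

# `describe` is a hypothetical helper, not part of Lexbor.jl.
function describe(node)
    if Lexbor.is_element(node)
        return "element: $(Lexbor.tag(node))"
    elseif Lexbor.is_text(node)
        return "text: $(repr(Lexbor.text(node)))"
    elseif Lexbor.is_comment(node)
        return "comment: $(repr(Lexbor.comment(node)))"
    else
        # Anything that is not an element, text, or comment node.
        return "other"
    end
end

for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
    println(describe(node))
end
```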
Matching Nodes
Matcher allows for testing a Node to determine whether it matches the given CSS selector.
julia> matcher = Lexbor.Matcher("div.callout")
Matcher("div.callout")
julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if matcher(node)
               @show node
           end
       end
node = <div>
As with query, you can pass a function as the first argument to a Matcher object. In that case the function is called when the Node matches, and the call returns nothing instead of true/false.
julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           matcher(node) do matched
               @show matched
           end
       end
matched = <div>
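Because a Matcher is built once and reused, several selectors can be tested in a single traversal. The sketch below counts matches per selector; the counts dictionary and the choice of selectors are illustrative assumptions.

```julia
import AbstractTrees
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")

# Hypothetical tally of matches, keyed by selector.
counts = Dict("div.callout" => 0, "a" => 0)

# Build each Matcher once, outside the traversal loop.
matchers = [selector => Lexbor.Matcher(selector) for selector in keys(counts)]

for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
    for (selector, matcher) in matchers
        # The boolean form returns true or false for each node.
        if matcher(node)
            counts[selector] += 1
        end
    end
end

@show counts
```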
API
Lexbor.Document — Type
Document(html_str)
open(Document, html_file_path)
Parse HTML into an in-memory tree representing the DOM. To parse an HTML file use Base.open.
Lexbor.Matcher — Type
Matcher(selector; first = false)
Create a new Matcher object that can be used to test whether a Node matches the selector or not. Matcher objects are callable and can be used as follows:
function find_first_node(root, selector)
    # Create the `Matcher` object once, then reuse in the loop.
    matcher = Matcher(selector)
    for node in AbstractTrees.PreOrderDFS(root)
        if matcher(node)
            return node
        end
    end
    return nothing
end
The matcher object has two callable methods. The first, shown above, returns true or false depending on whether the selector matches. The other takes a function ::Node -> Nothing as its first argument, which is called when the selector matches.
The keyword argument first::Bool has the same behaviour as the first keyword provided by query. See that function's documentation for details.
Lexbor.Node — Type
Node(document::Document)
An iterable object representing a particular node within an HTML Document.
Lexbor.Tree — Type
Tree(document)
Tree(node)
A display type to help visualize DOM structure.
Lexbor.attributes — Method
attributes(node::Node) -> Dict{String,Union{String,Nothing}} | Nothing
Return a Dict of all the attributes of a node, or nothing when it is not a valid element node.
Lexbor.comment — Method
comment(node) -> String | Nothing
Return the comment content of a node, or nothing when the node is not a valid comment node.
Lexbor.is_comment — Method
is_comment(node::Node) -> Bool
Is node a comment Node? E.g. <!-- ... --> syntax.
Use comment to access the String contents of the comment.
Lexbor.is_element — Method
is_element(node::Node) -> Bool
Is the node an HTML element Node? E.g. <a>, <div>, etc.
Use tag to access the name of the element as a Symbol and use attributes to access the element attributes.
Lexbor.is_text — Method
is_text(node::Node) -> Bool
Is the node a plain text string?
Use text to access the String contents of the node.
Lexbor.query — Function
query(document | node, selector; first = false, root = false) -> Node[]
query(f, document | node, selector; first = false, root = false) -> nothing
Query the document or node for the given CSS selector. When f is provided, call f on each match that is found and return nothing from query. When no f is provided, return a Vector{Node} containing all matches.
The first::Bool keyword controls whether matching stops at the first selector in a selector list that matches a node. To quote the upstream documentation:
Stop searching after the first match with any of the selectors in the list.
By default, the callback will be triggered for each selector list. That is, if your node matches different selector lists, it will be returned multiple times in the callback.
For example:
HTML: <div id="ok"><span>test</span></div>
Selectors: div, div[id="ok"], div:has(:not(a))
The default behavior will cause three callbacks with the same node (div). Because it will be found by every selector in the list.
This option allows you to end the element check after the first match on any of the selectors. That is, the callback will be called only once for example above. This way we get rid of duplicates in the search.
The root::Bool keyword controls whether to include the root node in the search. To quote the upstream documentation:
Includes the passed (root) node in the search.
By default, the root node does not participate in selector searches, only its children.
This behavior is logical, if you have found a node and then you want to search for other nodes in it, you don't need to check it again.
But there are cases when it is necessary for root node to participate in the search. That's what this option is for.
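The two keywords can be compared side by side. The sketch below reuses the HTML and selector list from the upstream example quoted above and only prints result counts, since the exact numbers depend on how the wrapper reports duplicate matches. The variable names are illustrative.

```julia
import Lexbor

doc = Lexbor.Document("<div id=\"ok\"><span>test</span></div>")
selectors = "div, div[id=\"ok\"], div:has(:not(a))"

# Default: a node may be reported once per matching selector in the list.
every = Lexbor.query(doc, selectors)
# first = true: stop checking a node after its first matching selector.
once = Lexbor.query(doc, selectors; first = true)
@show length(every) length(once)

# Take the <div> node, then search inside it with and without the root.
div = Base.first(Lexbor.query(doc, "div"))
inside = Lexbor.query(div, "div")
with_root = Lexbor.query(div, "div"; root = true)
@show length(inside) length(with_root)
```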
Lexbor.tag — Method
tag(node) -> Symbol | Nothing
Return the element tag name, or nothing when it is not an element.
Lexbor.text — Method
text(node) -> String | Nothing
Return the text content of a node, or nothing when the node is not a valid text node.