Lexbor.jl

This package provides a Julia interface to the lexbor HTML parsing library. Lexbor.jl integrates with AbstractTrees.jl to provide an interface for traversing the HTML tree.

Currently the only exposed parts of the library are HTML parsing and DOM querying.

Usage

The package exports its public interface, but qualified identifiers are preferred over bringing the names into scope with using.

julia> import Lexbor

Parsing HTML

Create a new DOM object using the Document constructor.

julia> doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")
Document(source = nothing)

Or you can parse a file with Base.open:

julia> doc = open(Lexbor.Document, "file.html")
Document(source = "file.html")
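
To try the file-based method without an existing file, the following sketch writes a small page to a temporary directory and parses it. The file name page.html and its contents are made up for illustration; only Document, open, and query come from the package.

```julia
import Lexbor

# Hypothetical page contents, written to a throwaway directory.
html = "<div class='callout'><a href='#'>Link</a></div>"
path = joinpath(mktempdir(), "page.html")
write(path, html)

# Parse the file through the documented `open(Document, path)` method.
doc = open(Lexbor.Document, path)

# The parsed document can be queried like one built from a string.
links = Lexbor.query(doc, "a")
```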

Querying documents

Use query to search for nodes within the document that match the provided CSS selector.

julia> links = Lexbor.query(doc, "a")
1-element Vector{Node}:
 <a>

julia> callouts = Lexbor.query(doc, "div.callout")
1-element Vector{Node}:
 <div>

query also accepts a function as its first argument, which is called on each matching Node as it is found. When the results don't need to be stored, this form avoids allocating a vector and then iterating over it a second time.

julia> Lexbor.query(doc, "a") do link
           @show Lexbor.attributes(link)
       end
Lexbor.attributes(link) = Dict{String, Union{Nothing, String}}("href" => "#")
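
As a further illustration of the callback form, the sketch below gathers only the href values of matching links rather than the nodes themselves. The variable names are arbitrary, and the nothing checks are a defensive assumption based on the documented return types of attributes.

```julia
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")

# Collect just the `href` strings; no intermediate Vector{Node} is built.
hrefs = String[]
Lexbor.query(doc, "a") do link
    attrs = Lexbor.attributes(link)
    # `attributes` is documented to return `nothing` for non-element nodes,
    # and attribute values may themselves be `nothing`.
    if attrs !== nothing
        href = get(attrs, "href", nothing)
        href === nothing || push!(hrefs, href)
    end
    nothing
end
```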

Iteration

You can use any AbstractTrees iterators to traverse the document contents.

julia> import AbstractTrees
julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if Lexbor.is_text(node)
               @show Lexbor.text(node)
           end
       end
Lexbor.text(node) = "\n "
Lexbor.text(node) = "Link"
Lexbor.text(node) = "\n"
Lexbor.text(node) = "\n"

This uses is_text to check whether the current node is a plain text Node and then displays its text content using text. Note that newlines and other whitespace are preserved by lexbor's parsing.

Other predicates and accessors available are is_element (with tag and attributes) and is_comment (with comment).
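
A minimal sketch of how the predicates and their accessors might be combined in a single traversal. The input HTML and the three collection vectors are invented for this example; the exact set of nodes visited depends on how lexbor completes the document around the fragment.

```julia
import Lexbor
import AbstractTrees

# Hypothetical input containing an element, a text node, and a comment.
doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a><!-- note --></div>")

tags = Symbol[]
texts = String[]
comments = String[]

for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
    if Lexbor.is_element(node)
        push!(tags, Lexbor.tag(node))          # element name as a Symbol
    elseif Lexbor.is_text(node)
        push!(texts, Lexbor.text(node))        # plain text content
    elseif Lexbor.is_comment(node)
        push!(comments, Lexbor.comment(node))  # contents of <!-- ... -->
    end
end
```

Checking the predicate before calling the accessor avoids having to handle the nothing that each accessor returns for nodes of the wrong kind.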

Matching Nodes

Matcher allows for testing a Node to determine whether it matches the given CSS selector.

julia> matcher = Lexbor.Matcher("div.callout")
Matcher("div.callout")

julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if matcher(node)
               @show node
           end
       end
node = <div>

As with query, you can pass a function as the first argument when calling a Matcher object. The function is called when the Node matches, and the call returns nothing instead of true or false.

julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           matcher(node) do matched
               @show matched
           end
       end
matched = <div>

API

Lexbor.Document - Type
Document(html_str)
open(Document, html_file_path)

Parse HTML into an in-memory tree representing the DOM. To parse an HTML file use Base.open.

Lexbor.Matcher - Type
Matcher(selector; first = false)

Create a new Matcher object that can be used to test whether a Node matches the selector or not. Matcher objects are callable and can be used as follows:

function find_first_node(root, selector)
    # Create the `Matcher` object once, then reuse in the loop.
    matcher = Matcher(selector)
    for node in AbstractTrees.PreOrderDFS(root)
        if matcher(node)
            return node
        end
    end
    return nothing
end

The matcher object has two callable methods. The first, shown above, returns true or false depending on whether the selector matches. The other takes a function (::Node -> Nothing) as its first argument, which is called when the selector matches.

The keyword argument first::Bool has the same behaviour as the first keyword provided by query. See that function's documentation for details.

Lexbor.Node - Type
Node(document::Document)

An iterable object representing a particular node within an HTML Document.

Lexbor.Tree - Type
Tree(document)
Tree(node)

A display type to help visualize DOM structure.
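
A short sketch of both constructors, assuming the wrapper is shown through Julia's usual display machinery. The rendered output is not reproduced here because its exact format is not specified above.

```julia
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")

# Wrap the whole document for display.
tree = Lexbor.Tree(doc)
display(tree)

# Wrap a single node to visualize only that subtree.
callout = first(Lexbor.query(doc, "div.callout"))
subtree = Lexbor.Tree(callout)
display(subtree)
```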

Lexbor.attributes - Method
attributes(node::Node) -> Dict{String,Union{String,Nothing}} | Nothing

Return a Dict of all the attributes of a node, or nothing when it is not a valid element node.

Lexbor.comment - Method
comment(node) -> String | Nothing

Return the comment content of a node, or nothing when the node is not a valid comment node.

Lexbor.is_comment - Method
is_comment(node::Node) -> Bool

Is node a comment Node? E.g. <!-- ... --> syntax.

Use comment to access the String contents of the comment.

Lexbor.is_element - Method
is_element(node::Node) -> Bool

Is the node an HTML element Node? E.g. <a>, <div>, etc.

Use tag to access the name of the element as a Symbol and use attributes to access the element attributes.

Lexbor.is_text - Method
is_text(node::Node) -> Bool

Is the node a plain text string?

Use text to access the String contents of the node.

Lexbor.query - Function
query(document | node, selector; first = false, root = false) -> Node[]
query(f, document | node, selector; first = false, root = false) -> nothing

Query the document or node for the given CSS selector. When f is provided, f is called on each match that is found and query returns nothing. When no f is provided, query returns a Vector{Node} containing all matches.

The first::Bool keyword controls whether to only match the first of a selector list. To quote the upstream documentation:

Stop searching after the first match with any of the selectors in the list.

By default, the callback will be triggered for each selector list. That is, if your node matches different selector lists, it will be returned multiple times in the callback.

For example:

HTML: <div id="ok"><span>test</span></div>
Selectors: div, div[id="ok"], div:has(:not(a))

The default behavior will cause three callbacks with the same node (div). Because it will be found by every selector in the list.

This option allows you to end the element check after the first match on any of the selectors. That is, the callback will be called only once for example above. This way we get rid of duplicates in the search.
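
The quoted example can be tried with a small counting helper. This is a sketch: count_matches is a hypothetical function written for this illustration, and the counts are deliberately not stated here since they depend on the behaviour described in the upstream quote.

```julia
import Lexbor

# HTML and selector list taken from the upstream example quoted above.
doc = Lexbor.Document("<div id=\"ok\"><span>test</span></div>")
selectors = "div, div[id=\"ok\"], div:has(:not(a))"

# Hypothetical helper: count how often the callback fires for a query.
function count_matches(doc, selectors; first = false)
    n = 0
    Lexbor.query(doc, selectors; first = first) do node
        n += 1
        nothing
    end
    return n
end

n_default = count_matches(doc, selectors)             # one call per matching selector list
n_first = count_matches(doc, selectors; first = true) # duplicates suppressed
```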

The root::Bool keyword controls whether to include the root node in the search. To quote the upstream documentation:

Includes the passed (root) node in the search.

By default, the root node does not participate in selector searches, only its children.

This behavior is logical, if you have found a node and then you want to search for other nodes in it, you don't need to check it again.

But there are cases when it is necessary for root node to participate in the search. That's what this option is for.
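
A minimal sketch of the difference, assuming a node-rooted query behaves as the upstream text describes. The search starts from a div that itself matches the selector, so the two results differ only in whether that starting node is included.

```julia
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")
callout = first(Lexbor.query(doc, "div.callout"))

# Default: only the descendants of `callout` take part in the search.
without_root = Lexbor.query(callout, "div.callout")

# With `root = true` the starting node itself is also tested.
with_root = Lexbor.query(callout, "div.callout"; root = true)
```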

Lexbor.tag - Method
tag(node) -> Symbol | Nothing

Return the element tag name, or nothing when it is not an element.

Lexbor.text - Method
text(node) -> String | Nothing

Return the text content of a node, or nothing when the node is not a valid text node.
