Lexbor.jl

This package provides a Julia interface to the lexbor HTML parsing library. Lexbor.jl integrates with AbstractTrees.jl to provide an interface for traversing the HTML tree.

Currently the only exposed parts of the library are HTML parsing and DOM querying.

Usage

The package exports its public interface, but prefer import with qualified identifiers over using.

julia> import Lexbor

Parsing HTML

Create a new DOM object using the Document constructor.

julia> doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")
Document(source = nothing)

Or you can parse a file with Base.open:

julia> doc = open(Lexbor.Document, "file.html")
Document(source = "file.html")

Querying documents

Use query to search for nodes within the document that match the provided CSS selector.

julia> links = Lexbor.query(doc, "a")
1-element Vector{Node}:
 <a>

julia> callouts = Lexbor.query(doc, "div.callout")
1-element Vector{Node}:
 <div>

query also accepts a function as its first argument, which is called on each matching Node. This form avoids allocating a vector, and iterating over the results a second time, when the matches do not need to be stored.

julia> Lexbor.query(doc, "a") do link
           @show Lexbor.attributes(link)
       end
Lexbor.attributes(link) = Dict{String, Union{Nothing, String}}("href" => "#")
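
As a minimal sketch of the callback form, the following collects the href of every link without building an intermediate vector of nodes. It uses only query and attributes as described in this document; the hrefs variable is an illustrative name, not part of the package.

```julia
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")

# Illustrative accumulator for the attribute values we care about.
hrefs = String[]

Lexbor.query(doc, "a") do link
    attrs = Lexbor.attributes(link)
    # `attributes` is documented to return `nothing` for non-element nodes,
    # and an attribute value may itself be `nothing`, so guard both cases.
    if attrs !== nothing
        href = get(attrs, "href", nothing)
        href === nothing || push!(hrefs, href)
    end
    return nothing
end
```

The callback mutates hrefs in place, so nothing beyond the strings themselves is stored.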

Iteration

You can use any AbstractTrees iterators to traverse the document contents.

julia> import AbstractTrees
julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if Lexbor.is_text(node)
               @show Lexbor.text(node)
           end
       end
Lexbor.text(node) = "\n    "
Lexbor.text(node) = "Link"
Lexbor.text(node) = "\n"
Lexbor.text(node) = "\n"

This uses is_text to check whether the current node is a plain text Node and then displays its text content using text. Note that newlines and other whitespace are preserved by lexbor's parsing.

Other predicates and accessors available are:

is_element
is_comment
tag
comment
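
The predicates and accessors listed above can be combined in a single traversal. The sketch below is one way to do that; the sample HTML (including the comment) and the printed labels are made up for illustration.

```julia
import AbstractTrees
import Lexbor

# Illustrative input containing an element, a comment, and a text node.
doc = Lexbor.Document("<div class='callout'><!-- note --><a href='#'>Link</a></div>")

for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
    if Lexbor.is_element(node)
        # `tag` returns the element name as a Symbol.
        println("element: ", Lexbor.tag(node))
    elseif Lexbor.is_comment(node)
        println("comment: ", repr(Lexbor.comment(node)))
    elseif Lexbor.is_text(node)
        # `repr` makes preserved whitespace visible.
        println("text:    ", repr(Lexbor.text(node)))
    end
end
```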

Matching `Node`s

Matcher allows for testing a Node to determine whether it matches the given CSS selector.

julia> matcher = Lexbor.Matcher("div.callout")
Matcher("div.callout")

julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           if matcher(node)
               @show node
           end
       end
node = <div>

As with query, you can pass a function as the first argument when calling a Matcher object. The function is called when the Node matches, and the call returns nothing instead of true or false.

julia> for node in AbstractTrees.PreOrderDFS(Lexbor.Node(doc))
           matcher(node) do matched
               @show matched
           end
       end
matched = <div>

API

Lexbor.Document — Type

Document(html_str)
open(Document, html_file_path)

Parse HTML into an in-memory tree representing the DOM. To parse an HTML file use Base.open.

Lexbor.Matcher — Type

Matcher(selector; first = false)

Create a new Matcher object that can be used to test whether a Node matches the selector or not. Matcher objects are callable and can be used as follows:

function find_first_node(root, selector)
    # Create the `Matcher` object once, then reuse in the loop.
    matcher = Matcher(selector)
    for node in AbstractTrees.PreOrderDFS(root)
        if matcher(node)
            return node
        end
    end
    return nothing
end

The Matcher object is callable in two ways. The first, shown above, returns true or false depending on whether the selector matches. The second takes a function (Node -> Nothing) as its first argument; that function is called when the selector matches.

The keyword argument first::Bool has the same behaviour as the first keyword provided by query. See that function's documentation for details.
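
The callback form of a Matcher could be used as follows to gather every matching node during a traversal. This is a sketch: collect_matches is a hypothetical helper name, not something the package provides.

```julia
import AbstractTrees
import Lexbor

# Hypothetical helper: collect every node under `root` that matches `selector`.
function collect_matches(root, selector)
    # Create the `Matcher` once, then reuse it for each node.
    matcher = Lexbor.Matcher(selector)
    found = Lexbor.Node[]
    for node in AbstractTrees.PreOrderDFS(root)
        # The callback only runs when `node` matches the selector.
        matcher(node) do matched
            push!(found, matched)
            return nothing
        end
    end
    return found
end

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")
callouts = collect_matches(Lexbor.Node(doc), "div.callout")
```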

Lexbor.Node — Type

Node(document::Document)

An iterable object representing a particular node within an HTML Document.

Lexbor.Tree — Type

Tree(document)
Tree(node)

A display type to help visualize DOM structure.
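
A brief sketch of both constructors, assuming that display renders a Tree in a script the same way the REPL would. The output is not reproduced here because its exact layout is not described in this document.

```julia
import Lexbor

doc = Lexbor.Document("<div class='callout'><a href='#'>Link</a></div>")

# Visualize the whole document.
tree = Lexbor.Tree(doc)
display(tree)

# Visualize only the subtree rooted at a particular node.
callout = Lexbor.query(doc, "div.callout")[1]
display(Lexbor.Tree(callout))
```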

Lexbor.attributes — Method

attributes(node::Node) -> Dict{String,Union{String,Nothing}} | Nothing

Return a Dict of all the attributes of a node, or nothing when it is not a valid element node.

Lexbor.comment — Method

comment(node) -> String | Nothing

Return the comment content of a node, or nothing when the node is not a valid comment node.

Lexbor.is_comment — Method

is_comment(node::Node) -> Bool

Is the node a comment Node? E.g. <!-- ... --> syntax.

Use comment to access the String contents of the comment.

Lexbor.is_element — Method

is_element(node::Node) -> Bool

Is the node an HTML element Node? E.g. <a>, <div>, etc.

Use tag to access the name of the element as a Symbol and use attributes to access the element attributes.

Lexbor.is_text — Method

is_text(node::Node) -> Bool

Is the node a plain text string?

Use text to access the String contents of the node.

Lexbor.query — Function

query(document | node, selector; first = false, root = false) -> Node[]
query(f, document | node, selector; first = false, root = false) -> nothing

Query the document or node for the given CSS selector. When f is provided then call f on each match that is found and return nothing from query. When no f is provided then just return a Vector{Node} containing all matches.

The first::Bool keyword controls whether to only match the first of a selector list. To quote the upstream documentation:

Stop searching after the first match with any of the selectors in the list.
By default, the callback will be triggered for each selector list. That is, if your node matches different selector lists, it will be returned multiple times in the callback.
For example:
HTML: <div id="ok"><span>test</span></div>
Selectors: div, div[id="ok"], div:has(:not(a))
The default behavior will cause three callbacks with the same node (div). Because it will be found by every selector in the list.
This option allows you to end the element check after the first match on any of the selectors. That is, the callback will be called only once for example above. This way we get rid of duplicates in the search.

The root::Bool keyword controls whether to include the root node in the search. To quote the upstream documentation:

Includes the passed (root) node in the search.
By default, the root node does not participate in selector searches, only its children.
This behavior is logical, if you have found a node and then you want to search for other nodes in it, you don't need to check it again.
But there are cases when it is necessary for root node to participate in the search. That's what this option is for.
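
The effect of both keywords can be explored with a sketch like the one below, which reuses the HTML and selector list from the quoted upstream example. count_matches is a hypothetical helper. Based on the quoted descriptions, one would expect fewer callbacks with first = true and an additional match with root = true, but the counts are deliberately not asserted here.

```julia
import Lexbor

doc = Lexbor.Document("<div id=\"ok\"><span>test</span></div>")
selectors = "div, div[id=\"ok\"], div:has(:not(a))"

# Hypothetical helper: count how often the callback fires for a `first` setting.
function count_matches(target, selector; first)
    n = Ref(0)
    Lexbor.query(target, selector; first = first) do node
        n[] += 1
        return nothing
    end
    return n[]
end

n_all = count_matches(doc, selectors; first = false)
n_first = count_matches(doc, selectors; first = true)
println("callbacks with first = false: ", n_all)
println("callbacks with first = true:  ", n_first)

# `root` decides whether the node passed to `query` can itself be a match.
outer = Lexbor.query(doc, "div")[1]
without_root = Lexbor.query(outer, "div")
with_root = Lexbor.query(outer, "div"; root = true)
println("matches without root: ", length(without_root))
println("matches with root:    ", length(with_root))
```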

Lexbor.tag — Method

tag(node) -> Symbol | Nothing

Return the element tag name, or nothing when it is not an element.

Lexbor.text — Method

text(node) -> String | Nothing

Return the text content of a node, or nothing when the node is not a valid text node.
