How to Parse HTML from the Command Line
Hey there! Ever wondered how to grab specific parts of a website's HTML right from your command line? Here's a quick guide to help you do that using some popular command-line tools. Let's jump in!
1. Basic Parsing with grep and sed
If you just need to pull out simple text like a title or keyword, grep and sed are your friends. Here’s an example of how you might grab the page title from a website.
curl -s http://example.com | grep "<title>" | sed 's/<[^>]*>//g'
This command uses:
- curl to get the HTML from a site.
- grep to find the line with <title> tags.
- sed to strip every HTML tag from that line and show only the text. This only works when the <title> tag and its text sit on the same line.
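To see where the line-based approach runs out of steam, here is a small sketch that runs on a made-up sample page instead of a live site. The sample HTML and the variable names (html, naive, flat) are assumptions for illustration only. It shows the one-liner coming back empty when the title text wraps onto a second line, plus one possible workaround: flatten the newlines first, then isolate the <title> element before stripping tags.

```shell
# Made-up sample page whose <title> text starts on the next line.
html='<html><head><title>
Example Domain</title></head><body><h1>Hi</h1></body></html>'

# Line-based approach: grep only keeps the line holding "<title>",
# and that line has no text on it, so nothing is left after sed.
naive=$(printf '%s\n' "$html" | grep "<title>" | sed 's/<[^>]*>//g')

# Workaround: join all lines, cut out just the <title> element,
# strip the tags, then trim leading and trailing spaces.
flat=$(printf '%s\n' "$html" \
  | tr '\n' ' ' \
  | grep -o '<title>[^<]*</title>' \
  | sed 's/<[^>]*>//g; s/^ *//; s/ *$//')

echo "naive: [$naive]"
echo "flat:  [$flat]"
```

This is still pattern matching rather than real parsing, so it is best kept for quick checks on pages you already know the shape of.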
2. Parsing with xmllint (Using XPath Queries)
Want something a bit more powerful? xmllint can work like magic. Its --html flag switches to an HTML parser that tolerates markup that isn't well-formed XML, so everyday pages usually work. Here’s how:
curl -s http://example.com | xmllint --html --xpath "//title/text()" -
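This line uses the XPath query //title/text() to find the title text. It's super powerful, but on messy or very modern HTML the parser can print warnings to stderr; add 2>/dev/null to the xmllint command if you want to hide them.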
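XPath can reach attributes as well as text. The sketch below is only an illustration: the sample HTML and variable names are made up, and it checks that xmllint is installed before using it. It wraps the query in XPath's string() function, which returns the value of the first match as plain text.

```shell
# Made-up sample page for illustration.
html='<html><head><title>Sample Page</title></head><body><a href="/docs">Docs</a></body></html>'

if command -v xmllint >/dev/null 2>&1; then
  # string(...) returns the text value of the first matching node.
  title=$(printf '%s\n' "$html" | xmllint --html --xpath 'string(//title)' - 2>/dev/null)
  # @href selects the attribute instead of the element's text.
  link=$(printf '%s\n' "$html" | xmllint --html --xpath 'string(//a/@href)' - 2>/dev/null)
  echo "title: $title"
  echo "first link: $link"
else
  echo "xmllint is not installed" >&2
fi
```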
3. Querying HTML with CSS Selectors Using pup
If you're familiar with CSS and want a tool that understands it, pup is a great choice. It lets you pick HTML elements using CSS-like selectors. Let’s try it:
curl -s http://example.com | pup 'title text{}'
Here, pup uses the selector title to find the <title> element, and text{} is a pup display function that prints only the text inside it. It's like picking elements in CSS, so if you know CSS, this is a powerful way to work.
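pup has another display function, attr{...}, that prints an attribute instead of the text. Here is a small sketch of pulling every link target out of a page. The sample HTML and variable names are assumptions for illustration, and the script checks for pup first since it is not installed by default on most systems.

```shell
# Made-up sample page with two links.
html='<html><body><a href="/docs">Docs</a> <a href="/blog">Blog</a></body></html>'

if command -v pup >/dev/null 2>&1; then
  # Select every <a> element and print its href attribute, one per line.
  links=$(printf '%s\n' "$html" | pup 'a attr{href}')
  echo "$links"
else
  echo "pup is not installed" >&2
fi
```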
4. Using htmlq (Also CSS Selectors)
htmlq works a lot like pup but is inspired by another tool called jq (which works with JSON data). If you like CSS selectors, try this:
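curl -s http://example.com | htmlq --text 'title'
Just tell htmlq what tag you want, like title, and it prints the matching elements. The --text flag makes it show only the text inside; without it, you get the full <title>...</title> markup.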
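To make the difference between the output modes concrete, here is a quick sketch on a made-up sample page (the HTML and variable names are illustrative assumptions). It runs htmlq three ways: plain, with --text, and with --attribute to pull out link targets. It checks that htmlq is installed before running.

```shell
# Made-up sample page for illustration.
html='<html><head><title>Sample Page</title></head><body><a href="/docs">Docs</a></body></html>'

if command -v htmlq >/dev/null 2>&1; then
  # Default: prints the whole matching element, tags included.
  markup=$(printf '%s\n' "$html" | htmlq 'title')
  # --text: prints only the text inside the matching element.
  text=$(printf '%s\n' "$html" | htmlq --text 'title')
  # --attribute: prints the named attribute of each match.
  link=$(printf '%s\n' "$html" | htmlq --attribute href 'a')
  echo "markup: $markup"
  echo "text:   $text"
  echo "link:   $link"
else
  echo "htmlq is not installed" >&2
fi
```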
5. Advanced Parsing with JSON + HTML Using jq
Sometimes websites have HTML inside JSON (a special data format). You can use jq to pull out the HTML part and then pipe it to a tool like htmlq. Example:
curl -s http://api.example.com/data | jq -r '.html_content' | htmlq 'p'
This pulls HTML content from the html_content field of the JSON data (the -r flag makes jq print the raw string without JSON quotes), then finds the paragraph (p) tags in it.
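You can try the same pipeline without a real API by feeding it a JSON string directly. This is just a sketch: the JSON payload and variable names are made up, html_content is the field name from the example above, and the script only runs when both jq and htmlq are installed.

```shell
# Made-up API response with HTML stored in the html_content field.
json='{"html_content": "<p>First</p><p>Second</p>"}'

if command -v jq >/dev/null 2>&1 && command -v htmlq >/dev/null 2>&1; then
  # jq -r unwraps the JSON string; htmlq --text prints the text of each <p>.
  paragraphs=$(printf '%s\n' "$json" | jq -r '.html_content' | htmlq --text 'p')
  echo "$paragraphs"
else
  echo "this sketch needs both jq and htmlq" >&2
fi
```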
Which One Should You Use?
If you’re just getting started, try pup or htmlq since they’re easy to learn with CSS selectors. For more complicated jobs, xmllint or a combo of jq and htmlq will give you more power.
Final Thoughts
Parsing HTML from the command line is a fun way to learn both HTML and command-line magic. Now you can pick the right tool for the job and start exploring web pages right from your terminal. Happy coding!