How to Parse HTML from the Command Line

Hey there! Ever wondered how to grab specific parts of a website's HTML right from your command line? Here's a quick guide to help you do that using some popular command-line tools. Let's jump in!

1. Basic Parsing with grep and sed

If you just need to pull out simple text like a title or keyword, grep and sed are your friends. Here’s an example of how you might grab the page title from a website.

curl -s http://example.com | grep "<title>" | sed 's/<[^>]*>//g'

This command uses:

  • curl to get the HTML from a site.
  • grep to find the line that contains the <title> tag.
  • sed to strip every tag on that line, leaving only the text.
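One catch with the pipeline above: grep prints the whole matching line, so if the page keeps its <title> on the same line as other tags, their text ends up in the output too. Here's a small sketch of a tighter version. The sample HTML is made up and inlined so it runs offline; for a real page you'd pipe in curl -s instead.

```shell
# Made-up sample page, inlined so the example runs without a network call.
html='<html><head><title>Example Domain</title></head><body><h1>Hello</h1></body></html>'

# Original approach: grep keeps the whole line, so sed strips every tag on it
# and the <h1> text gets glued onto the title.
loose=$(printf '%s\n' "$html" | grep "<title>" | sed 's/<[^>]*>//g')

# Tighter approach: grep -o keeps only the <title>...</title> match itself.
title=$(printf '%s\n' "$html" | grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g')

echo "loose: $loose"
echo "title: $title"
```

Even the tighter version assumes the title sits on a single line. If it wraps across lines, grep won't see a full match, and that's a good sign it's time for one of the real parsers below.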

2. Parsing with xmllint (Using XPath Queries)

Want something a bit more powerful? xmllint lets you query a page with XPath. The --html flag switches it to a forgiving HTML parser, so the page doesn't have to be well-formed XML, though messy markup will make it print warnings. Here’s how:

curl -s http://example.com | xmllint --html --xpath "//title/text()" -

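This line pulls out the title text using the XPath query //title/text(). XPath is very expressive, but on messy real-world pages expect some parser warnings, and don't count on the parser repairing broken markup the same way a browser would.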
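Here's a rough sketch of the same idea that you can run offline. The sample HTML is made up (note the unclosed <p>), and the check for xmllint is only there so the snippet doesn't blow up on machines that don't have it installed.

```shell
# Made-up sample with an unclosed <p>, inlined so the example runs offline.
html='<html><head><title>Example Domain</title></head><body><p>Hello</body></html>'

if command -v xmllint >/dev/null 2>&1; then
  # --html selects the lenient HTML parser; 2>/dev/null hides its warnings.
  # string(...) returns the text of the first match as a plain string.
  title=$(printf '%s\n' "$html" | xmllint --html --xpath 'string(//title)' - 2>/dev/null)
  echo "$title"
else
  echo "xmllint not found" >&2
fi
```

Hiding stderr keeps the output clean for scripts, but drop the 2>/dev/null while you're debugging, since that's where xmllint tells you what it didn't like about the page.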

3. Querying HTML with CSS Selectors Using pup

If you're familiar with CSS and want a tool that understands it, pup is a great choice. It lets you pick HTML elements using CSS-like selectors. Let’s try it:

curl -s http://example.com | pup 'title text{}'

Here, the title selector matches the <title> element, and the text{} display function prints only the text inside it. The selectors work like the ones in CSS, so if you know CSS, you already know most of pup.
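Titles are just the start. Here's a quick sketch of pulling link text and link targets out of a list, using pup's text{} and attr{} display functions. The sample HTML is made up, and the snippet skips itself if pup isn't installed.

```shell
# Made-up sample markup, inlined so the example runs offline.
html='<ul><li><a href="/docs">Docs</a></li><li><a href="/blog">Blog</a></li></ul>'

if command -v pup >/dev/null 2>&1; then
  # text{} prints the text inside each match; attr{href} prints that attribute.
  labels=$(printf '%s\n' "$html" | pup 'li a text{}')
  links=$(printf '%s\n' "$html" | pup 'li a attr{href}')
  echo "$labels"
  echo "$links"
else
  echo "pup not found" >&2
fi
```

The selector li a is plain CSS descendant syntax, so anything you'd write in a stylesheet to target those links should feel familiar here.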

4. Using htmlq (Also CSS Selectors)

htmlq works a lot like pup but is inspired by another tool called jq (which works with JSON data). If you like CSS selectors, try this:

curl -s http://example.com | htmlq --text 'title'

Just tell htmlq which element you want, like title. By default it prints the whole matching element, tags included; the --text flag trims that down to just the text inside.
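To see what htmlq actually prints, here's a small sketch comparing its default output with the --text and --attribute flags. The sample HTML is made up, and the snippet skips itself if htmlq isn't installed.

```shell
# Made-up sample page, inlined so the example runs offline.
html='<html><head><title>Example Domain</title></head><body><a href="/more">More</a></body></html>'

if command -v htmlq >/dev/null 2>&1; then
  # No flags: htmlq prints the matching element, tags and all.
  element=$(printf '%s\n' "$html" | htmlq 'title')
  # --text: only the text inside the match.
  text=$(printf '%s\n' "$html" | htmlq --text 'title')
  # --attribute NAME: only that attribute's value.
  href=$(printf '%s\n' "$html" | htmlq --attribute href 'a')
  echo "$element"
  echo "$text"
  echo "$href"
else
  echo "htmlq not found" >&2
fi
```

The default output is handy when you want to pass a chunk of HTML along to another tool, while --text and --attribute are what you'll usually want at the end of a pipeline.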

5. Advanced Parsing with JSON + HTML Using jq

Sometimes an API hands you HTML tucked inside JSON (a data format many web APIs use). You can use jq to pull out the HTML part and then pipe it to a tool like htmlq. Example:

curl -s http://api.example.com/data | jq -r '.html_content' | htmlq 'p'

This pulls HTML content from JSON data, then finds the paragraph (p) tags in it.
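The -r flag in that command matters more than it looks. Here's a sketch that shows why, using a made-up API response (the .html_content field name comes from the command above; everything else is invented for the example). Both tools are checked for first so the snippet degrades quietly.

```shell
# Made-up API response; note the escaped quotes inside the JSON string.
json='{"html_content": "<p class=\"intro\">First</p><p>Second</p>"}'

if command -v jq >/dev/null 2>&1; then
  # Without -r, jq prints a JSON string: wrapped in quotes, inner quotes escaped.
  quoted=$(printf '%s\n' "$json" | jq '.html_content')
  # With -r, jq prints the raw string, which is valid HTML again.
  fragment=$(printf '%s\n' "$json" | jq -r '.html_content')
  echo "$quoted"
  echo "$fragment"

  # Hand the raw fragment to htmlq only if it is installed.
  if command -v htmlq >/dev/null 2>&1; then
    printf '%s\n' "$fragment" | htmlq --text 'p'
  fi
else
  echo "jq not found" >&2
fi
```

If you forget -r, the HTML parser receives the escaped quotes and backslashes as literal characters, and your selectors can quietly stop matching.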

Which One Should You Use?

If you’re just getting started, try pup or htmlq since they’re easy to learn with CSS selectors. Reach for xmllint when you need XPath, and combine jq with htmlq when the HTML arrives wrapped in JSON. Keep grep and sed for quick one-off checks on pages whose layout you already know.

Final Thoughts

Parsing HTML from the command line is a fun way to learn both HTML and command-line magic. Now you can pick the right tool for the job and start exploring web pages right from your terminal. Happy coding!
