How to Parse HTML from the Command Line

Hey there! Ever wondered how to grab specific parts of a website's HTML right from your command line? Here's a quick guide to help you do that using some popular command-line tools. Let's jump in!

1. Basic Parsing with `grep` and `sed`

If you just need to pull out simple text like a title or keyword, grep and sed are your friends. Here’s an example of how you might grab the page title from a website.

curl -s http://example.com | grep "<title>" | sed 's/<[^>]*>//g'

This command uses:

curl to fetch the HTML from a site (-s hides the progress meter).
grep to find the line containing the <title> tag.
sed to strip every tag on that line, leaving only the text.
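One catch with the grep approach: it only works when the opening and closing title tags sit on the same line. Here's a rough sketch of one way around that, which joins the lines first and then cuts out the part between the tags. The sample page in the html variable is made up for illustration and stands in for the output of curl.

```shell
# Made-up sample page whose <title> spans several lines (stands in for curl output).
html='<html><head><title>
  Example Domain
</title></head><body><p>Hello</p></body></html>'

# 1. tr joins all lines into one long line.
# 2. The first sed keeps only what sits between <title> and </title>.
# 3. The second sed trims leading and trailing spaces.
title=$(printf '%s\n' "$html" \
  | tr '\n' ' ' \
  | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' \
  | sed 's/^ *//; s/ *$//')

echo "$title"
```

This is still plain text matching, so it assumes a bare <title> tag with no attributes and exactly one title on the page.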

2. Parsing with `xmllint` (Using XPath Queries)

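Want something a bit more powerful? xmllint lets you query the page with XPath. The --html flag tells it to use its HTML parser instead of the strict XML one, so it copes with a lot of real-world markup, though it may print parser warnings for tags it doesn't recognize. Here’s how:

curl -s http://example.com | xmllint --html --xpath "//title/text()" - 2>/dev/null

This line finds the title text using an XPath query, and 2>/dev/null hides any parser warnings. The trailing - tells xmllint to read from standard input. It's super powerful, but very messy HTML can still trip it up.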
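XPath can reach attributes too, not just text. As a small sketch, the function below prints the address of the first link on a page. The name first_link is just something picked for this example, not part of xmllint.

```shell
# Print the href of the first <a> element read from standard input.
# string() converts the first matching attribute node to plain text.
# first_link is a hypothetical helper name chosen for this example.
first_link() {
  xmllint --html --xpath 'string(//a/@href)' - 2>/dev/null
}
```

You would use it the same way as before, for example: curl -s http://example.com | first_link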

3. Querying HTML with CSS Selectors Using `pup`

If you're familiar with CSS and want a tool that understands it, pup is a great choice. It lets you pick HTML elements using CSS-like selectors. Let’s try it:

curl -s http://example.com | pup 'title text{}'

Here, the selector title picks the <title> element, and text{} is a pup display function that prints only the text inside it. If you know CSS selectors, you already know most of what pup needs.
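pup has another display function, attr{...}, that prints an attribute instead of the text. A minimal sketch, assuming pup is installed; list_links is a made-up helper name for this example:

```shell
# Print the href of every <a> element read from standard input, one per line.
# attr{href} is pup's display function for an attribute value.
# list_links is a hypothetical helper name chosen for this example.
list_links() {
  pup 'a attr{href}'
}
```

Try it with: curl -s http://example.com | list_links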

4. Using `htmlq` (Also CSS Selectors)

htmlq works a lot like pup but is inspired by another tool called jq (which works with JSON data). If you like CSS selectors, try this:

curl -s http://example.com | htmlq --text 'title'

Just give htmlq a CSS selector, like title. On its own it prints the whole matching element, tags included; adding --text shows only the text inside.
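htmlq can also print an attribute value rather than the element itself. The sketch below wraps two common cases in small functions; the function names are invented for this example, while --attribute and --text are htmlq's own options.

```shell
# Print the href of every <a> element read from standard input.
# link_targets is a hypothetical helper name chosen for this example.
link_targets() {
  htmlq --attribute href a
}

# Print only the visible text of every <a> element.
# link_text is likewise a hypothetical name.
link_text() {
  htmlq --text a
}
```

For example: curl -s http://example.com | link_targets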

5. Advanced Parsing with JSON + HTML Using `jq`

Sometimes websites have HTML inside JSON (a special data format). You can use jq to pull out the HTML part and then pipe it to a tool like htmlq. Example:

curl -s http://api.example.com/data | jq -r '.html_content' | htmlq 'p'

This pulls the HTML string out of the JSON (the -r flag makes jq print it raw, without the surrounding quotes and escapes), then prints every paragraph (p) element in it.
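One thing to watch for: if the field is missing, jq -r prints the word null, and that gets passed along as if it were HTML. A possible guard is jq's // empty fallback, sketched below. The field name html_content comes from the example above, and extract_html is a made-up helper name.

```shell
# Print the raw HTML stored in the "html_content" field of JSON on standard input.
# "// empty" makes jq print nothing (instead of "null") when the field is missing.
# extract_html is a hypothetical helper name chosen for this example.
extract_html() {
  jq -r '.html_content // empty'
}
```

It drops into the pipeline in place of the plain jq call: curl -s http://api.example.com/data | extract_html | htmlq 'p'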

Which One Should You Use?

If you’re just getting started, try pup or htmlq since they’re easy to learn with CSS selectors. For more complicated jobs, xmllint or a combo of jq and htmlq will give you more power.

Final Thoughts

Parsing HTML from the command line is a fun way to learn both HTML and command-line magic. Now you can pick the right tool for the job and start exploring web pages right from your terminal. Happy coding!
