Web scraping is a technique used to extract data from websites. It involves automating the process of visiting a website, parsing the HTML or XML markup, and extracting the desired information. Golang is a popular programming language that can be used for web scraping. In this tutorial, we will explore how to use the Golang library called Colly to perform web scraping.
To use Colly in your Golang project, you need to install it using the following command:

```shell
go get -u github.com/gocolly/colly/...
```
To create a basic Colly scraper, you need to first import the Colly library in your Golang file as follows:

```go
import (
    "fmt"

    "github.com/gocolly/colly"
)
```
After importing the library, you can create a new Colly collector and set up the callback functions that will be executed when certain events occur. In the example below, we are setting up a callback function that runs just before each page is requested:
```go
func main() {
    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}
```
The NewCollector() function creates a new Colly collector. The OnRequest() function registers a callback that runs just before each request is made. In this example, we are simply printing the URL of the page being visited. Finally, we call the Visit() function to start the scraping process.
To extract data from a website, you need to set up callback functions that will be executed when certain HTML elements are encountered. In the example below, we are setting up a callback function to be executed when a div element with a class attribute of post is encountered:
```go
func main() {
    c := colly.NewCollector()

    c.OnHTML("div.post", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("h2"))
        fmt.Println(e.ChildText("p"))
    })

    c.Visit("http://go-colly.org/")
}
```
The OnHTML() function sets up a callback function that will be executed when an HTML element matching the specified CSS selector is encountered. In this example, we are looking for div elements with a class attribute of post. The callback function then extracts the text content of the h2 and p elements inside the div element.
In some cases, you may want to follow links on a page to scrape additional data. To follow links, you can set up a callback function to be executed when a link is encountered, as shown in the example below:
```go
func main() {
    c := colly.NewCollector()

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}
```
In this example, we are setting up a callback function to be executed when an a element with an href attribute is encountered. The callback extracts the value of the href attribute, resolves it to an absolute URL with AbsoluteURL(), and calls Visit() to follow the link. We also register an OnRequest callback that prints each URL before it is requested. Note that Colly keeps track of the URLs it has already visited, so repeated links on a site are not fetched twice.