If you are familiar with building a web scraper in Node.js or Python, you should be able to follow this article easily. If you are not, it may be harder to follow, so I will try to keep the explanations simple.
Add Jsoup to our Gradle project
As usual, first things first: we create our Gradle project. This time we will add the Jsoup library as a dependency.
Hmm, what is Jsoup? According to its website (https://jsoup.org), Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Here is what our build.gradle looks like.
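As a rough sketch, a build.gradle for this kind of project could contain something like the following. The plugin and dependency versions here are illustrative assumptions, not necessarily the exact ones used in the project:

```groovy
// Minimal sketch of build.gradle; versions are assumptions, adjust to your setup
plugins {
    id 'org.jetbrains.kotlin.jvm' version '1.9.24'
    id 'application'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'io.vertx:vertx-core:4.5.7'   // Vert.x core (verticles, event loop)
    implementation 'io.vertx:vertx-web:4.5.7'    // Vert.x web (router, REST handlers)
    implementation 'org.jsoup:jsoup:1.17.2'      // HTML fetching and parsing
}

application {
    mainClass = 'AppKt'
}
```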
Our source files structure
| - main
| | - java
| | - kotlin
| | | - verticle
| | | | - HttpServerVerticle.kt
| | | | - MainVerticle.kt
| | | - App.kt
App.kt contains the main function that runs the service.
HttpServerVerticle.kt is the class where our REST API code will be written.
MainVerticle.kt is the main Vert.x verticle that deploys the HttpServerVerticle class.
Implementing the web scraper in our service
We will implement the web scraper using Jsoup in our HttpServerVerticle class.
As an example, you can look at line 82.
val document = Jsoup.connect("https://www.geekwire.com/startups/").get()
We call the connect() method with the site address as its parameter. After that we call the get() method to fetch and parse the whole HTML document.
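The connection can also be tuned before get() is invoked. For instance, Jsoup lets you set a user agent and a timeout; the specific values in this sketch are just illustrative:

```kotlin
import org.jsoup.Jsoup

fun fetchDocument() {
    // Fetch and parse the page; userAgent and timeout are optional tuning knobs
    val document = Jsoup.connect("https://www.geekwire.com/startups/")
        .userAgent("Mozilla/5.0 (compatible; WebScraperDemo)") // illustrative UA string
        .timeout(10_000) // 10-second timeout, an arbitrary choice
        .get()
    println(document.title())
}
```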
Then on line 83, we make a query using a CSS selector.
val articleElements = document.select("article.type-post")
The select() method finds all elements that match the selector. Yes, it returns a collection (an Elements object), so we will iterate over it in the next lines.
Next, on line 87, we see code like this.
val title = articleElement.selectFirst("h2.entry-title > a")
It uses the selectFirst() method to get only the first element that matches the selector, so we have to make sure that the selector matches a single element inside each of the articleElements.
Then on line 94, we store the scraped data we need into a JsonArray object.
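Putting those pieces together, the scraping part of the handler could look roughly like this sketch. The field names "title" and "url" are my assumptions about what gets collected, not necessarily the exact fields from the original code:

```kotlin
import io.vertx.core.json.JsonArray
import io.vertx.core.json.JsonObject
import org.jsoup.Jsoup

fun scrapeStartups(): JsonArray {
    val result = JsonArray()
    // Fetch and parse the listing page
    val document = Jsoup.connect("https://www.geekwire.com/startups/").get()
    // Select every article element on the page
    val articleElements = document.select("article.type-post")
    for (articleElement in articleElements) {
        // First headline link inside this article; skip the article if absent
        val title = articleElement.selectFirst("h2.entry-title > a") ?: continue
        result.add(
            JsonObject()
                .put("title", title.text())      // visible headline text
                .put("url", title.attr("href"))  // link to the full story
        )
    }
    return result
}
```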
MainVerticle.kt and App.kt
These are the standard class and method used to deploy our service in Vert.x.
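As a rough sketch of what that deployment code typically looks like in Vert.x (the class names follow the file structure above, but the details are assumptions; HttpServerVerticle is the class we wrote earlier):

```kotlin
import io.vertx.core.AbstractVerticle
import io.vertx.core.Vertx

// MainVerticle.kt: deploys the HTTP server verticle when it starts
class MainVerticle : AbstractVerticle() {
    override fun start() {
        vertx.deployVerticle(HttpServerVerticle())
    }
}

// App.kt: entry point that creates a Vert.x instance and deploys MainVerticle
fun main() {
    Vertx.vertx().deployVerticle(MainVerticle())
}
```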
Compile everything that we have and run the web scraper
Once you have finished writing the code, build it by running the Gradle task cleanAndJar.
Once you have your jar file, simply run:
$ java -jar build/libs/WebScraper-1.0.0.jar
Now it is time to test our web scraper API. I use Postman to check the result of what we have built.
That’s all. You can try to find other websites to scrape. If the data is useful enough, maybe you can publish your own API and try to monetize it. Sounds interesting, right?
The complete source is at https://github.com/merizrizal/vertx-webscraper-api.