Web Scraping with Java

What is Web Scraping?

Every day, a lot of time is wasted visiting the same websites again and again. What if you could write a few lines of code and get all the information you need instantly? Web scraping paves the way for you to get what you want, when you want it, from just about any website.

There are a lot of ways you can scrape the web. I prefer to do it with Java because, well, it’s the best language out there, period. Jokes aside, it has lots of amazing libraries that make scraping easy.

Requirements

What you need to know

You need to know the basics of Java, as well as a bit of HTML. On the Java side, you should be familiar with the common methods and the structure of a class. On the HTML side, you should know what the different tags are and what attributes are. I won’t be going through the basics here; I trust in your ability to pick them up.

Jsoup

Jsoup is one of the go-to libraries when it comes to parsing HTML. It’s easy to use, flexible, and it has a lot of tricks up its sleeve.

I’ll be going over the basics of using Jsoup, like getting the HTML content of a website, getting an element by its ID, etc. This should give you a good idea of how to use Jsoup efficiently.

First off, add the Jsoup library to your project structure; you can google how to add it if you aren’t sure. After that, you’re all set to start scraping.

  1. Getting the HTML content of a website

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    Document doc = Jsoup.connect("http://www.google.com/").get();
    System.out.println(doc);
    

    This will load Google and print its entire HTML. Here, Document is the class that stores the HTML content of a page.
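
    In practice, some sites reject bare requests or respond slowly, so it helps to set a user agent and a timeout before fetching. A minimal sketch using Jsoup’s Connection API (the exact values here are just examples):

    Document doc = Jsoup.connect("http://www.google.com/")
            .userAgent("Mozilla/5.0")  // present the request as coming from a browser
            .timeout(10000)            // give up if the server takes over 10 seconds
            .get();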

  2. Selecting specific parts from the HTML (Parsing)

    // also import org.jsoup.nodes.Element and org.jsoup.select.Elements
    Document doc = Jsoup.connect("http://www.google.com/").get();
    Elements links = doc.select("a[href]"); // every a tag that has an href attribute
    for (Element link : links) {
        System.out.println(link.attr("href"));
    }
    

    This code will load Google and print out all the a tags that have an href attribute. Elements is a class that stores all the DOM elements we select from the Document object, while Element is a class that stores a single DOM element. Note that select() always returns an Elements collection; to grab only the first matching tag, you would use selectFirst(), which returns a single Element. It’s important to keep this difference in mind. Here, attr() is a method that returns the specified attribute of the element we are looping over.
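
    A quick sketch of that difference, using selectFirst() to grab just the first link:

    // selectFirst() returns a single Element, or null when nothing matches
    Element firstLink = doc.selectFirst("a[href]");
    if (firstLink != null) {
        System.out.println(firstLink.attr("href"));
    }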

    The select() method is very versatile; there are a lot of other ways to select the elements we want. Here are a few of them (a short sketch combining them follows the list).

    • tagname: you can just give the tag name inside the quotes to find all the elements with that particular tag
    • .class: you can give the class name as a means of identifying the element, e.g. .someclassname
    • #id: the id can be specified too; since ids are meant to be unique, this normally finds a single element
    • [attr=value]: it is possible to specify the attribute and its value
    • tag#id: we can combine a specific tag and an id
    • tag.class: similarly, we can combine a class with tags
    • parent > child: this matches only children directly descended from the parent, e.g. div.someclass > p finds the p elements that are direct children of a div tag having class someclass
    • There are a lot of other ways too; read this page to learn more about them.
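
    To make these concrete, here is a quick sketch that exercises a few of the selector forms above on a parsed Document (the class and id names are made up for illustration):

    Elements paragraphs = doc.select("p");                 // tagname
    Elements highlights = doc.select(".highlight");        // .class (hypothetical class name)
    Elements header = doc.select("#header");               // #id (hypothetical id)
    Elements textInputs = doc.select("input[type=text]");  // [attr=value]
    Elements newsLinks = doc.select("div.news > a");       // tag.class combined with parent > child
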
  3. Extracting information from the elements

    Document doc = Jsoup.connect("http://www.google.com/").get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println(link.text());      // prints the text inside the element
        System.out.println(link.outerHtml()); // prints the entire HTML of the element,
                                              // including its children
        System.out.println(link.html());      // prints the inner HTML of the element
    }
    

    Here, we use some of the methods that help us get information like the text within a tag, and its outer and inner HTML.
    You can also get other information like the id, class name, tag name, etc. These were just a few examples of the stuff you can extract.
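
    For reference, a quick sketch of those extra accessors, which are standard Element methods:

    for (Element link : links) {
        System.out.println(link.id());        // the value of the id attribute, or "" if none
        System.out.println(link.className()); // the value of the class attribute
        System.out.println(link.tagName());   // the tag name, e.g. "a"
    }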

This should be enough to get you started with Jsoup. The topics mentioned above barely scratch the surface of the things Jsoup can do, like modifying the contents of parsed HTML (used for filling forms, etc.). If you are interested in the other features that Jsoup provides, you can take a look at its documentation.

Ui4j

Now let’s take a look at one of my favorite scraping libraries. Ui4j is a library built on the JavaFX WebKit engine. It may not be as lightweight as Jsoup, but it is basically a browser, meaning it does all the heavy lifting. We can use it to automate the use of web pages: we can click, fill forms, and a lot more.

Similar to Jsoup, you have to add it to the project structure to start using it.

  1. Getting the HTML

    BrowserEngine browser = BrowserFactory.getWebKit();
    Page page = browser.navigate("https://www.google.com");
    Document doc = page.getDocument();
    System.out.println(doc.getBody());
    

    The above code will fire up a browser, load Google, store the page’s contents in a Document object, and display them. BrowserEngine is the class where the browser instance is stored; it can be used to navigate to multiple pages, each of which is represented by a Page object. From the Page, we get the Document object we use to parse the HTML content.

  2. Parsing the HTML

    BrowserEngine browser = BrowserFactory.getWebKit();
    Page page = browser.navigate("https://www.google.com");
    Document doc = page.getDocument();
    List<Element> aTagLinks = doc.queryAll("a[href]"); // queryAll() returns a java.util.List
    for (Element elm : aTagLinks) {
        System.out.println(elm.getOuterHTML());
    }
    

    The above program will query the HTML for all the a tags with href attributes and print them. Ui4j doesn’t have a class that stores multiple elements, so we have to use a List instead.

    There are a lot of other ways you can use the queryAll() method to parse the HTML; most of them are similar to how select() works in Jsoup. There are two methods, query() and queryAll(): query() finds just the first matching element, while queryAll() finds all of them.
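
    As the .get() call in the next snippet suggests, query() wraps its result in an Optional, so a null-safe sketch of grabbing just the first link looks roughly like this:

    // query() returns an Optional holding the first match, if there is one
    doc.query("a[href]")
       .ifPresent(first -> System.out.println(first.getOuterHTML()));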

  3. Automating

    I’ll show you how to use this library to automate web pages. I’ll be loading google, and setting a string in the search bar, and load the result.

    BrowserEngine browser = BrowserFactory.getWebKit();
    PageConfiguration config = new PageConfiguration();
    config.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36");
    Page page = browser.navigate("https://www.google.com", config);
    Document doc = page.getDocument();
    doc.query("input[name='q']").get().setValue("ui4j").focus();
    Robot rob = new Robot(); // java.awt.Robot; the constructor throws AWTException
    rob.keyPress(KeyEvent.VK_ENTER);
    rob.keyRelease(KeyEvent.VK_ENTER); // release the key so it isn't left held down
    

    Google is pretty smart when it comes to preventing bots, so I had to add a few extra lines of code to trick it into thinking that our code is an actual browser. I took advantage of Ui4j’s PageConfiguration to set the user agent so that we mimic a Chrome browser running on Windows 10; navigating to the page with this configuration makes it easy for us to scrape. I then query for the input tag whose name is q and set its value to ui4j, which is basically the same thing as typing ui4j into the search bar. If you have noticed, when we start typing something into the search bar, Google moves it to the top and asks us to press enter. Since there isn’t a button to click, we have to use Java’s Robot class to simulate an enter key press. Now it is up to you to use this and scrape the links from the results.

    If you are interested in the different user agents out there, take a look at this.

  4. Some cool stuff it can do

    Two things that Ui4j can do that really impressed me were its ability to show the current page and its support for executing JavaScript. When you create a Page instance, you can use its show() method; this will open up a browser window, so you can see if you are on the right track in terms of form filling, etc. Ui4j also gives its users another option for scraping more efficiently: JavaScript. We can type in JavaScript and do things from inside the page without simulating input. For example, instead of using the setValue() method to type ui4j into the search bar, we could have used a few JS commands. We can execute scripts by calling the executeScript() method on an object of the Page class.
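
    As a small sketch, here is the same search-box trick done through executeScript() (the JS selector mirrors the query used above):

    // run a JS snippet inside the loaded page; this sets the search box value,
    // just like the setValue() call in the automation example
    page.executeScript("document.querySelector(\"input[name='q']\").value = 'ui4j';");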

There are a lot more examples given on the GitHub page of Ui4j; go here if you want to take a look at them.

Summing it up

In this blog, we saw what web scraping is, some of the common libraries used for web scraping in Java, and how to use them. We looked at the functionalities of both Jsoup and Ui4j.

I have worked on a few projects related to web scraping, one of which downloads songs from YouTube. Check it out on my GitHub page.
