WEB SCRAPING - Home page

Web scraping is a tricky task, and the complexity increases when the site is dynamic. According to the United Nations Global Website Accessibility Audit, over 80% of websites are dynamic and rely on JavaScript for their functionality.

Note that when the Internet Explorer library is selected (in the Ctrl+H window), web documents are loaded through the Internet Explorer browser, which automatically executes scripts and loads additional content itself. This method also lets you run Octoparse event lists, which can simulate scrolling down web pages, clicking various page elements, and so on. However, choosing Internet Explorer slows down parsing, so further on we will consider options for downloading data without it.

We can see that the scraper cannot extract the information from a dynamic website, because the data is loaded dynamically by JavaScript. In such cases, we can use the following two methods to scrape data from dynamic, JavaScript-dependent websites:

Reverse engineering JavaScript
JavaScript rendering
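
As a rough illustration of the first method, reverse engineering usually means finding the request the page's JavaScript sends (for example in the browser's Network tab) and calling that endpoint directly. The sketch below assumes a hypothetical endpoint, https://example.com/api/search, that returns JSON shaped like {"results": [{"name": ...}]}; the URL, the "page" parameter, and the field names are placeholders, not taken from any real site.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the request you find in the Network tab.
API_URL = "https://example.com/api/search"


def build_url(base, **params):
    """Append query parameters to the endpoint URL."""
    return f"{base}?{urlencode(params)}"


def extract_names(payload):
    """Return the 'name' of every record in a JSON payload.

    Assumed layout (hypothetical): {"results": [{"name": "..."}, ...]}
    """
    data = json.loads(payload)
    return [item["name"] for item in data.get("results", [])]


def fetch_names(page):
    """Request one page of data from the endpoint and parse it."""
    request = Request(
        build_url(API_URL, page=page),
        # Many AJAX endpoints expect this header; whether a given site
        # needs it is an assumption to verify.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urlopen(request, timeout=10) as response:
        return extract_names(response.read().decode("utf-8"))
```

Because the endpoint returns the data directly, no browser is needed, which is usually much faster than rendering the page. The trade-off is that the request format is undocumented and may change without notice.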

Suppose the site content is loaded by AJAX after the page itself has already loaded. There are several ways to get dynamic content; the main one is to write a function that checks for the presence of the data at intervals. I used this option everywhere, but when a site required many sequential actions, the code grew and became harder to maintain.
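
A minimal sketch of such an interval check is shown below. The helper name wait_for and its parameters are illustrative, not part of any library; the check argument stands for whatever function looks for the data on the page and returns it once it is present.

```python
import time


def wait_for(check, interval=0.5, timeout=10.0):
    """Call check() every `interval` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no data after {timeout} seconds")
        time.sleep(interval)
```

A call might look like rows = wait_for(lambda: find_rows(page), interval=1, timeout=30), where find_rows and page are hypothetical stand-ins for your own lookup code. Chaining many such waits for sequential actions is what makes this approach hard to maintain on complex sites.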