feat: third lesson about DevTools #1321

Draft
wants to merge 1 commit into
base: master
@@ -6,12 +6,93 @@ sidebar_position: 3
slug: /scraping-basics-python/devtools-extracting-data
---

import Exercises from './_exercises.mdx';

**In this lesson, we'll use the browser's developer tools to manually extract product data from an e-commerce website.**

---

In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data. Now how do we extract the data?

## Finding product details

Previously, we've figured out how to save the subwoofer product card to a variable in the **Console**:

```js
products = document.querySelectorAll('.product-item');
subwoofer = products[2];
```

The product details are within the element as text, so maybe if we extract the text, we could work out the individual values?

```js
subwoofer.textContent;
```

That indeed outputs all the text, but in a form that would be hard to break down into relevant pieces.

![Printing text content of the parent element](./images/devtools-extracting-text.png)

We'll need to first locate relevant child elements and extract the data from each of them individually.
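
To illustrate why the raw text is hard to work with, here's a minimal sketch that splits a blob of text into trimmed, non-empty lines. The `splitTextLines` helper is hypothetical, and the sample string is made up for illustration rather than copied from the page. In the **Console**, we could call `splitTextLines(subwoofer.textContent)` instead:

```js
// Hypothetical helper: break a blob of text into trimmed, non-empty lines.
function splitTextLines(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Made-up sample, only resembling what textContent might return for a product card.
const sampleText = '\n   Example Subwoofer\n\n   Sale price$10.00\n   Regular price$20.00\n';
console.log(splitTextLines(sampleText));
```

Even after such a split, we'd get only a list of strings with nothing telling us which one is the title and which one is the price. Targeting individual child elements is likely to be more reliable.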

## Extracting title

We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. Of those, `product-item__title` seems like a great choice for locating the element.

![Finding child elements](./images/devtools-product-details.png)

JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Besides properties we've already played with, such as `textContent` or `outerHTML`, such an object also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. When called on an element, the method looks for matches only among the descendants of that element:

```js
title = subwoofer.querySelector('.product-item__title');
title.textContent;
```

Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like that, we've scraped our first piece of data! We've extracted the product title:

![Extracting product title](./images/devtools-extracting-title.png)
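
If the selector matches nothing, `querySelector()` returns `null`, and reading `textContent` from `null` throws an error. A slightly more defensive sketch of the same step could look like the following. The `extractTitle` name is an assumption made for illustration; in the **Console**, we'd call it as `extractTitle(subwoofer)`:

```js
// Hypothetical helper: returns the trimmed title text of a product card,
// or null if the card has no element with the product-item__title class.
function extractTitle(productCard) {
  const titleElement = productCard.querySelector('.product-item__title');
  return titleElement ? titleElement.textContent.trim() : null;
}
```

Trimming the text is optional here. It only removes whitespace around the title, which can show up when the HTML is indented.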

## Extracting price

To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices, we'll need the sale price. Both are `span` elements with the `price` class.

![Finding child elements](./images/devtools-product-details.png)
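
To double-check that the card really contains two elements with the `price` class, we could list them all with `querySelectorAll()`. The following is only a sketch, and the `listPriceTexts` name is hypothetical; in the **Console**, we'd call it as `listPriceTexts(subwoofer)`:

```js
// Hypothetical helper: returns the trimmed text of every element
// with the price class found inside the given product card.
function listPriceTexts(productCard) {
  const priceElements = productCard.querySelectorAll('.price');
  return Array.from(priceElements, (element) => element.textContent.trim());
}
```

The order of the returned texts follows the order of the elements in the document, which is handy for checking which of the two prices comes first.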

We could rely either on the sale price always being the highlighted one, or on it always being the first price. For now we'll rely on the latter and let `querySelector()` simply return the first result:

```js
price = subwoofer.querySelector('.price');
price.textContent;
```

It works, but the price isn't alone in the result. Before we could use such data, we'd need to do some **data cleaning**:

![Extracting product price](./images/devtools-extracting-price.png)

But for now that's okay. We're just testing the waters so that we have an idea of what our scraper will need to do. Once we get to extracting prices in Python, we'll figure out how to get numbers out of them.
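
Just as a preview of what data cleaning might involve, here's a rough sketch in JavaScript. It assumes the text contains a single number that uses a dot as the decimal separator and optional commas as thousands separators, which may not match the real page. The `parsePrice` name and the sample strings are made up for illustration:

```js
// Hypothetical helper: pulls the first number out of a price text.
// Assumes a dot as the decimal separator and commas as thousands separators.
function parsePrice(priceText) {
  // Match a run of digits and commas, optionally followed by a decimal part.
  const match = priceText.match(/\d[\d,]*(?:\.\d+)?/);
  if (!match) {
    return null; // no number found in the text
  }
  // Drop the thousands separators before converting to a number.
  return Number.parseFloat(match[0].replace(/,/g, ''));
}

console.log(parsePrice('Sale price$1,398.00')); // 1398
console.log(parsePrice('no digits here')); // null
```

A real scraper would also need to decide how to handle price ranges or other currencies, which this sketch ignores.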

## Extracting URL

:::danger Work in Progress

Under development.

:::

## Extracting all URLs

:::danger Work in Progress

Under development.

:::

---

<Exercises />

:::danger Work in Progress

This lesson is under development. Please read [Extracting data with DevTools](../scraping_basics_javascript/data_extraction/devtools_continued.md) in the meantime so you can follow the upcoming lessons.
Under development.

:::