How to Use AWS Lambda to Scrape Text and Create Structured JSON

Introduction: Automating Data Extraction with AWS Lambda

Web scraping is a useful technique for extracting data from websites and converting it into a structured format. AWS Lambda, combined with Python, provides a scalable, serverless solution for this task. In this guide, we’ll set up a Lambda function to scrape text from an e-commerce site and convert it into a structured JSON output.

Step-by-Step Setup

Step 1: Set Up AWS Lambda

  1. Create a Lambda Function:

    • Navigate to the AWS Management Console and open AWS Lambda.
    • Click Create Function and select Author from scratch. Give it a name like ScrapeEcommerceData.
    • Choose a recent Python runtime (for example, Python 3.12).
  2. Add Required Permissions:

    • Create an IAM Role with basic Lambda execution permissions.
    • If your Lambda function needs to access other AWS services or private networks, attach additional policies to the role (for example, VPC access permissions if the target site is only reachable from inside a VPC).

Step 2: Write the Lambda Function Code

  1. Install Dependencies:

    • The function uses requests to send HTTP requests and BeautifulSoup (from the beautifulsoup4 package, imported as bs4) to parse HTML. Neither library ships with the Lambda Python runtime, so bundle them in your deployment package (for example, run pip install requests beautifulsoup4 -t . in your project folder and zip the contents together with your handler file) or attach them via a Lambda layer.
  2. Lambda Function Code:

Here is an example Lambda code for scraping product information from an e-commerce page:

import json
import requests
from bs4 import BeautifulSoup

def lambda_handler(event, context):
    # URL of the e-commerce page to scrape
    url = "https://example-ecommerce-site.com/product-page"

    # Send an HTTP GET request; the timeout stops the call from hanging
    # until the Lambda function itself times out
    response = requests.get(url, timeout=10)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract product details (e.g., title, price, and description);
        # find() returns None when an element is missing, so check before
        # reading .text to avoid an AttributeError if the page layout changes
        title_el = soup.find('h1', class_='product-title')
        price_el = soup.find('span', class_='product-price')
        desc_el = soup.find('div', class_='product-description')

        if not (title_el and price_el and desc_el):
            return {
                'statusCode': 500,
                'body': json.dumps({"error": "Expected product elements not found"})
            }

        # Structure the extracted data as JSON
        product_data = {
            "title": title_el.text.strip(),
            "price": price_el.text.strip(),
            "description": desc_el.text.strip()
        }

        # Return the JSON response
        return {
            'statusCode': 200,
            'body': json.dumps(product_data)
        }
    else:
        return {
            'statusCode': response.status_code,
            'body': json.dumps({"error": "Failed to retrieve the page"})
        }

Explanation:

  • The function uses requests to send an HTTP GET request to the e-commerce URL.
  • BeautifulSoup extracts specific elements like the product title, price, and description based on their HTML tags and classes.
  • The extracted fields are collected into a Python dictionary and serialized with json.dumps for further use.
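The parsing-and-structuring step can be exercised locally on a small HTML snippet, without any HTTP request. Below is a minimal sketch, assuming beautifulsoup4 is installed and using hypothetical markup that mirrors the classes the Lambda code expects:

```python
import json
from bs4 import BeautifulSoup

# A miniature stand-in for the real product page (hypothetical markup)
html = """
<html><body>
  <h1 class="product-title">Wireless Bluetooth Headphones</h1>
  <span class="product-price">$59.99</span>
  <div class="product-description">High-quality Bluetooth headphones.</div>
</body></html>
"""

# Parse the snippet exactly as the Lambda function parses the live page
soup = BeautifulSoup(html, 'html.parser')

# Pull out each element by tag and class, then strip surrounding whitespace
product_data = {
    "title": soup.find('h1', class_='product-title').text.strip(),
    "price": soup.find('span', class_='product-price').text.strip(),
    "description": soup.find('div', class_='product-description').text.strip(),
}

print(json.dumps(product_data, indent=4))
```

Running this against a saved copy of the target page is a quick way to verify your selectors before deploying.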

Step 3: Test the Lambda Function

  1. Create a Test Event:

    • In the Lambda console, click Test and create a new test event with basic JSON input (since the code doesn’t rely on event data).
  2. Run the Test:

    • Execute the test and verify the returned JSON structure in the execution results.

Sample JSON Output:

{
    "title": "Wireless Bluetooth Headphones",
    "price": "$59.99",
    "description": "High-quality Bluetooth headphones with noise-cancellation and up to 20 hours of battery life."
}
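
Before deploying, you can also sanity-check the handler's response contract on your own machine. The sketch below uses a stubbed handler with canned data (no network call) purely to show how a test invocation consumes the return value; the real function would fetch and parse the page first:

```python
import json

def lambda_handler(event, context):
    # Stubbed handler: returns canned product data instead of scraping,
    # so the response shape can be checked without network access
    product_data = {
        "title": "Wireless Bluetooth Headphones",
        "price": "$59.99",
        "description": "High-quality Bluetooth headphones."
    }
    return {
        'statusCode': 200,
        'body': json.dumps(product_data)
    }

# Simulate what the Lambda console's Test button does: invoke the handler
# with an (unused) event dict and a None context
result = lambda_handler({}, None)

assert result['statusCode'] == 200
body = json.loads(result['body'])
print(body['title'])  # Wireless Bluetooth Headphones
```

Note that the body is a JSON string, not a dictionary; callers such as API Gateway expect it that way, which is why the example decodes it with json.loads before reading fields.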

Conclusion: A Scalable and Serverless Scraping Solution

Using AWS Lambda for web scraping offers a serverless and scalable way to extract and structure data. This solution can be expanded to handle large-scale data extraction tasks, integrate with AWS storage services, or trigger additional workflows based on the data collected.

Published Oct 31, 2024