Web scraping is a useful technique for extracting data from websites and converting it into a structured format. AWS Lambda, combined with Python, provides a scalable, serverless solution for this task. In this guide, we’ll set up a Lambda function to scrape text from an e-commerce site and convert it into a structured JSON output.
#### **Create a Lambda Function**:
In the AWS Lambda console, create a new function named `ScrapeEcommerceData` and choose a Python runtime (for example, Python 3.12).
#### **Add Required Permissions**:
This function only needs permission to write its logs to CloudWatch, which the AWS-managed `AWSLambdaBasicExecutionRole` policy provides; attach it to the function's execution role if it isn't already there.
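For reference, the effect of that managed policy is roughly the following (the exact document AWS maintains may differ in detail):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```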
#### **Install Dependencies**:
The `requests` and `beautifulsoup4` libraries are not included in the Lambda Python runtime, so they must be packaged with your function or supplied through a Lambda layer, as shown below.
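One common approach is to install the libraries into a local folder next to the handler file and upload the zipped contents as the deployment package (the file names here are illustrative):

```bash
# Install the dependencies alongside the handler code
mkdir package
pip install requests beautifulsoup4 -t package/
cp lambda_function.py package/

# Zip the contents (not the folder itself) and upload it as the function code
cd package
zip -r ../deployment.zip .
```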
#### **Lambda Function Code**:
Here is an example Lambda function for scraping product information from an e-commerce page:
```python
import json

import requests
from bs4 import BeautifulSoup


def lambda_handler(event, context):
    # URL of the e-commerce page to scrape
    url = "https://example-ecommerce-site.com/product-page"

    # Send an HTTP GET request to fetch the page content;
    # a timeout keeps the function from hanging until Lambda kills it
    response = requests.get(url, timeout=10)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract product details (e.g., title, price, and description);
        # adjust the tag and class names to match the target page
        product_title = soup.find('h1', class_='product-title').text.strip()
        product_price = soup.find('span', class_='product-price').text.strip()
        product_description = soup.find('div', class_='product-description').text.strip()

        # Structure the extracted data as JSON
        product_data = {
            "title": product_title,
            "price": product_price,
            "description": product_description
        }

        # Return the JSON response
        return {
            'statusCode': 200,
            'body': json.dumps(product_data)
        }
    else:
        return {
            'statusCode': response.status_code,
            'body': json.dumps({"error": "Failed to retrieve the page"})
        }
```
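Note that the `find` calls above assume every element exists on the page; on real sites it is safer to guard against missing tags. A small helper along these lines (hypothetical, not part of the code above) avoids an `AttributeError` when a selector matches nothing:

```python
def extract_text(soup, tag, class_name, default="N/A"):
    # Return the stripped text of the first matching element, or a default
    element = soup.find(tag, class_=class_name)
    return element.text.strip() if element else default
```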
#### **Create a Test Event**:
In the Lambda console, open the **Test** tab and create a new test event. Since the handler ignores the incoming event, an empty JSON object `{}` is sufficient.
#### **Run the Test**:
Invoke the function with the saved test event. On success, the execution result shows a 200 status code and the structured product data in the response body.
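You can also invoke the deployed function from the AWS CLI (v2 syntax shown; the `--cli-binary-format` flag lets the payload be passed as plain JSON):

```bash
aws lambda invoke \
  --function-name ScrapeEcommerceData \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  response.json

cat response.json
```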
#### **Sample JSON Output**:
```json
{
  "title": "Wireless Bluetooth Headphones",
  "price": "$59.99",
  "description": "High-quality Bluetooth headphones with noise-cancellation and up to 20 hours of battery life."
}
```
Using AWS Lambda for web scraping offers a serverless and scalable way to extract and structure data. This solution can be expanded to handle large-scale data extraction tasks, integrate with AWS storage services, or trigger additional workflows based on the data collected.
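For example, to integrate with AWS storage, the handler could persist each successful scrape to S3 instead of only returning it. A minimal sketch, assuming a bucket named `scraped-products` (hypothetical) exists and the execution role has `s3:PutObject` permission on it:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def save_product_data(product_data):
    # Key each scrape by timestamp so successive runs don't overwrite each other
    key = f"scrapes/{datetime.now(timezone.utc).isoformat()}.json"
    s3.put_object(
        Bucket="scraped-products",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(product_data),
        ContentType="application/json",
    )
    return key
```

Calling `save_product_data(product_data)` just before the successful return in `lambda_handler` would archive every scrape; note that `boto3` is preinstalled in the Lambda Python runtime, so it does not need to be packaged with the function.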