
Commit 4680b33

mokuhasushi (Antonio Tirone) and Antonio Tirone authored

feat: migrate mnq sqs tutorial source code to serverless examples (#70)

* feat: migrate mnq sqs tutorial source code to serverless examples
* feat: zip archives via terraform, tf outputs

Co-authored-by: Antonio Tirone <[email protected]>

1 parent 4c0ef15 commit 4680b33

File tree

9 files changed: +408 −1 lines changed

README.md

+2 −1

```diff
@@ -82,7 +82,8 @@ Table of Contents:
 | **[Kong API Gateway](projects/kong-api-gateway/README.md)** <br/> Deploying a Kong Gateway on containers to provide routing to functions. | CaaS & FaaS | Python | [Serverless Framework] |
 | **[Serverless Gateway](https://github.com/scaleway/serverless-gateway)** <br/> Our serverless gateway for functions and containers. | API Gateway | Python | [Python API Framework] |
 | **[Monitoring Glaciers](projects/blogpost-glacier/README.md)** <br/> A project to monitor glaciers and the impact of global warming. | S3 & RDB | Golang | [Serverless Framework] |
-| **[Manage large message](projects/large-messages/README.md)** <br/> An example of infrastructure to manage large messages. | PaaS & S3 | Python | [Terraform] |
+| **[Manage large message](projects/large-messages/README.md)** <br/> An example of infrastructure to manage large messages. | PaaS & S3 | Python | [Terraform] |
+| **[Serverless scraping](projects/serverless-scraping/README.md)** <br/> An example of infrastructure to scrape the hackernews website. | PaaS & RDB | Python | [Terraform] |
 
 [Serverless Framework]: https://github.com/scaleway/serverless-scaleway-functions
 [Terraform]: https://registry.terraform.io/providers/scaleway/scaleway/latest/docs
```
+16

@@ -0,0 +1,16 @@

```
venv/
.env
*.zip
package/

# terraform
**/.terraform/*

*.tfstate
*.tfstate.*

crash.log
crash.*.log

*.tfvars
*.tfvars.json
```
+37

@@ -0,0 +1,37 @@

# Create a serverless scraping architecture

This is the source code for the tutorial: [Create a serverless scraping architecture, with Scaleway Messaging and Queuing SQS, Serverless Functions and Managed Database](https://www.scaleway.com/en/docs/tutorials/create-serverless-scraping).

In this tutorial we show how to set up a simple application that reads [Hacker News](https://news.ycombinator.com/news) and processes the articles it finds there asynchronously, using Scaleway serverless products.

## Requirements

This example assumes you are familiar with how serverless functions work. If needed, you can check the [official Scaleway documentation](https://www.scaleway.com/en/docs/serverless/functions/quickstart/).

This example is written using Python and Terraform, and assumes you have [set up authentication for the Terraform provider](https://registry.terraform.io/providers/scaleway/scaleway/latest/docs#authentication).

## Context

**The architecture deployed in this tutorial consists of two functions, two triggers, an SQS queue, and an RDB instance.**

*The producer function, activated by a recurrent cron trigger, scrapes Hacker News for articles published in the last 15 minutes and pushes the title and URL of each article to an SQS queue created with Scaleway Messaging and Queuing.*

*The consumer function, triggered by each new message on the SQS queue, consumes messages published to the queue, scrapes some data from the linked article, and then writes the data into a Scaleway Managed Database.*
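The only contract between the two functions is the shape of the queue message. The sketch below is illustrative: the `url` and `title` field names come from the source files in this commit, the sample values are made up, and the fact that the SQS trigger hands the raw message body to the consumer in the event's `body` field is what the consumer code relies on.

```python
import json

# Payload the producer serializes for each fresh article (sample values are hypothetical).
message_body = json.dumps({
    "url": "https://example.com/some-article",  # hypothetical article link
    "title": "A sample Hacker News title",      # hypothetical title
})

# The SQS trigger delivers the raw message body in the function event's "body" field,
# which the consumer decodes before scraping the linked page.
event = {"body": message_body}
parsed = json.loads(event["body"])
print(parsed["url"], parsed["title"])
```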
## Setup

Once you have cloned this repository, you just need to deploy it using Terraform:

```bash
terraform init
terraform apply
```

## Running

Everything is already up and running! You can check correct execution using the Scaleway Cockpit, and by connecting to your RDB instance to see the results:

```bash
psql -h $(terraform output -raw db_ip) --port $(terraform output -raw db_port) -d hn-database -U worker
```
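If you prefer to check the results from Python rather than `psql`, here is a minimal sketch, assuming the same connection details are available in your environment (for instance exported from the `terraform output` values, a hypothetical setup step) and that the consumer has already created the `articles` table. The environment variable names mirror the ones the consumer function reads.

```python
import os

import pg8000.native

# Connection details: assumed to be exported in your environment, e.g. from terraform outputs.
conn = pg8000.native.Connection(
    host=os.environ["DB_HOST"],
    port=int(os.environ["DB_PORT"]),
    database=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

# List the most recently inserted articles and their tag counts.
for row in conn.run(
    "SELECT title, url, a_count, h1_count, p_count FROM articles ORDER BY id DESC LIMIT 10"
):
    print(row)

conn.close()
```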
@@ -0,0 +1,2 @@

```
# Ignore everything in this directory except this file
!.gitignore
```
@@ -0,0 +1,73 @@

```python
import json
import os

import pg8000.native
import requests
from bs4 import BeautifulSoup

db_host = os.getenv('DB_HOST')
db_port = os.getenv('DB_PORT')
db_name = os.getenv('DB_NAME')
db_user = os.getenv('DB_USER')
db_password = os.getenv('DB_PASSWORD')

CREATE_TABLE_IF_NOT_EXISTS = """
CREATE TABLE IF NOT EXISTS articles (
  id SERIAL PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  url VARCHAR(255) NOT NULL,
  a_count INTEGER NOT NULL,
  h1_count INTEGER NOT NULL,
  p_count INTEGER NOT NULL
);"""

INSERT_INTO_ARTICLES = """
INSERT INTO articles (title, url, a_count, h1_count, p_count)
VALUES (:title, :url, :a_count, :h1_count, :p_count) RETURNING id
;"""


def scrape_page_for_stats(url):
    """
    Scrape the page at the given URL and return counts of the chosen tags.
    """
    # Articles hosted on HN itself have a relative URL (e.g. "item?id=...")
    if url[:4] == "item":
        url = "https://news.ycombinator.com/" + url

    page = requests.get(url, timeout=15)
    html_doc = page.content
    soup = BeautifulSoup(html_doc, 'html.parser')

    tags = ['a', 'h1', 'p']

    return {tag: len(soup.find_all(tag)) for tag in tags}


def scrape_and_save_to_db(event):
    """
    Scrape the page referenced by the queue message and save its tag counts in the database.
    """
    body = json.loads(event["body"])

    tags_count = scrape_page_for_stats(body['url'])
    conn = None
    try:
        conn = pg8000.native.Connection(
            host=db_host, database=db_name, port=db_port,
            user=db_user, password=db_password, timeout=15,
        )

        # Where else could we create the table, to avoid manual intervention?
        conn.run(CREATE_TABLE_IF_NOT_EXISTS)
        conn.run(
            INSERT_INTO_ARTICLES,
            title=body['title'], url=body['url'],
            a_count=tags_count['a'], h1_count=tags_count['h1'], p_count=tags_count['p'],
        )

    finally:
        if conn is not None:
            conn.close()
    return 200


def handle(event, context):
    try:
        status = scrape_and_save_to_db(event)
        return {'statusCode': status, 'headers': {'Content-Type': 'text/plain'}}
    except Exception as e:
        print("error", e)
        return {'statusCode': 500, 'body': str(e)}


if __name__ == '__main__':
    # Local smoke test with a hardcoded article
    handle({'body': json.dumps({'url': 'https://google.com', 'title': 'test url'})}, None)
```
@@ -0,0 +1,3 @@

```
pg8000
requests
bs4
```
@@ -0,0 +1,51 @@

```python
import json
import os
from datetime import datetime, timedelta

import boto3
import requests
from bs4 import BeautifulSoup

HN_URL = "https://news.ycombinator.com/newest"
SCW_SQS_URL = "https://sqs.mnq.fr-par.scaleway.com"

queue_url = os.getenv('QUEUE_URL')
sqs_access_key = os.getenv('SQS_ACCESS_KEY')
sqs_secret_access_key = os.getenv('SQS_SECRET_ACCESS_KEY')


def scrape_and_push():
    """
    Scrape the HN website for articles published in the last 15 minutes,
    and push their title and URL to the SQS queue.
    """
    page = requests.get(HN_URL, timeout=15)
    html_doc = page.content

    soup = BeautifulSoup(html_doc, 'html.parser')

    # On the HN "newest" page there are exactly 30 articles; each has a `titleline` and an `age` span
    titlelines = soup.find_all(class_="titleline")
    ages = soup.find_all(class_="age")

    sqs = boto3.client(
        'sqs',
        endpoint_url=SCW_SQS_URL,
        aws_access_key_id=sqs_access_key,
        aws_secret_access_key=sqs_secret_access_key,
        region_name='fr-par',
    )

    for age, titleline in zip(ages, titlelines):
        time_str = age["title"]
        time = datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%S")
        # Skip articles older than 15 minutes
        if datetime.utcnow() - time > timedelta(minutes=15):
            continue

        body = json.dumps({'url': titleline.a["href"], 'title': titleline.a.get_text()})
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)

    return page.status_code


def handle(event, context):
    try:
        status = scrape_and_push()
        return {'statusCode': status, 'headers': {'Content-Type': 'text/plain'}}
    except Exception as e:
        print(e)
        return {'statusCode': 500, 'body': str(e)}


if __name__ == "__main__":
    handle(None, None)
```
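To verify that messages are actually landing on the queue (for instance before the consumer trigger is wired up), a minimal sketch like the one below can help. It reuses the producer's endpoint and environment variables and relies only on the standard SQS `receive_message` call; treat it as a debugging aid under those assumptions, not as part of the deployed architecture.

```python
import os

import boto3

SCW_SQS_URL = "https://sqs.mnq.fr-par.scaleway.com"

# Same credentials and queue URL the producer reads from its environment.
sqs = boto3.client(
    "sqs",
    endpoint_url=SCW_SQS_URL,
    aws_access_key_id=os.environ["SQS_ACCESS_KEY"],
    aws_secret_access_key=os.environ["SQS_SECRET_ACCESS_KEY"],
    region_name="fr-par",
)

# Peek at up to 10 pending messages (they stay hidden only for the visibility timeout).
response = sqs.receive_message(
    QueueUrl=os.environ["QUEUE_URL"],
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,
)

for message in response.get("Messages", []):
    print(message["Body"])
```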
@@ -0,0 +1,3 @@

```
boto3
bs4
requests
```
