Feature: Advanced parser utility #147

Closed
heitorlessa opened this issue Aug 28, 2020 · 10 comments
Labels
feature-request feature request

Comments

@heitorlessa
Contributor

heitorlessa commented Aug 28, 2020

Is your feature request related to a problem? Please describe.

We've heard from a small number of customers that parsing Lambda Event Source payloads requires considerable effort, since these don't have official schemas for Python.

With parsing and classes modelled after these schemas, they can have the benefits of runtime type safety, custom validations on possible values pertinent to their use case, autocomplete, and only parse fields they're interested in.

Describe the solution you'd like

Solution is two-fold:

  • A new parser utility that uses Pydantic to parse and validate incoming/outgoing events, and allow customers to use their own data models
  • Pre-defined schemas and event envelopes for popular event sources, so one can apply and validate their models against where the payload is
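
As an illustration of the two parts above, here is a minimal sketch of an envelope plus a user-defined model, written directly against Pydantic. The names `Order`, `SqsRecord`, `SqsEvent`, and `parse_sqs_bodies` are hypothetical and are not the utility's API; the envelope models only the fields needed for the example, and the sample event is a trimmed-down, SQS-like payload.

```python
import json
from typing import List, Type, TypeVar

from pydantic import BaseModel, ValidationError


# Hypothetical customer-defined model for the message payload.
class Order(BaseModel):
    order_id: int
    description: str


# Hypothetical, trimmed-down envelope: only the fields this sketch needs.
class SqsRecord(BaseModel):
    messageId: str
    body: str


class SqsEvent(BaseModel):
    Records: List[SqsRecord]


T = TypeVar("T", bound=BaseModel)


def parse_sqs_bodies(event: dict, model: Type[T]) -> List[T]:
    """Validate the envelope, then parse each JSON body into `model`."""
    envelope = SqsEvent(**event)
    return [model(**json.loads(record.body)) for record in envelope.Records]


sample_event = {
    "Records": [
        {
            "messageId": "1",
            "receiptHandle": "ignored-by-this-sketch",  # not declared, so not parsed
            "body": json.dumps({"order_id": 42, "description": "book"}),
        }
    ]
}

orders = parse_sqs_bodies(sample_event, Order)
print(orders[0].order_id, orders[0].description)

# A payload that does not match the model is rejected at parse time.
bad_event = {"Records": [{"messageId": "2", "body": json.dumps({"order_id": "abc"})}]}
try:
    parse_sqs_bodies(bad_event, Order)
except ValidationError as exc:
    print("invalid payload:", len(exc.errors()), "error(s)")
```

Fields that the models do not declare, such as `receiptHandle` here, are ignored by default, which matches the "only parse fields they're interested in" goal.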

This would reduce the amount of time developers invest searching for the official data structure for each event source, improve their security posture on incoming and outgoing events, and increase developer productivity.

Describe alternatives you've considered

  • Implement simple validation using JSON Schemas as well as an extractor utility to retrieve the payload only
  • Bring Pydantic as an optional package to prevent bloating the library for those not using it

The challenge with JSON Schemas is that they typically validate only the structure of incoming/outgoing events, not business rules.

Additional context

There is an initial implementation that lacks customer data points as of now, but it could be revisited depending on interest in this feature: #118

@heitorlessa heitorlessa added triage Pending triage from maintainers feature-request feature request and removed triage Pending triage from maintainers labels Aug 28, 2020
@jplock

jplock commented Aug 28, 2020

I like the idea of using JSON schemas strictly for validating AWS managed event types that don’t change. Validating the payload would be the responsibility of the consumer (using pydantic or something similar).

@heitorlessa heitorlessa changed the title Advanced parser utility Feature: Advanced parser utility Aug 28, 2020
@heitorlessa heitorlessa pinned this issue Aug 28, 2020
@Nr18

Nr18 commented Aug 31, 2020

My preference would be to use pydantic; having the ability to add business rule validation would be beneficial to me.

I actually ran into the issue of having to set my date field to a string in the JSON schema because the date is optional... With pydantic, you can have an Optional[date] without having to write an additional statement just to check that the string is actually a date.
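
A minimal sketch of the `Optional[date]` case described above. The `Shipment` model and its field names are hypothetical, chosen only for illustration.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


# Hypothetical model: the date may be absent, but if present it must be a real date.
class Shipment(BaseModel):
    shipment_id: str
    delivered_on: Optional[date] = None


# An ISO-formatted string is parsed into a datetime.date.
delivered = Shipment(**{"shipment_id": "a", "delivered_on": "2020-08-31"})

# Omitting the field is allowed and yields None.
pending = Shipment(**{"shipment_id": "b"})

# A string that is not a date is rejected, with no extra check written by hand.
try:
    Shipment(**{"shipment_id": "c", "delivered_on": "not-a-date"})
    rejected = False
except ValidationError:
    rejected = True

print(delivered.delivered_on, pending.delivered_on, rejected)
```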

@koxudaxi

I always use pydantic in Lambda for events and payloads.
Pydantic is a great solution for validation and parsing.

However, the library is a little heavy for Lambda.
Pydantic is compiled with Cython, which creates some *.so files.
I calculated the size of these files: it's about 76MB. https://pypi.org/project/pydantic/#files

I suggest that Pydantic be supplied as an option.
I also think powertools should provide two types of validator (decorator): JSON Schema and Pydantic.
Users can select one.

Also, I develop a code generator that generates pydantic models from JSON Schema.
https://github.com/koxudaxi/datamodel-code-generator
If we maintain JSON Schemas, then we can get pydantic models from the code generator too.

Additionally, this code generator can create models from JSON data.
Last week, I created the AWS Connect event model from an event object that I got from CloudWatch Logs. It's very useful.

Background

I created an experimental project that provides pydantic models for events.
Example: https://github.com/koxudaxi/pydantic-collection/blob/master/pydantic_collection/aws/sns/models.py

I often define pydantic models for events in my projects.
I would like to re-use the models across all projects.
But it's difficult to create models for all events by myself.
I hope that aws-lambda-powertools-python will maintain all models.

@heitorlessa
Contributor Author

hey @koxudaxi - This is interesting! I was under the impression only the wheels was going to count (8.2M for manylinux). It's also great to hear you created all this Pydantic tooling, as I still have questions about the UX and code generation :) -- TIL.

At the moment, we're collecting customer demand on Pydantic usage within Serverless. We want to support it (see #118), but we're also mindful of justifying customer demand, Pydantic as an extra dependency, simplified UX, how much we want to abstract, and docs to ease the transition from JSON Schemas to Pydantic.

We're on the fence as to whether to create a single validator utility that supports dual modes (JSON Schema or Pydantic), or a parser utility solely focused on Pydantic, as the docs and usage will be largely different.

Would love to hear feedback on this front -- And yes, we'd be happy to maintain models for Lambda Event Sources (only), hence a longer discussion so we can get Pydantic right without breaking our Tenets

@heitorlessa
Contributor Author

@jplock initial simple validator for JSON Schema with optional data selector (envelope) using JMESPath as an extra dependency #153

@gmcrocetti
Contributor

This proposal is great! It's something really useful that I've been wishing for a long time.

About the description, it's unclear to me why we need to create JSON schemas representing AWS event types instead of Python annotations. Please don't read this as criticism, I really don't know. For my use case, the killer feature of using pydantic or any other "parsing" lib is that we can deal with objects and validate/add business behavior that we're unable to express in a schema. What are the use cases for JSON Schema and for pydantic?

@heitorlessa, maybe I'm going "too far", but IMO two points need to be discussed, at least at a "design" level. The first is related to input parsing: it would be great to design something pluggable. For example, if the codebase of client X is entirely written in marshmallow, are we going to force them to rewrite everything to a new standard? Maybe this applies to few cases, but I'm sure we can provide an interface to plug in any parsing lib they're used to; powertools shouldn't provide them. As for the second, our utility must explicitly require the what and the how needed to correctly parse a message. The "what" is an AWS event source and the "how" is a schema. It looks like you've already done this in your PR.
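
One way to picture the pluggable "what"/"how" split is the standard-library sketch below. Everything in it is hypothetical (`Extractor`, `Parser`, `sqs_bodies`, `order_from_dict`, `parse_event`) and it is not taken from the PR; the point is only that the utility could accept any callable as the "how", so a team could pass in whatever parsing library it already uses.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# "What": knows where the payloads live inside a given event source.
Extractor = Callable[[Dict[str, Any]], List[Dict[str, Any]]]
# "How": turns one raw payload into a validated object, using any library.
Parser = Callable[[Dict[str, Any]], Any]


def sqs_bodies(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical extractor for an SQS-like event: decode each record body."""
    return [json.loads(record["body"]) for record in event["Records"]]


@dataclass
class Order:
    order_id: int


def order_from_dict(payload: Dict[str, Any]) -> Order:
    """Hypothetical hand-written parser; a schema library's loader could go here instead."""
    order_id = payload.get("order_id")
    if not isinstance(order_id, int) or order_id <= 0:
        raise ValueError(f"invalid order_id: {order_id!r}")
    return Order(order_id=order_id)


def parse_event(event: Dict[str, Any], extract: Extractor, parse: Parser) -> List[Any]:
    """Apply the 'how' to every payload found by the 'what'."""
    return [parse(payload) for payload in extract(event)]


sample_event = {"Records": [{"body": json.dumps({"order_id": 42})}]}
orders = parse_event(sample_event, sqs_bodies, order_from_dict)
print(orders)
```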

About pydantic, it'd be great to see it here, as an extra dependency, for sure. Add it to a layer and we're "done".

@ran-isenberg
Contributor

ran-isenberg commented Sep 1, 2020

@gmcrocetti @koxudaxi you can see my pydantic PR #118.
It also includes automatic envelope parsing for SQS, EventBridge, and DynamoDB Streams schemas, as well as custom user schemas.

@koxudaxi

koxudaxi commented Sep 1, 2020

@heitorlessa

This is interesting! I was under the impression only the wheels was going to count (8.2M for manylinux).

It's compressed. You may be surprised when extracting it.

OK, I understand what we should discuss in this phase.

Let me write about the great UX of Pydantic:
it parses a dict into a model.
(Of course, Pydantic can validate input data. However, JSON Schema has the same benefit.)

Lambda event objects are deeply nested structures.
It's too hard to understand the structure for each service.
Also, IDEs can't offer auto-completion or type-checking for them.
(TypedDict is a better way to do static type analysis.)

Pydantic solves these problems.
We can access nested attributes in Pydantic models easily.
I feel this is the best practice for handling Lambda event objects.
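
To make the nested-access point concrete, here is a small sketch. The event shape and the `Customer`, `OrderDetail`, and `OrderEvent` models are hypothetical, not an official AWS schema.

```python
from pydantic import BaseModel


# Hypothetical nested models mirroring a deeply nested event.
class Customer(BaseModel):
    name: str
    email: str


class OrderDetail(BaseModel):
    order_id: int
    customer: Customer


class OrderEvent(BaseModel):
    source: str
    detail: OrderDetail


raw_event = {
    "source": "hypothetical.orders",
    "detail": {
        "order_id": 7,
        "customer": {"name": "Ana", "email": "ana@example.com"},
    },
}

# Nested dicts are parsed into nested model instances.
event = OrderEvent(**raw_event)

# Plain dict access: string keys, no completion or type information.
email_from_dict = raw_event["detail"]["customer"]["email"]

# Model access: typed attributes that an IDE or type checker can follow.
email_from_model = event.detail.customer.email

print(email_from_dict == email_from_model)
```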

I want to hear from other "customers" too.

@koxudaxi

koxudaxi commented Sep 1, 2020

@risenberg-cyberark
Great work!!
I will review the PR when it is unlocked.

@heitorlessa heitorlessa added the pending-release Fix or implementation already in dev waiting to be released label Oct 2, 2020
@heitorlessa heitorlessa unpinned this issue Oct 25, 2020
@heitorlessa
Contributor Author

Everyone - This is now available in the 1.7.0 release.

@heitorlessa heitorlessa removed the pending-release Fix or implementation already in dev waiting to be released label Oct 26, 2020

6 participants