
[RFC]: implement a broader range of statistical distributions #119

Open
7 tasks done
vivekmaurya001 opened this issue Mar 26, 2025 · 6 comments
Labels
2025 2025 GSoC proposal. received feedback A proposal which has received feedback. rfc Project proposal.

Comments

@vivekmaurya001

vivekmaurya001 commented Mar 26, 2025

Full name

Vivek Maurya

University status

Yes

University name

Indian Institute of Technology (Banaras Hindu University), Varanasi

University program

Bachelor of Technology in Electronics and Communication Engineering

Expected graduation

Aug 07, 2027

Short biography

I'm currently a 2nd-year undergraduate student pursuing Electronics Engineering. I have been interested in software development since my first year of college and have completed several relevant courses in my academic curriculum, including C Language, Data Structures and Algorithms, and Probability and Statistics.
Apart from this, I have done extensive work in JavaScript, TypeScript, React, Node.js, and MongoDB, developing projects in these technologies and winning multiple hackathons in my college. I also have a strong interest in blockchain technology and competitive programming.

Timezone

Indian Standard Time (UTC +5:30)

Contact details

email:- [email protected],[email protected], github:- vivekmaurya001

Platform

Linux

Editor

VSCode has been my preferred code editor since I began my development journey because of its ease of use, support for a large number of languages, and many useful extensions for Git and Docker. Debugging also becomes a lot easier in VSCode.

Programming experience

I have about 1.5 years of programming experience, during which I have developed many solo and group projects. Over this period, I have gained a strong foundation in React, Node.js, MongoDB, C/C++, JavaScript, Python, and backend development.

Here are some of my hackathon-winning projects:-

  1. ChatBuddy – Won a college hackathon.
  • I built this during a college hackathon. It was my first full-stack project, built with React, Socket.io, Chakra UI, Node.js, Express.js, and MongoDB.
  2. CryptoBox – Created for InterIIT selection.
  • It is essentially a crypto price dashboard that fetches real-time price data through an oracle and supports multiple wallet connections.
  3. ChainGamble – Won a hackathon conducted by Euclid Protocol.
  • It is a platform like Stake but with a unified liquidity pool. I contributed by integrating the Euclid Protocol API and handling the project's migration to Vite.

JavaScript experience

I started my backend development journey with JavaScript, and as I worked on more projects, I gradually gained confidence in the language. Initially, my focus was on building APIs and handling server-side logic using Node.js. Over time, I explored asynchronous programming, database interactions, and performance optimizations, which deepened my understanding of JavaScript beyond just scripting.

My contributions to stdlib expanded my JavaScript expertise much further. Working on statistical and mathematical functions and BLAS implementations helped me write efficient, structured, and performance-oriented code.

Node.js experience

While learning backend development, I was also exploring Node.js, and my understanding of it strengthened significantly through hands-on projects. Initially, I started with basic server-side scripting, but as I worked on more complex applications, I became familiar with asynchronous programming, event-driven architecture, and working with APIs.

C/Fortran experience

C was my first programming language, which I learned in my first year of college. I was very good at it. I also completed a course on data structures and algorithms in C and did competitive programming in C in my first year, then shifted to C++. I don't have much knowledge of Fortran, but I'm open to learning it when needed.

Interest in stdlib

I was curious about the Node modules I had used numerous times in my projects: what their structure was and how they could be built. That search ended when I found stdlib in December! Working with stdlib gave me the opportunity to contribute to a widely recognized library with millions of downloads.

What stood out to me the most was how well-organized everything is—the clear documentation, structured workflow, and attention to detail. It’s not just about writing functions; it’s about making sure they are implemented properly, run efficiently, and are thoroughly tested to maintain high quality.

After working with stdlib for over three months, I found myself focusing on quality contributions over quantity. The PR reviewing process with maintainers helped me improve a lot.

Version control

Yes

Contributions to stdlib

I have contributed to stdlib in multiple areas: adding constants in float32, adding functions in math/base/special, adding C implementations of distributions in stats/base/dists, updating native addons from C++ to C in stats/base, adding assert functions in math/base/assert, adding accessor array support to functions in stats/base, and adding WASM packages for BLAS functions. Below is a list of my PRs:

Merged:

  • 🔗 stdlib#3333 feat: add constants/float32/ln-half
  • 🔗 stdlib#3374 feat: add math/base/special/heavisidef
  • 🔗 stdlib#4118 feat: add C implementation for stats/base/dists/invgamma/stdev
  • 🔗 stdlib#4270 refactor: update stats/base/dstdev native addon from C++ to C
  • 🔗 stdlib#4183 refactor: update math/base/assert/is-even to follow latest project conventions
  • 🔗 stdlib#4214 feat: add math/base/assert/is-probabilityf
  • 🔗 stdlib#4765 refactor: update math/base/special/hypot to follow latest project conventions
  • 🔗 stdlib#5017 refactor: update math/base/special/kernel-tan
  • 🔗 stdlib#5335 feat: add support for accessor arrays and refactor stats/base/cumin
  • 🔗 stdlib#5634 feat: add blas/ext/base/wasm/dapxsum

Open:

  • 🔗 stdlib#5916 feat: add stats/base/dists/burr-type3/cdf
  • 🔗 stdlib#5801 feat: add stats/base/dists/burr-type3/pdf
  • 🔗 stdlib#5777 feat: add blas/ext/base/wasm/dnanasumors
  • 🔗 stdlib#3365 feat: add math/base/special/gammasgnf

Link to all my merged and open PRs

stdlib showcase

Signal Transform: In this project, I show how to use the standard library (stdlib) to efficiently perform the Discrete Fourier Transform (DFT) on time-domain signals. It demonstrates how stdlib can be used to process and visualize signals. More specifically, it includes:
• Working with complex numbers
• Handling double-precision floating-point numbers
• Generating signals using special math functions
This helps in understanding how signals can be transformed and analyzed.
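
For readers unfamiliar with the transform, here is a minimal sketch of a direct O(N²) DFT in plain JavaScript. It is not the showcase's code and does not use stdlib's APIs; the function name `dft` and the use of separate real and imaginary output arrays are assumptions made for illustration.

```javascript
// Naive discrete Fourier transform of a real-valued signal.
// Returns the real and imaginary parts of each frequency bin.
function dft( x ) {
	const N = x.length;
	const re = new Float64Array( N );
	const im = new Float64Array( N );
	for ( let k = 0; k < N; k++ ) {
		for ( let n = 0; n < N; n++ ) {
			// X[k] = sum over n of x[n] * exp( -2*pi*i*k*n/N )
			const angle = -2 * Math.PI * k * n / N;
			re[ k ] += x[ n ] * Math.cos( angle );
			im[ k ] += x[ n ] * Math.sin( angle );
		}
	}
	return { re, im };
}

// One full period of a cosine sampled at 8 points.
const signal = Float64Array.from( { length: 8 }, ( _, n ) => Math.cos( 2 * Math.PI * n / 8 ) );
const spectrum = dft( signal );

// Magnitude of each bin; the energy should concentrate in bins 1 and 7.
const magnitudes = Array.from( spectrum.re, ( r, k ) => Math.hypot( r, spectrum.im[ k ] ) );
console.log( magnitudes );
```

A production implementation would use a fast Fourier transform instead of the double loop; the direct form is shown only because it maps one-to-one onto the definition.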

Goals

The goal is to implement all important continuous, discrete, and multivariate statistical distributions found in SciPy in stdlib, along with their APIs for random number generation.
After successfully completing this project, stats/base/dists will have a wide variety of distributions with their parameters: PDF, CDF, mean, median, mode, logpdf, logcdf, mgf, entropy, kurtosis, skewness, variance, etc. Additionally, APIs will be available for generating random samples from any implemented distribution.
Throughout the implementation, I will ensure that quality and performance remain a top priority.

Here is my work plan:

Parameters to be Implemented
• Core Functions: CDF, PDF, Mean, Median, Mode
• Advanced Metrics: Log PDF, Log CDF, MGF, Quantile
• Statistical Properties: Variance, Standard Deviation, Skewness, Kurtosis
• Entropy Calculation
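
To make the list above concrete, here is a minimal sketch of what this set of functions looks like for one simple case, the exponential distribution with rate `lambda`, where every quantity has a closed form. The factory name `exponential` and the single-object layout are hypothetical and chosen only for illustration; they are not stdlib's package structure, where each function lives in its own package.

```javascript
// Hypothetical helper bundling the functions and properties listed above
// for an exponential distribution with rate `lambda` (illustration only).
function exponential( lambda ) {
	if ( !( lambda > 0 ) ) {
		throw new RangeError( 'lambda must be a positive number' );
	}
	return {
		// Core functions
		pdf: ( x ) => ( ( x < 0 ) ? 0 : lambda * Math.exp( -lambda * x ) ),
		cdf: ( x ) => ( ( x < 0 ) ? 0 : -Math.expm1( -lambda * x ) ),
		mean: 1 / lambda,
		median: Math.LN2 / lambda,
		mode: 0,

		// Advanced metrics
		logpdf: ( x ) => ( ( x < 0 ) ? -Infinity : Math.log( lambda ) - ( lambda * x ) ),
		quantile: ( p ) => ( ( p < 0 || p > 1 ) ? NaN : -Math.log1p( -p ) / lambda ),
		mgf: ( t ) => ( ( t >= lambda ) ? NaN : lambda / ( lambda - t ) ),

		// Statistical properties
		variance: 1 / ( lambda * lambda ),
		stdev: 1 / lambda,
		skewness: 2,
		kurtosis: 6, // excess kurtosis

		// Differential entropy
		entropy: 1 - Math.log( lambda )
	};
}

const dist = exponential( 2.0 );

// The quantile function inverts the CDF, so this round trip should recover 0.3.
console.log( dist.mean, dist.cdf( dist.quantile( 0.3 ) ) );
```

Using `Math.expm1` and `Math.log1p` instead of `1 - Math.exp(...)` and `Math.log(1 - p)` avoids loss of precision near zero, which is the kind of detail each implementation has to get right.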

I have classified all distributions in SciPy's statistical functions into 3 categories:

  1. distributions which can be easily implemented with current functionality
  2. distributions which are partially implementable
  3. distributions which require complex implementation

In the second type, some distributions require dependencies that have not been developed yet, which mainly hinders the mgf and entropy calculations. For these, I will check whether a simple closed form exists or whether they can be computed using approximations. If not, the other parameters can still be implemented.
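
As an illustration of the approximation route, the differential entropy h = -∫ f(x) ln f(x) dx can be estimated by numerical quadrature when no closed form is available. The sketch below uses composite Simpson's rule on a truncated interval; the function names, the choice of Simpson's rule, and the truncation bounds are assumptions for the example, not a decision made in this proposal.

```javascript
// Composite Simpson's rule for integrating g over [a, b] with n subintervals.
function simpson( g, a, b, n ) {
	if ( n % 2 !== 0 ) {
		throw new RangeError( 'n must be even' );
	}
	const h = ( b - a ) / n;
	let sum = g( a ) + g( b );
	for ( let i = 1; i < n; i++ ) {
		// Interior points alternate between weights 4 (odd) and 2 (even).
		sum += ( ( i % 2 === 0 ) ? 2 : 4 ) * g( a + ( i * h ) );
	}
	return sum * h / 3;
}

// Approximates the differential entropy of a density `pdf` on [a, b].
function differentialEntropy( pdf, a, b, n ) {
	return simpson( ( x ) => {
		const f = pdf( x );
		// Treat 0 * ln(0) as 0 so that zero-density points do not yield NaN.
		return ( f > 0 ) ? -f * Math.log( f ) : 0;
	}, a, b, n );
}

// Check against a case with a known closed form: exponential with rate 2,
// whose entropy is 1 - ln(2). The upper bound 20 truncates a negligible tail.
const approx = differentialEntropy( ( x ) => 2 * Math.exp( -2 * x ), 0, 20, 2000 );
console.log( approx, 1 - Math.log( 2 ) );
```

Validating the quadrature against distributions with known entropy, as done here, is one way to gain confidence before applying it to distributions that lack a closed form.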

In the third type, I have found the major blockers to be multivariate distributions, which require basic matrix functionality such as covariance matrix calculation, determinant, transpose, matrix multiplication, addition, subtraction, division, inverse calculation, trace calculation, and the Kronecker product.
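
To show why these matrix operations are blockers, here is a minimal sketch of the bivariate normal PDF, which needs both the determinant and the inverse of the covariance matrix. It hard-codes the 2×2 case so that it stays self-contained; the helper names `det2`, `inv2`, and `bivariateNormalPdf` are hypothetical, and a general implementation would rely on proper linear algebra routines (for example, a Cholesky factorization) instead.

```javascript
// Determinant of a 2x2 matrix given as nested arrays.
function det2( m ) {
	return ( m[ 0 ][ 0 ] * m[ 1 ][ 1 ] ) - ( m[ 0 ][ 1 ] * m[ 1 ][ 0 ] );
}

// Inverse of a 2x2 matrix (assumes a nonzero determinant).
function inv2( m ) {
	const d = det2( m );
	return [
		[ m[ 1 ][ 1 ] / d, -m[ 0 ][ 1 ] / d ],
		[ -m[ 1 ][ 0 ] / d, m[ 0 ][ 0 ] / d ]
	];
}

// PDF of a bivariate normal with mean `mu` and covariance `sigma`:
// exp( -0.5 * (x-mu)^T * inv(sigma) * (x-mu) ) / ( 2*pi*sqrt( det(sigma) ) )
function bivariateNormalPdf( x, mu, sigma ) {
	const d = det2( sigma );
	// A valid covariance matrix must be positive definite.
	if ( !( d > 0 && sigma[ 0 ][ 0 ] > 0 ) ) {
		return NaN;
	}
	const inv = inv2( sigma );
	const dx = [ x[ 0 ] - mu[ 0 ], x[ 1 ] - mu[ 1 ] ];
	const q = ( dx[ 0 ] * ( ( inv[ 0 ][ 0 ] * dx[ 0 ] ) + ( inv[ 0 ][ 1 ] * dx[ 1 ] ) ) ) +
		( dx[ 1 ] * ( ( inv[ 1 ][ 0 ] * dx[ 0 ] ) + ( inv[ 1 ][ 1 ] * dx[ 1 ] ) ) );
	return Math.exp( -0.5 * q ) / ( 2 * Math.PI * Math.sqrt( d ) );
}

// Density at the mean of a standard bivariate normal is 1 / (2*pi).
const peak = bivariateNormalPdf( [ 0, 0 ], [ 0, 0 ], [ [ 1, 0 ], [ 0, 1 ] ] );
console.log( peak );
```

In higher dimensions the same formula applies with a k×k covariance matrix, which is where general determinant and inverse (or factorization) routines become a prerequisite.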

As there is a large number of distributions in SciPy, I have highlighted those which have a broader range of usage or are very useful in physical applications. For each distribution, I counted the number of important physics domains it appears in. For example, the Maxwell distribution appears in statistical mechanics, thermodynamics, and fluid dynamics, and the Nakagami distribution appears in wireless communications and signal processing.

A much more detailed explanation of my classification and blockers is in the PDF below:

distribution implemetation (1).pdf

Implementation Plan:

  • As a rough plan, I will implement all highlighted distributions in type 1 first, then type 2, then type 3, and then their random APIs.

  • Before implementing any distribution, I will thoroughly read about and understand its parameters. For reference, I will follow these sources:
    1 - SciPy
    2 - NumPy
    3 - Julia
    4 - R stats

  • I will start with the type 1 highlighted continuous distributions, then discrete, then multivariate, and make sure to complete them by week 4.

  • Moving to type 2, starting with continuous distributions, I will implement those packages for which most parameters can be implemented. If the required functionality has been developed by that time, there is no problem; otherwise, I will discuss it with mentors and find an alternative approach that does not compromise quality or performance. I will try to complete this in weeks 5-6.

  • For type 3, I will follow the same order, continuous then multivariate. If the work on the blockers is done by that time, there is no problem; otherwise, I will start working on the random APIs of the implemented distributions.

  • For the random APIs, I will first check whether a specific performant method exists for the distribution; otherwise, I will try basic methods such as inverse transform sampling. I will follow the same order as the implemented distributions.
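
Inverse transform sampling, mentioned in the last step, draws u from a uniform distribution on [0, 1) and returns the quantile function evaluated at u. Here is a minimal sketch for the Burr type XII distribution, assuming the standard form with shape parameters `c` and `k` and CDF F(x) = 1 - (1 + x^c)^(-k). The function names and the injectable `rand` argument are hypothetical, not stdlib's random API.

```javascript
// CDF of the Burr type XII distribution with shape parameters c and k.
function burr12Cdf( x, c, k ) {
	return ( x <= 0 ) ? 0 : 1 - Math.pow( 1 + Math.pow( x, c ), -k );
}

// Quantile function, obtained by solving p = 1 - (1 + x^c)^(-k) for x.
function burr12Quantile( p, c, k ) {
	return Math.pow( Math.pow( 1 - p, -1 / k ) - 1, 1 / c );
}

// Draws n samples by pushing uniform variates through the quantile function.
// `rand` must return values in [0, 1), as Math.random does.
function sampleBurr12( n, c, k, rand ) {
	const out = new Float64Array( n );
	for ( let i = 0; i < n; i++ ) {
		out[ i ] = burr12Quantile( rand(), c, k );
	}
	return out;
}

const samples = sampleBurr12( 5, 2.0, 3.0, Math.random );
console.log( samples );
```

Accepting the uniform generator as an argument keeps the sampler testable and lets a seeded generator be swapped in for reproducible output. The method applies whenever the quantile function is cheap to evaluate; otherwise a distribution-specific method is usually preferable.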

If the work on the random APIs finishes early, I will implement the remaining distributions.

Why this project?

The proposed project is a very exciting opportunity to delve deep into statistical computations, focusing on implementing a wide range of probability distributions. This project will not only enhance my understanding of statistical modeling but also contribute to the broader JavaScript ecosystem. The key motivations behind my proposal are:

  • What excites me most is that I will be contributing to such a huge project, which will be used by millions of users.
  • In my third semester, I had Probability and Statistics as coursework, and I will be applying what I learned there in this project. My background in mathematics will also help me stay motivated throughout the project.
  • Implementing distributions requires reading research papers, understanding mathematical derivations, and translating them into code. This will sharpen my ability to break down complex problems and implement efficient solutions. This knowledge is also invaluable for fields like data science, finance, and scientific computing.
  • Engaging with mentors and fellow contributors will not only enhance my technical skills but also sharpen my problem-solving abilities and improve my communication and teamwork skills.

Qualifications

For this project, I will need a good understanding of JavaScript and statistical analysis. As mentioned earlier, I have been practicing development for over 1.5 years, working on full-stack projects that have given me in-depth knowledge of JavaScript concepts. Additionally, my coursework in Probability and Statistics during college has provided me with a solid foundation, which will be valuable for this project. I also feel confident in applying what I have learned in real-world scenarios.
Contributing to stdlib for over three months has helped me improve the quality of my PRs and refine my approach to coding. It has also allowed me to focus on areas I am truly interested in, making the experience both valuable and rewarding.

I have also worked on implementing the Burr distribution. Here are the links:

  • 🔗 stdlib#5916 feat: add stats/base/dists/burr-type3/cdf
  • 🔗 stdlib#5801 feat: add stats/base/dists/burr-type3/pdf
  • 🔗 stdlib#6394 feat: add stats/base/dists/burr-type3/logcdf

Prior art

Some basic distributions have already been implemented in stdlib. With the help of these, many other distributions can be implemented.

Implementations of such statistical functions can be found in libraries such as SciPy and R's stats package, which offer a vast collection of probability distributions, along with functions for PDF, CDF, quantile functions, and random sampling.

Commitment

  • I will be working approximately 30 to 35 hours per week during the active coding period.
  • May 1 to June 30: My university exams will conclude by May 9, after which I will begin contributing. As my semester will be over, I will have enough time to contribute.
  • July 1 to August 31: I will dedicate 20 hours per week during this period, as my new semester will start.

Additionally, once my exams are over, I plan to start working during the community bonding period to ensure the project stays on schedule and I meet all milestones.

I consider this a large project, so the project length will be around 350 hours or more.

Schedule

Assuming a 12 week schedule,

  • Community Bonding Period: I will start implementing highlighted independent distributions such as Burr, Burr12, dgamma, and exponweib, as listed under type 1 in the doc.

  • Week 1 to Week 4: In this period, I will complete all type 1 distributions. As there are many distributions in this category, it will take time to complete them properly.

  • Week 5 to Week 6: I will start on the type 2 highlighted distributions. As these require some functionality that is yet to be developed, more research is required, while the other parameters can still be implemented.

  • Week 7 to Week 10: I will finish any work remaining from previous weeks and start on the type 3 distributions. After completing them, I will start implementing the random APIs and try to finish them by week 10.

  • Week 11 to Week 12: After completing type 3 and the random APIs, I will implement the remaining distributions in the same order and complete as many as I can in these weeks.

  • Final Week: I will try to complete any remaining distributions by this week.

  • Post GSoC: After these 12 weeks, I will continue contributing. If some distributions are still left, I will complete them as time permits.

Notes:

  • The community bonding period is a 3 week period built into GSoC to help you get to know the project community and participate in project discussion. This is an opportunity for you to set up your local development environment, learn how the project's source control works, refine your project plan, read any necessary documentation, and otherwise prepare to execute on your project proposal.
  • Usually, even week 1 deliverables include some code.
  • By week 6, you need enough done at this point for your mentor to evaluate your progress and pass you. Usually, you want to be a bit more than halfway done.
  • By week 11, you may want to "code freeze" and focus on completing any tests and/or documentation.
  • During the final week, you'll be submitting your project.

Related issues

Issue #2

Checklist

  • I have read and understood the Code of Conduct.
  • I have read and understood the application materials found in this repository.
  • I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
  • I have read and understood the patch requirement which is necessary for my application to be considered for acceptance.
  • The issue name begins with [RFC]: and succinctly describes your proposal.
  • I have read and understood the stdlib showcase requirement which is necessary for my application to be considered for acceptance.
  • I understand that, in order to apply to be a GSoC contributor, I must submit my final application to https://summerofcode.withgoogle.com/ before the submission deadline.
@vivekmaurya001 vivekmaurya001 added 2025 2025 GSoC proposal. rfc Project proposal. labels Mar 26, 2025
@kgryte
Member

kgryte commented Mar 31, 2025

@vivekmaurya001 Thank you for opening this RFC. A few comments/questions:

  1. For type 2 dists, you list a number of missing special functions. Do you have an idea of how complex these would be to add? If they are involved, that could result in a delay in being able to make much progress on these dists.
  2. You've highlighted various distributions in the attached PDF. Can you explain a bit more why you want to focus on those distributions? E.g., are they more heavily used? If so, how did you determine why those distributions should take priority?

@vivekmaurya001
Author

vivekmaurya001 commented Mar 31, 2025

Thanks for reviewing @kgryte !
As special math functions are required in only some distributions, they won't hinder progress much. Some are also pretty simple to implement, so I will try to implement them independently in dists. What will mainly be required is integration in the entropy and mgf calculations. For these, I will check whether a simple closed form exists or whether they can be implemented using approximations. If that does not work, I will continue implementing the other important parameters and dists.
For the highlighted distributions, there is no clear way for me to tell whether they are heavily used or not. I also didn't find any list of important distributions in SciPy. So I read about each distribution and simply classified them by the number of different domains they are used in, such as high energy physics, nuclear physics, radiation physics, particle physics, thermodynamics, fluid dynamics, material science, signal processing, and wireless communication. The more domains a distribution appears in, the higher its priority.

Although this classification by domain might not be entirely correct, in the end I will also try to implement the distributions which I have not highlighted but which are required for the linked issue.

@vivekmaurya001
Author

Also, I was thinking of highlighting those distributions which are common to SciPy and R stats, and to SciPy and Julia, so it would be the best of both worlds. What's your opinion on this?

@kgryte
Member

kgryte commented Apr 3, 2025

Also, I was thinking of highlighting those distributions which are common to SciPy and R stats, and to SciPy and Julia

Yes, that seems reasonable.

@kgryte kgryte added received feedback A proposal which has received feedback. and removed needs feedback labels Apr 3, 2025
@vivekmaurya001
Author

@kgryte, any further suggestions or improvements from your side?

@kgryte
Member

kgryte commented Apr 7, 2025

Nothing else on my end!
