Skip to content

[RFC]: Hamming distance between two strings #836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
4 of 7 tasks
Planeshifter opened this issue Feb 1, 2023 · 14 comments
Open
4 of 7 tasks

[RFC]: Hamming distance between two strings #836

Planeshifter opened this issue Feb 1, 2023 · 14 comments
Labels
Accepted RFC feature request which has been accepted. difficulty: 3 Likely to be challenging but manageable. Feature Issue or pull request for adding a new feature. Good First Issue A good first issue for new contributors! JavaScript Issue involves or relates to JavaScript. priority: Low Low priority concern or feature request. RFC Request for comments. Feature requests and proposed changes. Utilities Issue or pull request concerning general utilities.

Comments

@Planeshifter
Copy link
Member

Planeshifter commented Feb 1, 2023

Description

This RFC proposes adding a function to calculate the Hamming distance between two strings.

The function should have the following signature (a: string, b: string): number.

The function should take two strings as arguments and return the Hamming distance between them. The Hamming distance is defined as the number of characters that have to be changed to convert one string to the other. Since it only allows substitutions, it can only be used to compare strings of the same length.

Additionally, in order to account for code points and grapheme clusters, we should add separate packages for dealing with each, as the underlying algorithms are likely to differ. We then can provide a more general API which unifies the underlying algorithms. Accordingly, we should create the following packages:

  • @stdlib/string/base/distances/hamming: compares UTF-16 code units.
  • @stdlib/string/base/distances/hamming-code-points: compares Unicode code points.
  • @stdlib/string/base/distances/hamming-grapheme-clusters: compares grapheme clusters (i.e., visual characters)

Once the above are completed, we can add

  • @stdlib/string/distances/hamming: unifies the above "base" packages and provides an option for specifying the computation "mode" (i.e., code_units, code_points, or grapheme_clusters, with grapheme_clusters being the default).

Related Issues

Related issues #151.

Questions

No.

Other

No.

Checklist

  • I have read and understood the Code of Conduct.
  • Searched for existing issues and pull requests.
  • The issue name begins with RFC:.
@Planeshifter Planeshifter added RFC Request for comments. Feature requests and proposed changes. Feature Issue or pull request for adding a new feature. Good First Issue A good first issue for new contributors! labels Feb 1, 2023
@Ahtaxam

This comment was marked as outdated.

@kgryte

This comment was marked as outdated.

rgizz added a commit to rgizz/stdlib that referenced this issue Nov 21, 2023
kgryte added a commit that referenced this issue Dec 15, 2023
PR-URL: 	#1166
Co-authored-by: Athan Reines <[email protected]>
Reviewed-by: Athan Reines <[email protected]>
Ref: #836
Ref: #151
@kgryte kgryte added difficulty: 3 Likely to be challenging but manageable. priority: Low Low priority concern or feature request. Utilities Issue or pull request concerning general utilities. Accepted RFC feature request which has been accepted. JavaScript Issue involves or relates to JavaScript. labels Feb 23, 2024
@marsian83
Copy link
Contributor

Hi, I would like to work on this
Can I be assgined this

@kgryte
Copy link
Member

kgryte commented Feb 29, 2024

@marsian83 Sure. Would you be open to working on @stdlib/string/base/distances/hamming-code-points?

@kgryte
Copy link
Member

kgryte commented Feb 29, 2024

Note that the UTF-16 code unit implementation has already been added. The code points implementation should iterate over Unicode code points (e.g., most Unicode characters and some emoji). You can find other code points implementation in the string/base/* namespace.

@marsian83
Copy link
Contributor

Note that the UTF-16 code unit implementation has already been added. The code points implementation should iterate over Unicode code points (e.g., most Unicode characters and some emoji). You can find other code points implementation in the string/base/* namespace.

Sure, I'll see how these work.
Thanks for the reference

@mayankkamboj47
Copy link

@kgryte Hello ! Can I work on the grapheme clusters part ?

@kgryte
Copy link
Member

kgryte commented Mar 2, 2024

@mayankkamboj47 Sure. You'll probably want to rely on next-grapheme-cluster-break in order to derive string indices for the purpose of comparing substrings.

@Pratik772846
Copy link
Contributor

@kgryte @Planeshifter If noone is working on @stdlib/string/base/distances/hamming-code-points, can i work on it?

@JayPokale
Copy link

@kgryte I created a pull request for Hamming distance between Unicode characters (#1948 ). Can you please review it.
Looking forward for your reply.

@JayPokale
Copy link

Hello there, can someone please review PR #1948 on @stdlib/string/base/distances/hamming-code-points.

@MynameisSanskar
Copy link

MynameisSanskar commented Nov 8, 2024

Hi @kgryte,

I am interested in contributing to the Hamming distance feature for comparing Unicode code points in the @stdlib/string/base/distances/hamming-code-points package. Could you please assign this task to me?

Looking forward to your guidance and feedback!

Thank you!``

@Anant1004
Copy link

Hello, everyone ! I'm interested in contributing to the Hamming distance feature. I noticed that parts of the implementation for different comparison modes (UTF-16 code units, Unicode code points, and grapheme clusters) have been discussed and some PRs have been submitted. Could someone please update me on the current status of this issue and let me know if there are any open tasks I could help with? Thank you !

@kgryte
Copy link
Member

kgryte commented Nov 18, 2024

@MynameisSanskar @Anant1004 As tracked in the OP, calculating the distance between two strings when comparing code points has not been completed. Initial attempt to add support can be found in #1948, but that PR stalled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted RFC feature request which has been accepted. difficulty: 3 Likely to be challenging but manageable. Feature Issue or pull request for adding a new feature. Good First Issue A good first issue for new contributors! JavaScript Issue involves or relates to JavaScript. priority: Low Low priority concern or feature request. RFC Request for comments. Feature requests and proposed changes. Utilities Issue or pull request concerning general utilities.
Projects
None yet
Development

No branches or pull requests

9 participants