Skip to content

mhchem.js update - first version - request for discussion #1414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed

mhchem.js update - first version - request for discussion #1414

wants to merge 5 commits into from

Conversation

mhchem
Copy link

@mhchem mhchem commented Mar 15, 2016

This is a "request for discussion" pull-request, because there is still some fine-tuning I'd like to do.

I am the inventor and author of the LaTeX package mhchem. MathJax includes an mhchem implementation that, over time, got a little bit disconnected from the LaTeX package. Having talked to dpvc and pkra, I'd like to take over maintenance of MathJax/mhchem and bring it up-to-date.

It resulted in a complete rewrite of the core part. This is my first take. Please discuss.

There are some inconsistencies. The most notable change is the switch to staggered charges, as has become the standard in recent years (IUPAC, many publishers). But there are a few other differences on close inspection. For testing and comparison, I downloaded 43.5k usages of \ce from chemistry.stackexchange.com. I went through 2k of them manually and checked the rest by script. A few hundred of these would break visually and I already fixed them on StackExchange. (Most of them were mere misuse of \ce, using it for non-chemical stuff.)

One thing, that still bothers me, are the reaction arrows. I consider it an important feature of mhchem, to have very readable input, not just nice output. Therefore, I would like to enforce a space before and after an reaction arrow. What keeps me from doing this, are the 500 or so StackExchange posts that I would need to edit. Still thinking. I don't know of any other large consumer of mhchem and would like to get it just right, just before others, like Wikipedia catch on.

As for this discussion, please comment on the code (quality, compatibility, style). Please discuss whether this should (with some incompatibilities!) replace the current mhchem extension or if this should become a 3rd-party extension.

BTW. It was not only a one-way street to bring LaTeX/mhchem features to MathJax/mhchem. I also have a list of 10 or so use cases, that I saw people using with MathJax, that I want to backport to LaTeX.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

First of all, thanks so much for your work on this so far. It is great to have the MathJax and LaTeX versions more in line. I'm also glad that the MathJax usage has inspired you to take some use cases back to the LaTeX version as well.

I am not worried about the differences like staggered indices. I think we can just mention that in the change log, and it will be OK. Can you make a more complete list of the differences that would affect the use of mhchem? We will need that for the change log, if nothing else.

I understand the desire to enforce your own view of correct spacing, but I'm not certain I would be too dogmatic about that. But it is your product, and if the LaTeX version enforces spacing rules, the MathJax version should as well. It might be nice to have a backward-compatibility mode, but we could always make the old mhchem available in the 3rd-party repository for those who want to continue using the previous version. I would like this new version to become the core version when it is ready.

I do have some comments on the code, which I will past separately below.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

My biggest concern about the code has to do with efficiency, in particular in the MhchemParser. Here, you create large objects and collections of functions dynamically each time a parser is created, even though these are basically static structures that really only need to be created once. For example, every invocation of the match() function creates the (large) f object and all its functions, all the stateMachines entries, and destroys them after the call is complete. In the test case \ce{H2O -> 2H + O}, the match() routine is called 178 times, so those data structures are created and destroyed 178 times for this relatively small equation. That means lots of needless compiling, memory usage, and garbage collection. The code you provide here is 2 to 4 times slower than the current version in my (simple, non-scientific) tests, which is not great.

It seems to me that your situation is taylor made to use prototypal inheritance for the methods and static copies of the structures, rather than dynamically creating them over and over. You are using closures in some instances, but only to get access to that, which could be easily passed as a parameter rather than using the needless overhead of closures.

If you don't want to create prototypes yourself, you can use MathJax's object model itself. E.g., you could do

  MhchemParser: MathJax.Object.Subclass({
    Init: function() {
      this.buffer = {};
    },

    parse: function (input, stateMachine) {
      ... (code using this.match(), etc.) ...
    },

    matches: {
      'empty': /^$/,
      'else': /^./,
      ...
    },
    match: function (m, input) {
      var f = this.matches;
      if (f[m]) === undefined) {
         ...
      } else if (typeof f[m] === "function") {
         ...
      } else {
         ...
      }
    },
    ...
  }),

There are similar things happening in a number of the service routines, like in texify(), where the types object is created and destroyed for every call to texify() (27 calls for the simple expression above). And within that types object, the function for bond creates the bonds array every time it is called, even though the array is completely static in nature. There is no need for the inefficiency involved in creating it within the bond function. I understand that it is convenient to include it there from a programming standpoint, but from en efficiency standpoint, it is not a good approach.

I think it would be good to reorganize the structure to create the structures once at compile time, rather than over and over at run time.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

Because the "Files changed" tab doesn't show the diff (since it is so large), I can't comment on individual lines directly, so I'll link them here.

Because object and array lookups are slower than variable lookups, it is more efficient to cache object and array looks up if they are used more than once. For example, in line 97, you could use

var transitions = that.stateMachines[stateMachine].transitions;
for (var iT = 0, nT = transitions.length; iT < nT; iT++) {
  var t = transitions[iT];
  ...

and similarly for the other loops.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

The lookup loops nested four layers deep seem a bit inefficient to me. In particular, something like lines 106 to 108 seem like they could be handled via direct lookup in an object rather than scanning through an array item by item, and doing a string search on the inner-most loop. Something like

  var action = t.actions[state] || t.actions['*'];
  if (action) {
    if (action.alias) action = t.actions[action.alias];
  ...
  }

where instead of

        {
          match: 'space',
          actions: [ { state: 'a', nextState: 'as' },
                     { state: '0|2', action: 'sb=false' },
                     { state: '1', action: 'sb=true' },
                     { state: 'r|rt|rd|rdt|rdq', action: 'output', nextState: '0' },
                     { state: '*', action: [ 'output', 'sb=true' ], nextState: '1'} ]
        }

you would have

        {
          match: 'space',
          actions: {
            a: { nextState: 'as' },
            0: { action: 'sb=false' }, '2': { alias: '0' },
            1: { action: 'sb=true' },
            r : { action: 'output', nextState: '0' },
            rt: { alias: 'r' }, rd: { alias: 'r' }, rdt: { alias: 'r' }, rdq: { alias: 'r' },
            '*': { action: [ 'output', 'sb=true' ], nextState: '1'}
        }

This approach will find the correct action immediately without having to loop through the options. The alias attribute can be used to allow multiple keys to use the same rule.

I suspect that some of the other loops could be handled similarly.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

For non-ascii characters, like those in line 184 and line 186, you should use unicode character references rather than explicit characters, which depend on the encoding. The encoding of the scripts loaded into a page depends on the encoding of the page itself, and we don't control that, so it is safer to use numeric references, as in

/^(?:[a-zA-Z\u03B1-\u03C9\u0391-\03A9]|.../

Similarly, in the error messages (like line 311), you should use \u2013 rather than an explicit en-dash.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

For things like the definitions of the various structures (like stateMachines at lines 398 through 403), you can do it all in one assignment, rather than making several assignments. This saves 24 or so characters for each assignment. For example,

      //
      // State machine transitions of main parser
      //
      that.stateMachines = {
        ce: {
          transitions = [
            {
              match: 'empty',
              actions: [ { state: '*', action: 'output' } ]
            },
            ...  (additional transitions) ...
          ],
          actions: {
            'o after d': function (m) {
              ...
            },
            ... (additional actions)
          }
        },
        ...  (additional machines) ...
      }

While this doesn't seem like much, those 25 characters repeated over and over do add up to several hundred characters saved. That does help reduce the size of the final mhchem extension that needs to be transferred to the browser.

@dpvc
Copy link
Member

dpvc commented Mar 21, 2016

The \color macro in MathJax is non-standard, and takes two arguments (the color and the math to be colored). On line 1150, you have used \color appropriately for this non-standard definition. However, MathJax has a color extension that redefined \color to be more compatible with the LaTeX usage, so on pages that include that, your use of \color will not be correct. You could use

 return '{\\color{' + buf.color1 + '}{' + texify(buf.color2) + '}}';

to make it work in both situations.

@mhchem
Copy link
Author

mhchem commented Mar 22, 2016

All of that really makes sense. Thanks!

I would have to learn how to do prototypal inheritance compatible with IE6. But, ... In fact, I only instantiate a new parser for keeping a local 'state' and 'buffer'. I could switch to good only procedural programming and just pass a newly created state/buffer object each time. That would solve the issues with instantiating constant objects all the time, wouldn't it?

Regarding '2': { alias: '0' }. That is indeed faster. However, it enlarges the code and makes it harder to read. I could also create the structure with real references instead of 'alias:'. That would be even faster. What do you think?

var a = {};
t.push({ match: 'space', actions: a });
a['a'] = { nextState: 'as' };
a['0'] = a['2'] = { action: 'sb=false' };
a['1'] = { action: 'sb=true' };
a['r'] = a['rt'] = a['rd'] = a['rdt'] = a['rdq'] = { action: 'output', nextState: '0' };
a['*'] = { action: [ 'output', 'sb=true' ], nextState: '1'};

@dpvc
Copy link
Member

dpvc commented Mar 22, 2016

I would have to learn how to do prototypal inheritance compatible with IE6.

I think we're willing to drop IE prior to IE9 at this point. In any case, the MathJax.Object.Subclass() call that I suggested handles all the prototype stuff for you, if you wanted to go that route.

I could switch to good only procedural programming and just pass a newly created state/buffer object each time.

That also works.

That would solve the issues with instantiating constant objects all the time, wouldn't it?

As long as the objects are not created within the routines you are calling, yes. That is, if they are all created outside the match() and MhchemParser() calls. The idea is to do them all once when the file is loaded rather than each time a function is called.

I could also create the structure with real references instead of 'alias:'. That would be even faster. What do you think?

That is fine. I was trying to make a version that could be used as an object literal rather than needing the extra variables like a in your example, but certainly what you have done works as well. There is some extra overhead in creating the structure this way (since the original empty object must be extended when each property is added to it), but as long as it is done once up front, that should be OK.

Thanks for being responsive to my comments. It is always traumatic to have someone else review your code.

@mhchem
Copy link
Author

mhchem commented Mar 23, 2016

The update might take me a while. (Do not worry, it wasn't traumatic. That's not the reason for my slow mode. Well, one thing. I was so proud to have my code IE6 compatible and now you say it wasn't necessary. I'll get over it! 😄)

@dpvc
Copy link
Member

dpvc commented Mar 25, 2016

No problem. I'm happy your version worked with IE6, and I suspect you will be able to do that for the update as well.

@mhchem
Copy link
Author

mhchem commented May 12, 2016

I incorporated all of your suggestions.

  • I create the lookup arrays only once. To compensate, I use traditional procedural programming to pass the buffer to each variables. This leads to an (at least) 50% reduced runtime for the parsing part. (And the timing is relatively stable. When timing my first version, I got large deviations between runs, possibly because the garbage collector kicked in sometimes.)
  • In for loops, I use the variable only once for lookup and cache that result.
  • I don't loop over the states in the stateMachine array, but use direct look-up. Only 2 of the 4 nested for loops remain.
  • There are no non-ASCII characters any more.
  • Definition of stateMachine is much shorter in source code.
  • \color enclosed in {}

This is the change log.

  • Staggered layout for charges (IUPAC style)
  • Improved spacing: 1/2 X^{2-} v, $n-1$ H^3HO(g), ^{22,98}_{11}Na, ...
  • Decimal amounts: 0.5 H2O, 0,5 H2O
  • Improved charge/bond/hyphen distinction: OH-(aq), \mu-Cl, -H- ...
  • Decimal and negative superscripts/subscripts: ^0_-1n-, ^{22.98}_{11}Na
  • Superscript electrons: OCO^{.-}
  • IUPAC fraction style: (1/2)H2O
  • Italic variables: n H2O, nH2O, NO_x, but pH
  • Improved scanning: Fe^n+, X_\alpha, NO_$x$, $\alpha$$a$, ...
  • Some unicode input, e.g. arrows
  • \bond{3}
  • {text} text escape
  • Arrow arguments now parsed as \ce: A ->[H2O] B, A ->[$x$] B
  • <--> arrow (detected by parser, but not yet typeset by MathJax core)
  • More opeators: A + B = C - D, \pm
  • Recursion works (\ce inside \ce)
  • Removed hardly used synonyms \cf and \cee
  • Excited state: X^*
  • Ellipsis vs bond: A...A, B...B, ...
  • Punctuation: , ;
  • Improved Fast-HTML rendering: \ce{A + _{a}X}
  • Side-effects for non-standard input

@dpvc
Copy link
Member

dpvc commented May 12, 2016

I have only taken a quick look, but I like what you have done. This looks much nicer, and the speed improvement is great to see. I'll give it a closer look next week, but my initial pass didn't spot anything, so I don't expect there to by any major issues.

Thanks for your hard work on this!

@dpvc
Copy link
Member

dpvc commented May 19, 2016

OK, I've gotten to go through your code more carefully, and I like what I see overall. I only have a few comments about performance. There is one fairly important change that (for the one fairly complicated expression that I tested) reduced the parsing time by 70 to 80%. That has to do with the definition of f inside the match() function at line 207. This f is created and destroyed each time match() is called (and my example had it called 867 times) even though the values in f don't depend on the parameters m and input that are passed to match() (at least not that I could see). So since f is constant, you can move it outside match(), e.g., by moving line 206 to between lines 349 and 350. That's all that is needed for the 75% improvement (though you could rename f to something more meaningful).

The other changes have much smaller impacts, but are still useful, and they are in the main loop in the go() function. Since stateMachine doesn't change within go(), you can cache mhchemParser.stateMachines[stateMachine] rather than re-evaluate it several times through each iteration of the outer loop. E.g.,

      var Machine = mhchemParser.stateMachines[stateMachine];
      iterateTransitions:
      for (var iT = 0; iT < Machine.transitions.length; iT++) {
        var t = Machines.transitions[iT];
        ...

and then use Machine in lines 147, 148, and 159 and 160. That gains another few percent of the parsing time.

Another minor improvement is to cache the Machine.transitions.length value as well:

      for (var iT = 0, iN = Machine.transitions.length; iT < iM; iT++) {

and similarly for the loop at line 143.

The only other thing that disturbs me is the repetition of the code in lines 147-154 at lines 159-166.

One solution would be to move that into a separate function that can be called in both places. Another would be to do the following:

            for (var iAA = 0, iAN = actions.length; iAA < iAN; iAA++) {
              var a = actions[iAA];
              var o, option = null;
              if (a.type) {option = a.option; a = a.type}
              if (typeof a === 'string') {
                if (Machine.actions[a]) {
                  o = Machine.actions[a](buffer, matches.match,option);
                } else if (mhchemParser.actions[a]) {
                  o = mhchemParser.actions[a](buffer, matches.match,option);
                } else {
                  throw ['InternalMhchemErrorNonExistingAction', 'Internal mhchem Error \u2013 Trying to use non-existing action (' + a + ')'];
                }
                output = mhchemParser.concatNotUndefined(output, o);
              } else if (typeof a === 'function') {
                o = a(buffer, matches.match);
                output = mhchemParser.concatNotUndefined(output, o);
              }
            }

There is a slight performance penalty to doing the test for a.type first, but it was negligible in my tests, and removing the duplicate code is a maintenance improvement.

Other than that, I think you are in good shape. Thanks for all your work on this.

@dpvc
Copy link
Member

dpvc commented May 19, 2016

Question: I notice that you have removed \cf and \cee. I don't know how much these have been used in the past. I know you did a lot of testing of existing equations. Can you say anything about how often (if at all) these are used in the wild?

@dpvc
Copy link
Member

dpvc commented May 19, 2016

Question: Your change from requiring \text{...} to only needing {...} may cause some existing equations to change their meaning. Do you have any idea of how many expressions will change on chemistry.stackexchange, for example?

@dpvc
Copy link
Member

dpvc commented May 19, 2016

@pkra, we need to decide how we want to handle the changes from the old to the new version. Do we want to simply replace the current version with this new one? Do we want to keep two versions for now? Do we want to move the old one to the contrib CDN and take this one? Do we want to move BOTH to the contrib CDN and not have it in the core at all? If we keep two versions in core, how should they be named? Should mhchem just be a shell that loads the old or new version based on a configuration parameter?

@dpvc
Copy link
Member

dpvc commented May 19, 2016

PS, some of the breaking changes include the change from \text{...} to just {...}, that super- and subscript now are staggered, and the fact that spaces now can be significant. For example, \ce{^1_0A} is different from \ce{^1_0 A}; the former produces prescripts on A, while the latter makes staggered script on an empty base, followed by a space and an A.

@mhchem
Copy link
Author

mhchem commented May 19, 2016

I already have a fix for \ce{^1_0 A} on my disk, which I will push in a few minutes. Input like that was never planned, but as it would not break anything, I included it.

There are a few inconsistencies that I'd rather not support like \ce{A^{2}-} (as it would break some bonds, but only occurred twice or so in 44k) and \frac[1][2] (as I think this is too bar off the LaTeX syntax).
I fixed almost all instances on StackExchange that would lead to errors or semantically wrong output. A few are still to do.

Regarding \cf, it was used only a few dozen times. I fixed it all on SE and realized that it came down to two or so authors who used it. There was only one instance of \cee. (Numbers are estimations, off my head. I did not write them down.)

A push with your code suggestions will follow later.

- support for exotic input like ^{12}_6 C
- improved adaptive spacing before macros (2 \color{red}{H2O}, 2\color{red}{H2O})
- support for amounts like 2$ H2O (2n H, 2nH, n$ H, n, 2$ H, 2)
- recognize space-comma as comma
@dpvc
Copy link
Member

dpvc commented May 19, 2016

Thanks for the additional information on the inconsistencies. It sounds like you have StackExchange covered. As long as we provide a means of continuing to use the old version (even if that takes a configuration change), I think that should be sufficient for the others. @pkra, do you have any feelings on that?

@pkra
Copy link
Contributor

pkra commented May 20, 2016

@dpvc wrote

we need to decide how we want to handle the changes from the old to the new version.

Right. I don't know the new version yet so maybe better discuss this at the next F2F?

@dpvc
Copy link
Member

dpvc commented May 20, 2016

@pkra wrote

I don't know the new version yet so maybe better discuss this at the next F2F?

I don't know all the differences, either. The initial post includes a discussion of these. @mhchem, could you say what edits you had to make to the chemistry StackExchange site? Was it just the changing the misuses of \ce, or where there additional changes?

The one that concerns me is that {...} now acts like \text{...}, and I don't know what impact that will have. I looked at the files in our test suite, and this did impact at least one of them, but that was an unusual case.

@mhchem
Copy link
Author

mhchem commented May 20, 2016

Oh, I do not have a record of what I changed at StackExchange. In our first conversation you were quite relaxed regarding incompatible changes, so I did not keep a fine-graned documentation.

Most of the edits were definitely regarding misuse, i.e. using it for non-chemical things. Some things were rare non-intended use, like a unicode character that no one else ever used. Some things were bugs of the old version, e.g. \ce{^4_2^He}.

And some things were really incompatiblities like \ce{->[K]} or \ce{X^*}. (A couple of these can only be changed at StackExchange after the update).

All my edits are listet at http://chemistry.stackexchange.com/users/24052/mhchem?tab=activity&sort=all (Often, it's hard to really see the changes unless you switch to markdown view). At the beginning, I made some edits that I thought will become incompatibilities, but I later decided to support this kind of input. Would you think it is necessary that I revisit all my edits and categorize them?

@dpvc
Copy link
Member

dpvc commented May 20, 2016

No, sorry, I just thought you might remember what the main issues were. If most were incorrect usage, then I'm not concerned. I just thought it would help us decide how to proceed if we knew the nature of the things that might go wrong when the new version is in place.

@dpvc
Copy link
Member

dpvc commented May 20, 2016

Just to be clear, I don't think you need to go back through your edits to categorize.

@mhchem
Copy link
Author

mhchem commented May 23, 2016

The performance improvements by your suggestions are quite impressive. Thanks!

(Regarding the 'patterns' object, I just changed indentation. Surprisingly, the diff algorithm cannot the that.)

@dpvc
Copy link
Member

dpvc commented May 27, 2016

Your changes look good. I think it is ready to go (though you could still cache mhchemParser.patterns[m] in mhchemParser.match, but that is very minor). We still need to decide how to handle replacing the old version best, so I'm holding off merging until we know how that will work.

@mhchem
Copy link
Author

mhchem commented Jul 14, 2016

Superseded by mathjax/MathJax-third-party-extensions#28

@mhchem mhchem closed this Jul 14, 2016
@pkra
Copy link
Contributor

pkra commented Jul 14, 2016

And also #1537

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants