Skip to content

The end of orddicts as lists? #216

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
josevalim opened this issue Mar 31, 2012 · 10 comments
Closed

The end of orddicts as lists? #216

josevalim opened this issue Mar 31, 2012 · 10 comments

Comments

@josevalim
Copy link
Member

Implementing the access protocol has exposed a flaw in the current approach of having orddicts as lists, without wrapping them. The issue is that an orddict is simply a list of two-items tuples:

[a: 1, b: 2]
[{:a,1},{:b,2}]

That said, when we use the access protocol as in variable[1], it could either mean: getting an item from the list (in the orddict case this item would be a tuple) or getting a value in the orddict whose key is 1. We have five options to solve this problem:

1. Do not implement the access protocol for orddicts

This means variable[1] would always be considered a list access, there is no quick access to the orddict values except by Orddict.get(orddict, key). I don't like this approach because it makes orddict access verbose and it is quite common in Elixir due key-value args.

2. Do not implement the access protocol for lists

This means variable[1] would always do a key lookup in the orddict. Accessing list items could be done with the list module. This is not that bad considering developers shouldn't be accessing lists by indexes in any case (since it is linear). However, we may extend the protocol in the future to support ranges variable[1..3] and sometimes a reverse access is convenient (as in variable[-1]) which makes this option not as good.

3. Have a mixed protocol

One option is to have variable[1] meaning list access and variable[:foo] (with an atom) meaning orddict access. This will handle most of the cases with key-value args, but may be a source of confusion.

4. Create a specific data type for Orddicts

Another option is, instead of making orddicts as lists, we could wrap them in a record, which would basically be { Orddict, list }. This allows them to have their own protocols and therefore we could have both semantics applying clearly. Theoretically, wrapping orddicts in a tuple would reduce performance but this couldn't be verified in practice.

5. Limit the scope of Orddicts

A final option is to limit the scope of Orddicts to only have atoms as keys. This would allows us to still represent the orddicts as lists and have the mixed access as in the option 3. We could eventually rename the Orddict module to Keyword. A general purpose Orddict class could be added somewhere down the road.

Conclusion

I am currently divided between options 4 and 5. More feedback is welcome.

/cc @wycats @Raynes @alco @rafaelfranca @fxn

@fxn
Copy link
Contributor

fxn commented Mar 31, 2012

I'd personally lean on option 4. A new data type, because the ordering in orddicts is not lineal, they are not lists of tuples in the simple sense of the expression where list implies order by position (push order).

In some languages hash tables are ordered, insertion order. But they do not offer index-access. Hashes in Ruby 1.9 are one example of this. They may offer iterators, keys or values accessors that respect the order, but not an index-based access.

From my perspective, it would be natural that the interface of orddicts be dominated by keys as well.

It could be the case that Elixir users would expect index-based access on the sorted ordering of orddicts, don't have enough context to know if that is the case. But if that was not expected, I think I would not provide it.

Then, the language can afford having ordinary tuples in orddicts, rather than special-casing them to constrain them.

Just food for thought :).

@EricHripko
Copy link

IMHO: Orddict should be implemented as binary tree.

@rafaelfranca
Copy link
Contributor

I prefer the option 4 too.

@Raynes
Copy link
Contributor

Raynes commented Mar 31, 2012

Option 4 seems like the best option here. I'd feel strange not supporting the access protocol on both lists and orddicts

@KlausTrainer
Copy link

I'd prefer option 4, as it is the cleanest. At this point caring too much about a possible performance decrease would be premature optimization. There will be options for reimplementing orddicts to perform better than plain lists anyway.

Generally, I think that it's not a good idea to choose an implementation that would automatically put constraints on how we can use or what we can implement for other types.

orddicts deserve their own type. They probably shouldn't represent just a different view upon lists (i.e., only provide an API for dealing with lists), without having their own type that can be inspected not only at compile- but also at runtime (e.g. with is_orddict/1).

If the internal representation is a list, so well. However, making it easy (or even encourage) to treat orddicts as lists feels wrong to me. Aside from possibly creating that said source of confusion, changing the implementation of orddicts will very likely not be possible without breaking existing code then, which also feels wrong to me.

@alco
Copy link
Member

alco commented Mar 31, 2012

I propose a mix of 2, 4, 5. This post goes kind of off-topic in the end, but since we touched on the Orddict question, I'd like to express some thoughts on the subject.

Having access protocol for lists does not sound compelling to me. The main reason is that lists are linked, not contiguous. They shouldn't be used like arrays are used in other langs. Therefore, I think is it better to use List.nth, List.last for element access. Lisp probably has the longest history with linked lists and it always had car, cdr, nth, etc. interface for lists, and it's perfectly fine.


As for dicts, I think Orddict might become a source of bugs because it doesn't abstract its implementation details. Erlang has several data types for storing key-value data for historical reasons. Elixir could provide a better abstraction with a single Dict module that would define a separate constructor for each dict type, but all of the element access functions will be the same for all types (they can be specialized for selected types via protocols). Consider an example:

d = [a: 1, b: 2, c: 3]
x = Orddict.get d, :a
modified_x = f(x)
Orddict.put d, modified_x

If at some point I decide to replace Orddict with another kind of dictionary, I'm screwed. I could do refer OtherDict, as: Orddict, but it can quickly become a maintenance hell.

On the other hand, having a single module for associative containers, we would do the following:

d = [a: 1, b: 2, c: 3]
x = Dict.get d, :a
modified_x = f(x)
Dict.put d, modified_x

Now, if I want to use a different kind of dict, I only need to change the constructor.


Coming back to the access protocol, basically my position is as follows:

  1. Provide a better abstraction for key-value containers. Users must not be tempted to meddle with Orddicts (and break their own code) only because they are implemented as lists of tuples.
  2. Probably limit Orddict keys to atoms only. If this is a best practice already, making it a requirement would not hurt. It could raise problems with Erlang interoperation, however. It's not going to be easy to maintain proper balance between abstracting away implementation details of data types and still providing clean interoperability with Erlang code.

As a closing thought, I think it's definitely better to add something down the road then remove something. That is to say, limiting access protocol to be used with dicts only does not prevent you from adding it to lists later on.

@alco
Copy link
Member

alco commented Mar 31, 2012

I forgot to mention the main point I was trying to make. Please consider this previous statement invalid

Provide a better abstraction for key-value containers. Users must not be tempted to meddle with Orddicts (and break their own code) only because they are implemented as lists of tuples.

Here's what I really meant to say.

I believe we all recognize the importance of protocols. Any type should be treated according to the functionality it provides, not the way it's implemented. If a type implements the Dict protocol, it means that the type guarantees it can associate values with keys and do faster-than-linear key lookup, and the user can rely on it. Everything else is volatile.

If this concept is stated clearly throughout the documentation, users generally shouldn't care whether their Orddict is just a list. In other words, it's not implementor's problem if a user has broken his Orddict by treating it as a list. If the docs properly convey the idea that a user should rely only on those things that are promised in protocols, the user should not be tempted to treat Orddict as a list.

@josevalim
Copy link
Member Author

@alco, good points. The decision 4 will already hide the fact an Orddict is simply a list of tuples, so we will be half way. I also think a Dict protocol will be needed somewhere down the road.

About having an integer accessor for lists, I will cave in. I will remove the integer based access, because it is a bad practice, although I will likely allow the list[1..10](when we have ranges) somewhere down the road.

Thanks everyone for the feedback.

@josevalim
Copy link
Member Author

I am halfway the implementation of the proposal #4 and making orddicts no longer be lists is introducing, much, much pain in the codebase. The main reason is because a pair of tuples is a common abstraction in Erlang and if we move away from it, interacting with Erlang code becomes (much) harder.

So I am leaning towards option #5. I will rename the wording from orddicts to keywords and the default Keyword (following Smalltalk parlance) module only accepts atoms as keys. We will eventually provide one of Orddict, Dict, GBTrees modules for generalized dictionaries.

@alco
Copy link
Member

alco commented Apr 1, 2012

The original problem as stated in the first post:

when we use the access protocol as in variable[1], it could either mean: getting an item from the list (in the orddict case this item would be a tuple) or getting a value in the orddict whose key is 1.

The least confusing solution is to not implement access protocol for lists at the moment. This solves the ambiguity. I'm voting for this one.


The discussion proposing different options has been raised due to the fact that we want to implement access protocol for lists after all, so now we need a new dict. One that would work seamlessly with Erlang like orddict does.

José says that the problem with the { Orddict, [<values>] } approach is that many Erlang functions accept lists of tuples (proplist). We can use a simple [option1: <value>, option2: <value>] syntax for this purpose with the current orddict, but it'll become impossible with a new one.

What can be done? In my previous remark about protocols I was describing an implementation that could look somewhat like the following:

# This syntax remains as a shortcut for creating list of tuples.
# NOT orddicts, mind you
[c: 1, b: 2, a: 3]
#=> [{:c, 1}, {:b, 2}, {:a, 3}]

:erlang.some_function_accepting_proplist [option: <value>]
#=> :ok

dict = Dict.new_orddict [c: 1, b: 2, a: 3]
#=> { :orddict, [{:a, 3}, {:b, 2}, {:c, 1}] }

dict = Dict.new_orddict [:c, :b, :a], [1, 2, 3]
#=> { :orddict, [{:a, 3}, {:b, 2}, {:c, 1}] }

Dict.get dict, :a
#=> 3

Dict.put dict, :b, 100
#=> { :orddict, [{:a, 3}, {:b, 100}, {:c, 1}] }

tree = Dict.new_gb_tree [c: 1, b: 2, a: 3]
#=> { :gb_tree, <something> }

# An alternative constructor. Basically, GBTree contains nothing
# but constructors like .from_orddict, .from_pairs, ...
tree = GBTree.from_orddict dict
#=> { :gb_tree, <something> }

Dict.get tree, :a
#=> 3

# Dict.to_orddict can already be implemented with protocols
:erlang.some_function_accepting_orddict Dict.to_orddict dict
:erlang.some_function_accepting_orddict Dict.to_orddict tree

keyword = Dict.new_keyword [{1, 1}, {2, 2}]
#=> RuntimeError: keys must be atoms

keyword = Dict.new_keyword [a: 1, b: 1]
#=> { :keyword, [{:a, 1}, {:b, 1}] }

:erlang.some_function_accepting_proplist Dict.to_orddict keyword

So, yes, if we want to pass orddicts between Elixir and Erlang, we will need glue. But the described approach allows us to implement access protocol for all kinds of dicts and lists while still maintaining interoperability without glue in some cases. In the worst case, we could add a syntactic sugar to replace Dict.to_orddict.

So, basically we can get rid of glue by not implementing access protocol for lists (#2) now or restricting orddicts to use only atoms as keys (#5).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

7 participants