Read Online
Download PDF
Additional resources
Revision history
Known typos/bugs
Report a bug
License terms
About the author
"node290_1.gif" "node290_2.gif" "node290_3.gif"

Union, Intersection and Complement: performance  with a user-defined SameTest option

As we mentioned in the text, the operations Union, Intersection and Complement with the SameTest option may perform slower or much slower than without it. We will only illustrate this for Union, but the same is true also for the other two.

Consider some test list


For example, we may consider elements the same if they are either equal or  differ by 2:



The topics of rules, options and pure functions, which syntax we just used, may have not been covered yet, in which case  ignore the syntax details for now.

The problem is the following. The Union operation based on the default sorting function is very fast, but it may become a lot slower with a user-defined SameTest option. It shares this property with the Sort function (to be discussed next), essentially due to the way that Union operation is organized: it first sorts the list and then the same elements will always be adjacent to each other and thus easier to eliminate. To illustrate this point, consider a larger list:







We see that it is more then a 1000(!) times slower with this non-trivial "sameness" function, for this size of the list.

This illustrates several things. First of all, if we think of it, the specific problem and the notion of "sameness" as formulated above is ill-posed, because depending on the order in which the Union operation is performed, we will get different results. For example, consider  a list {2,4,6}: if <4> is eliminated first, we get {2,6}, but if <6> is eliminated first, we get just {2}. Essentially the problem here is that our notion of sameness is not transitive. Perhaps a more meaningful formulation in this case would be to locate all chains of numbers with each one different by 2 from the next one in the chain. In any case, one has to make sure that the problem is well-formulated, and the sameness function better be transitive.

However, ignoring this issue for the moment, the sameness function above illustrates another point well: it is a stronger requirement to provide a sorting function (which we did not do, and also the syntax of Union does not allow us to), than to provide the sameness function, because in the former case we have to define not just the notion of same, but also notions of greater and smaller (actually, for sorting purposes, the notion of same is less important than the latter two). However it is exactly the existence of the sorting function (criteria) which allows to map our set (list) to say natural numbers and thus  reduces the computation time from quadratic in the length of the list (that is, if we just compare all elements pairwise), to log-linear. And just the fact that the built-in function Union takes the sameness function does not mean that it translates it efficiently into a sorting function - this is not always possible at all, and in any case is a (non-trivial in general)  programmer's task. Thus, we should not expect miracles, but rather should reformulate the problem such that the proper sorting function is available (if possible, of course).

In fact, it is even better if we can reformulate a problem such that instead of the sorting function applied to elements of our list, we can use some key function which computes a key (say, integer number) for each element in the list, so that the majority of subsequent operations are performed with keys rather than the original elements. In Mathematica  such approach often gives a large speed advantage since many operations are much faster with numbers  than with arbitrary symbolic expressions. In cases when such key function is available, there are several techniques which can be efficiently used to replace Union.  We will exploit this technique many times later, but for now let us just consider another problem as an illustration of these statements.

We will use the same large list as above, but define two numbers the same when they have the same remainder of division by 60. Here is (without explanation) the better code to eliminate the duplicate elements:



This is how long it takes to do so:



For versions of Mathematica  earlier than 5.0, where Reap-Sow operations were not yet available, one can use the following code (slight extension of the technique used by Carl Woll), which takes about twice longer to execute than the one above (once again, we provide it here for illustration and timing comparison, so please ignore the code for now - we will revisit it later):



And this is the result of Union with the SameTest option and its timing:



We get about 25 times speed-up with Reap-Sow method and about 10 times with the Woll's technique, with respect to the one using Union, for this size of the list. If we make a list larger, the difference will be even more dramatic.

To conclude this rather long discussion, there can be a huge difference in performance of Union depending on whether it is used in its "pure" form or some "sameness" function is provided. In the latter case, and if the list is any large, it is advisable to first analyze the problem, because there could be superior alternatives. Also, this behavior is not entirely the fault of Union, but partly reflects the fact that there is no general efficient solution for eliminating same elements from the list if all we have is just the sameness function, but not a comparison function.



"node290_23.gif" "node290_24.gif" "node290_25.gif"

Created by Wolfram Mathematica 6.0  (05 February 2009) Valid XHTML 1.1!