Links

Categories

Tags


« | Main | »

Removing duplicates from a string array

By Jewe | December 13, 2014

Removing duplicate values from an array of strings is not a common necessity, therefore there’s no built-in function for this in JewelScript’s string or array class. However, there are cases where this kind of functionality is required. Here’s a tip on how to do it efficiently.

Our first approach

The simple approach to filtering a list of strings would be to use nested loops. We’ll start with this simple, but not very efficient example and work our way to a smarter implementation:

import stdlib;

function string main(const string[] args)
{
    string[] names = {"Judy", "Rick", "Helen", "James", "Sandra", "Rick", "Elisabeth", "Judy"};
    string[] res = RemoveDuplicates(names);
    for (int i = 0; i < res.length; i++)
        stdlib::println(res[i]);
    return "";
}

function string[] RemoveDuplicates(string[] values)
{
    string[] result;
    // iterate over each value in 'values'
    for (int i = 0; i < values.length; i++)
    {
        string v = values[i];
        // check if 'v' is already in 'result'
        int j;
        for (; j < result.length; j++)
        {
            if (v == result[j])
                break;
        }
        // add 'v' to 'result' if not found
        if (j == result.length)
            result += v;
    }
    return result;
}

The above RemoveDuplicates() function will serve our purpose, but it’ll cause a lot of iterations. In fact, the number of iterations will grow exponentially due to the two nested loops. So if we pass a very large array of, say, thousand values, then this will become very slow.

Fortunately there are better ways. The built-in types of the JewelScript runtime offer enough native functionality to do most of the algorithm natively. The key to this new implementation is the table class.

Using the hash table class

Hash tables are predestined for this kind of purpose, because they use an arbitrary string value as the ‘key’ to look up any value stored in them. On top of that, hash tables are extremely fast when accessing a value by their key. So what we need to do is to store every value from our string array in a hash table, and then convert the table back to an array:

function string[] RemoveDuplicates(string[] values)
{
    // a table to filter our array
    table filter;
    // store all values in the table
    for (int i = 0; i < values.length; i++)
        filter.set(values[i], values[i]);
    // move all table values into an array and return it
    return filter.toArray();
}

The interesting aspect of hash tables is, the storage location they use is determined by the entropy of the key. This means, any possible key-value has only one storage location in the table. Because we use the string value from the array as a key, any duplicate string value will just overwrite any previous value in the table.

In even simpler words: If the array contains ‘Rick’, then ‘Rick’ is stored under the key ‘Rick’ in the table. Should ‘Rick’ occur again in the array, then the new ‘Rick’ will just overwrite the old ‘Rick’ we already stored.

Converting the entire content of the table back to a string array is very simple. As you can see, the table class has a toArray() method for that. It is noteworthy that this method is highly recursive internally, so it will become slower as the table grows. However, since it’s implemented in native C, it’s still way quicker than anything we could do with nested loops in JewelScript.

Further room for optimization

There’s still a loop in our function that we could get rid of, if we wanted to: The loop that puts all string values from the array into the table. We could speed that up by using the array’s enumerate() method. This will do the loop natively and just call the essential piece of script code as a delegate:

function string[] RemoveDuplicates(string[] values)
{
    // a table to filter our array
    table filter;
    // store all values in the table
    values.enumerate((v, t) => { t.table::set(v, v); }, filter);
    // move all table values into an array and return it
    return filter.toArray();
}

This probably needs a little explaining. The enumerate method accepts two parameters:

  1. A reference to a function that should be called for every item in the array
  2. An additional reference to any data we wish to pass on to the function

As you can see, we pass a lambda-expression as the first parameter, basically a local “mini-function”. The second parameter is our filter table.

When the lambda-expression is executed by the enumerate method, it gets passed two arguments, which we have named v and t. The first argument ‘v’ is the current iterator value from our string array. The second argument ‘t’ is our filter table.

The delegate declared for the enumerate method is generic, because it must work with any type of data an array might contain. Therefore both arguments of the delegate are type-less variables. This requires us to explicitly cast ‘t’ to table in the lambda-expression. Hence the expression t.table::set(v, v).

There is another noteworthy thing about the table::toArray() method. Due to the recursive algorithm used by the method, the resulting array will be sorted in ascending alphabetical order (more precisely ascending ASCII-code order) of the key. Depending on your application, this may be desired or not. It’s also the main difference (aside from efficiency) to our first approach, which will keep the order of values in the array intact.

Topics: coding tips, news | Comments Off on Removing duplicates from a string array

Comments are closed.