Friday 12 August 2016

Scanning is slow - should I split?

Almost 4 years ago, I wrote a post entitled Scanning is slow.  Well, to be fair, it is.

The example code that I gave was for counting how many times a substring appeared within a string...

  temp = list
  total = 0
  scan temp,"ABC"
  while $result > 0 )
    total = total+1
    temp = temp[$result+3]
    scan temp,"ABC"
  endwhile

I gave a perfectly simply alternative which performed much better in this scenario, using $replace and $itemcount

However, I found myself in a similar situation yesterday, but I couldn't use this trick, because I needed to loop through, searching for certain substrings, but without changing the string that I was looping through.

A thought occurred to me, which was that I was always taking the string from after the value I'd found and scanning again, so in this case, what about using $split instead?

To compare against to above, I came up with something like this...

  temp = list
  total = 0
  $result = $split(temp,1,"ABC",lhs,rhs)
  while ( $result > 0 )
    total = total+1
    temp = rhs
    $result = $split(temp,1,"ABC",lhs,rhs)
  endwhile

I tested this by doing 500 iterations, on a string with 500 instances (so total was 250000 at the end)...

$scan: 20.94, 20.67, 20.85 - almost 21 seconds.
$split: 27.09, 27.48, 26.83 - around 27 seconds.

So $split definitely looked worse off, it looked like $scan was better suited to what I needed (as I wasn't interested in the left hand side).

However, these tests have been seeing whether I can use $split to improve my $scan loop, and this seemed a little unfair now I'd proven it couldn't, so I thought I'd flip it on it's head and see if $scan is better at doing what $split is designed for.


So for $split this is quite easy, a simple line which splits the string into two parts...

  $result $split(temp,1,"~",lhs,rhs)

Using $scan this is a little more complicated, and requires a little string manipulation...

  $result $scan(temp,"~")
  lhs = temp[1:$result-1]
  rhs = temp[$result+1]


When I compared the two over 2,000,000 iterations, I got these results...

$split: 10.19, 9.78, 10.01 - around 10 seconds.
$scan: 16.60, 16.51, 16.82 - over 16 seconds.

So I've worked out that $split is good for splitting and $scan is good for scanning... Why am I writing this up in a blog post, you might wonder!

Well I then considered the fact that I was splitting a string on a delimiter that I knew existed.  And in the real world situation I was working on, this wasn't a fair assumption, as often the delimiter would not be there.  So I tried this again, but with a different delimiter, that wouldn't be found in the string...

$split: 10.66, 10.73, 10.87 - a little over 10 seconds.
$scan: 9.20, 9.00, 8.44 - around 9 seconds.

So it looks like it's not quite as straightforward as I thought.  However, the saving when it is found is quite a lot larger than the loss when it's not found, so maybe $split is still best overall.

Summary: If you're looking to scan through a string, use $scan, and if you're looking to split a string (especially if you expect the delimiter to be there), use $split.