finding common prefix of array of strings

0 votes
asked Aug 26, 2009 by bumperbox

I have an array like this:

$sports = array(
'Softball - Counties',
'Softball - Eastern',
'Softball - North Harbour',
'Softball - South',
'Softball - Western'
);

I would like to find the longest common prefix of the string. In this instance, it would be 'Softball - '

I am thinking that I would follow this process

$i = 1;

// loop to the length of the first string
while ($i < strlen($sports[0]) {

  // grab the left most part up to i in length
  $match = substr($sports[0], 0, $i);

  // loop through all the values in array, and compare if they match
  foreach ($sports as $sport) {

     if ($match != substr($sport, 0, $i) {
         // didn't match, return the part that did match
         return substr($sport, 0, $i-1);
     }

  } // foreach

   // increase string length
   $i++;
} // while

// if you got to here, then all of them must be identical

Questions

  1. Is there a built in function or much simpler way of doing this ?

  2. For my 5 line array that is probably fine, but if I were to do several thousand line arrays, there would be a lot of overhead, so I would have to be move calculated with my starting values of $i, eg $i = halfway of string, if it fails, then $i/2 until it works, then increment $i by 1 until we succeed. So that we are doing the least number of comparisons to get a result.

Is there a formula/algorithm out already out there for this kind of problem?

16 Answers

0 votes
answered Jan 26, 2009 by redtuna
  1. not that I know of

  2. yes: instead of comparing the substring from 0 to length i, you can simply check the ith character (you already know that characters 0 to i-1 match).

0 votes
answered Jan 30, 2009 by porg

@bumperbox

  1. Your basic code needed some correction to work in ALL scenarios!

    • Your loop only compares until one character before the last character!
    • The mismatch can possibly occur 1 loop cycle after the latest common character.
    • Hence you have to at least check until 1 character after your first string's last character.
    • Hence your comparison operator must be "<= 1" or "< 2".
  2. Currently your algorithm fails

    • if the first string is completely included in all other strings,
    • or completely included in all other strings except the last character.

In my next answer/post, I will attach iteration optimized code!

Original Bumperbox code PLUS correction (PHP):

function shortest($sports) {
 $i = 1;

 // loop to the length of the first string
 while ($i < strlen($sports[0])) {

  // grab the left most part up to i in length
  // REMARK: Culturally biased towards LTR writing systems. Better say: Grab frombeginning...
  $match = substr($sports[0], 0, $i);

  // loop through all the values in array, and compare if they match
  foreach ($sports as $sport) {
   if ($match != substr($sport, 0, $i)) {
    // didn't match, return the part that did match
    return substr($sport, 0, $i-1);
   }
  }
 $i++; // increase string length
 }
}

function shortestCorrect($sports) {
 $i = 1;
 while ($i <= strlen($sports[0]) + 1) {
  // Grab the string from its beginning with length $i
  $match = substr($sports[0], 0, $i);
  foreach ($sports as $sport) {
   if ($match != substr($sport, 0, $i)) {
    return substr($sport, 0, $i-1);
   }
  }
  $i++;
 }
 // Special case: No mismatch happened until loop end! Thus entire str1 is common prefix!
 return $sports[0];
}

$sports1 = array(
'Softball',
'Softball - Eastern',
'Softball - North Harbour');

$sports2 = array(
'Softball - Wester',
'Softball - Western',
);

$sports3 = array(
'Softball - Western',
'Softball - Western',
);

$sports4 = array(
'Softball - Westerner',
'Softball - Western',
);

echo("Output of the original function:\n"); // Failure scenarios

var_dump(shortest($sports1)); // NULL rather than the correct 'Softball'
var_dump(shortest($sports2)); // NULL rather than the correct 'Softball - Wester'
var_dump(shortest($sports3)); // NULL rather than the correct 'Softball - Western'
var_dump(shortest($sports4)); // Only works if the second string is at least one character longer!

echo("\nOutput of the corrected function:\n"); // All scenarios work
var_dump(shortestCorrect($sports1));
var_dump(shortestCorrect($sports2));
var_dump(shortestCorrect($sports3));
var_dump(shortestCorrect($sports4));
0 votes
answered Aug 1, 2009 by mishoo

Here's an elegant, recursive implementation in JavaScript:

function prefix(strings) {
    switch (strings.length) {

      case 0:
        return "";

      case 1:
        return strings[0];

      case 2:
        // compute the prefix between the two strings
        var a = strings[0],
            b = strings[1],
            n = Math.min(a.length, b.length),
            i = 0;
        while (i < n && a.charAt(i) === b.charAt(i))
            ++i;
        return a.substring(0, i);

      default:
        // return the common prefix of the first string,
        // and the common prefix of the rest of the strings
        return prefix([ strings[0], prefix(strings.slice(1)) ]);
    }
}
0 votes
answered Aug 26, 2009 by chaos

Probably there is some terribly well-regarded algorithm for this, but just off the top of my head, if you know your commonality is going to be on the left-hand side like in your example, you could do way better than your posted methodology by first finding the commonality of the first two strings, and then iterating down the rest of the list, trimming the common string as necessary to achieve commonality or terminating with failure if you trim all the way to nothing.

0 votes
answered Aug 26, 2009 by diogoriba

I think you're on the right way. But instead of incrementing i when all of the string passes, you could do this:

1) Compare the first 2 strings in the array and find out how many common characters they have. Save the common characters in a separate string called maxCommon, for example.

2) Compare the third string w/ maxCommon. If the number of common characters is smaller, trim maxCommon to the characters that match.

3) Repeat and rinse for the rest of the array. At the end of the process, maxCommon will have the string that is common to all of the array elements.

This will add some overhead because you'll need to compare each string w/ maxCommon, but will drastically reduce the number of iterations you'll need to get your results.

0 votes
answered Aug 26, 2009 by gumbo

I would use this:

$prefix = array_shift($array);  // take the first item as initial prefix
$length = strlen($prefix);
// compare the current prefix with the prefix of the same length of the other items
foreach ($array as $item) {
    // check if there is a match; if not, decrease the prefix by one character at a time
    while ($length && substr($item, 0, $length) !== $prefix) {
        $length--;
        $prefix = substr($prefix, 0, -1);
    }
    if (!$length) {
        break;
    }
}

Update   Here’s another solution, iteratively comparing each n-th character of the strings until a mismatch is found:

$pl = 0; // common prefix length
$n = count($array);
$l = strlen($array[0]);
while ($pl < $l) {
    $c = $array[0][$pl];
    for ($i=1; $i<$n; $i++) {
        if ($array[$i][$pl] !== $c) break 2;
    }
    $pl++;
}
$prefix = substr($array[0], 0, $pl);

This is even more efficient as there are only at most numberOfStrings‍·‍commonPrefixLength atomic comparisons.

0 votes
answered Aug 31, 2009 by porg

I implemented @diogoriba algorithm into code, with this result:

  • Finding the common prefix of the first two strings, and then comparing this with all following strings starting from the 3rd, and trim the common string if nothing common is found, wins in situations where there is more in common in the prefixes than different.
  • But bumperbox's original algorithm (except the bugfixes) wins where the strings have less in common in their prefix than different. Details in the code comments!

Another idea I implemented:

First check for the shortest string in the array, and use this for comparison rather than simply the first string. In the code, this is implemented with the custom written function arrayStrLenMin().

  • Can bring down iterations dramatically, but the function arrayStrLenMin() may itself cause ( more or less) iterations.
  • Simply starting with the length of first string in array seems quite clumsy, but may turn out effective, if arrayStrLenMin() needs many iterations.

Get the maximum common prefix of strings in an array with as little iterations as possible (PHP)

Code + Extensive Testing + Remarks:

function arrayStrLenMin ($arr, $strictMode = false, $forLoop = false) {
    $errArrZeroLength = -1; // Return value for error: Array is empty
    $errOtherType = -2;     // Return value for error: Found other type (than string in array)
    $errStrNone = -3;       // Return value for error: No strings found (in array)

    $arrLength = count($arr);
    if ($arrLength <= 0 ) { return $errArrZeroLength; }
    $cur = 0;

    foreach ($arr as $key => $val) {
        if (is_string($val)) {
            $min = strlen($val);
            $strFirstFound = $key;
            // echo("Key\tLength / Notification / Error\n");
            // echo("$key\tFound first string member at key with length: $min!\n");
            break;
        }
        else if ($strictMode) { return $errOtherType; } // At least 1 type other than string was found.
    }
    if (! isset($min)) { return $errStrNone; } // No string was found in array.

    // SpeedRatio of foreach/for is approximately 2/1 as dicussed at:
    // http://juliusbeckmann.de/blog/php-foreach-vs-while-vs-for-the-loop-battle.html

    // If $strFirstFound is found within the first 1/SpeedRatio (=0.5) of the array, "foreach" is faster!

    if (! $forLoop) {
        foreach ($arr as $key => $val) {
            if (is_string($val)) {
                $cur = strlen($val);
                // echo("$key\t$cur\n");
                if ($cur == 0) { return $cur; } // 0 is the shortest possible string, so we can abort here.
                if ($cur < $min) { $min = $cur; }
            }
        // else { echo("$key\tNo string!\n"); }
        }
    }

    // If $strFirstFound is found after the first 1/SpeedRatio (=0.5) of the array, "for" is faster!

    else {
        for ($i = $strFirstFound + 1; $i < $arrLength; $i++) {
            if (is_string($arr[$i])) {
                $cur = strlen($arr[$i]);
                // echo("$i\t$cur\n");
                if ($cur == 0) { return $cur; } // 0 is the shortest possible string, so we can abort here.
                if ($cur < $min) { $min = $cur; }
            }
            // else { echo("$i\tNo string!\n"); }
        }
    }

    return $min;
}

function strCommonPrefixByStr($arr, $strFindShortestFirst = false) {
    $arrLength = count($arr);
    if ($arrLength < 2) { return false; }

    // Determine loop length
    /// Find shortest string in array: Can bring down iterations dramatically, but the function arrayStrLenMin() itself can cause ( more or less) iterations.
    if ($strFindShortestFirst) { $end = arrayStrLenMin($arr, true); }
    /// Simply start with length of first string in array: Seems quite clumsy, but may turn out effective, if arrayStrLenMin() needs many iterations.
    else { $end = strlen($arr[0]); }

    for ($i = 1; $i <= $end + 1; $i++) {
        // Grab the part from 0 up to $i
        $commonStrMax = substr($arr[0], 0, $i);
        echo("Match: $i\t$commonStrMax\n");
        // Loop through all the values in array, and compare if they match
        foreach ($arr as $key => $str) {
            echo("  Str: $key\t$str\n");
            // Didn't match, return the part that did match
            if ($commonStrMax != substr($str, 0, $i)) {
                    return substr($commonStrMax, 0, $i-1);
            }
        }
    }
    // Special case: No mismatch (hence no return) happened until loop end!
    return $commonStrMax; // Thus entire first common string is the common prefix!
}

function strCommonPrefixByChar($arr, $strFindShortestFirst = false) {
    $arrLength = count($arr);
    if ($arrLength < 2) { return false; }

    // Determine loop length
    /// Find shortest string in array: Can bring down iterations dramatically, but the function arrayStrLenMin() itself can cause ( more or less) iterations.
    if ($strFindShortestFirst) { $end = arrayStrLenMin($arr, true); }
    /// Simply start with length of first string in array: Seems quite clumsy, but may turn out effective, if arrayStrLenMin() needs many iterations.
    else { $end = strlen($arr[0]); }

    for ($i = 0 ; $i <= $end + 1; $i++) {
        // Grab char $i
        $char = substr($arr[0], $i, 1);
        echo("Match: $i\t"); echo(str_pad($char, $i+1, " ", STR_PAD_LEFT)); echo("\n");
        // Loop through all the values in array, and compare if they match
        foreach ($arr as $key => $str) {
            echo("  Str: $key\t$str\n");
            // Didn't match, return the part that did match
            if ($char != $str[$i]) { // Same functionality as ($char != substr($str, $i, 1)). Same efficiency?
                    return substr($arr[0], 0, $i);
            }
        }
    }
    // Special case: No mismatch (hence no return) happened until loop end!
    return substr($arr[0], 0, $end); // Thus entire first common string is the common prefix!
}


function strCommonPrefixByNeighbour($arr) {
    $arrLength = count($arr);
    if ($arrLength < 2) { return false; }

    /// Get the common string prefix of the first 2 strings
    $strCommonMax = strCommonPrefixByChar(array($arr[0], $arr[1]));
    if ($strCommonMax === false) { return false; }
    if ($strCommonMax == "") { return ""; }
    $strCommonMaxLength = strlen($strCommonMax);

    /// Now start looping from the 3rd string
    echo("-----\n");
    for ($i = 2; ($i < $arrLength) && ($strCommonMaxLength >= 1); $i++ ) {
        echo("  STR: $i\t{$arr[$i]}\n");

        /// Compare the maximum common string with the next neighbour

        /*
        //// Compare by char: Method unsuitable!

        // Iterate from string end to string beginning
        for ($ii = $strCommonMaxLength - 1; $ii >= 0; $ii--) {
            echo("Match: $ii\t"); echo(str_pad($arr[$i][$ii], $ii+1, " ", STR_PAD_LEFT)); echo("\n");
            // If you find the first mismatch from the end, break.
            if ($arr[$i][$ii] != $strCommonMax[$ii]) {
                $strCommonMaxLength = $ii - 1; break;
                // BUT!!! We may falsely assume that the string from the first mismatch until the begining match! This new string neighbour string is completely "unexplored land", there might be differing chars closer to the beginning. This method is not suitable. Better use string comparison than char comparison.
            }
        }
        */

        //// Compare by string

        for ($ii = $strCommonMaxLength; $ii > 0; $ii--) {
            echo("MATCH: $ii\t$strCommonMax\n");
            if (substr($arr[$i],0,$ii) == $strCommonMax) {
                break;
            }
            else {
                $strCommonMax = substr($strCommonMax,0,$ii - 1);
                $strCommonMaxLength--;
            }
        }
    }
    return substr($arr[0], 0, $s
0 votes
answered Jan 30, 2010 by sri

How about something like this? It can be further optimised by not having to check the lengths of the strings if we can use the null terminating character (but I am assuming python strings have length cached somewhere?)

def find_common_prefix_len(strings):
    """
    Given a list of strings, finds the length common prefix in all of them.
    So
    apple
    applet
    application
    would return 3
    """
    prefix          = 0
    curr_index      = -1
    num_strings     = len(strings)
    string_lengths  = [len(s) for s in strings]
    while True:
        curr_index  += 1
        ch_in_si    = None
        for si in xrange(0, num_strings):
            if curr_index >= string_lengths[si]:
                return prefix
            else:
                if si == 0:
                    ch_in_si = strings[0][curr_index]
                elif strings[si][curr_index] != ch_in_si:
                    return prefix
        prefix += 1
0 votes
answered Jan 6, 2011 by nanite

Short and sweet version, perhaps not the most efficient:

/// Return length of longest common prefix in an array of strings.
function _commonPrefix($array) {
    if(count($array) < 2) {
        if(count($array) == 0)
            return false; // empty array: undefined prefix
        else
            return strlen($array[0]); // 1 element: trivial case
    }
    $len = max(array_map('strlen',$array)); // initial upper limit: max length of all strings.
    $prevval = reset($array);
    while(($newval = next($array)) !== FALSE) {
        for($j = 0 ; $j < $len ; $j += 1)
            if($newval[$j] != $prevval[$j])
                $len = $j;
        $prevval = $newval;
    }
    return $len;
}

// TEST CASE:
$arr = array('/var/yam/yamyam/','/var/yam/bloorg','/var/yar/sdoo');
print_r($arr);
$plen = _commonprefix($arr);
$pstr = substr($arr[0],0,$plen);
echo "Res: $plen\n";
echo "==> ".$pstr."\n";
echo "dir: ".dirname($pstr.'aaaa')."\n";

Output of the test case:

Array
(
    [0] => /var/yam/yamyam/
    [1] => /var/yam/bloorg
    [2] => /var/yar/sdoo
)
Res: 7
==> /var/ya
dir: /var
0 votes
answered Jan 8, 2011 by seba-p

I would use a recursive algorithm like this:

1 - get the first string in the array 2 - call the recursive prefix method with the first string as a param 3 - if prefix is empty return no prefix 4 - loop through all the strings in the array 4.1 - if any of the strings does not start with the prefix 4.1.1 - call recursive prefix method with prefix - 1 as a param 4.2 return prefix

Welcome to Q&A, where you can ask questions and receive answers from other members of the community.
Website Online Counter

...