
This module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and it can produce difference information in various formats, including HTML and context and unified diffs.

difflib.get_close_matches(word, possibilities) returns a list of the best “good enough” matches.
word is a sequence for which close matches are desired (typically a string),
and possibilities is a list of sequences against which to match word (typically a list of strings).
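
As a small illustration of the description above, the sketch below calls get_close_matches on a sample word list. The word list and the variable names are only illustrative; n and cutoff are the function's optional parameters (defaults 3 and 0.6).

```python
import difflib

# Illustrative candidate list (not from the original post).
candidates = ["ape", "apple", "peach", "puppy"]

# Default behaviour: at most n=3 matches whose similarity is >= cutoff=0.6,
# sorted from most similar to least similar.
matches = difflib.get_close_matches("appel", candidates)
print(matches)  # ['apple', 'ape']

# Asking for a single result keeps only the best candidate.
best = difflib.get_close_matches("appel", candidates, n=1)
print(best)  # ['apple']
```

An empty list is returned when no candidate reaches the cutoff, which is why the result can be used directly in an if test.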

[-] difflib is very slow if you are going to work with a large number of documents.
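
One way to reduce the cost when a single string is compared against many others: the difflib documentation notes that SequenceMatcher caches information about its second sequence, so it can be set once with set_seq2() while set_seq1() is called for each candidate. The helper name rank_by_similarity and the sample strings below are hypothetical, and this is only a sketch, not a benchmark.

```python
import difflib


def rank_by_similarity(query, documents):
    """Return (ratio, document) pairs, most similar first.

    Hypothetical helper, written only to illustrate reuse of one matcher.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(query)      # analysed once, then cached
    scored = []
    for doc in documents:
        matcher.set_seq1(doc)    # only the first sequence changes
        scored.append((matcher.ratio(), doc))
    return sorted(scored, reverse=True)


docs = ["abcd", "bcde", "wxyz"]
ranking = rank_by_similarity("abcd", docs)
print(ranking[0])  # (1.0, 'abcd')
```

This does not change the worst-case complexity; it only avoids repeating the preprocessing of the query string for every document.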

  • difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes twice the
    number of matching characters divided by the total number of characters in the two strings.

  • Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits (insertions, deletions, or substitutions) needed to transform one string into the other.
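
The SequenceMatcher formula above (twice the number of matching characters divided by the total number of characters) can be checked by hand. The sketch below uses two short sample strings chosen for illustration and recomputes the ratio from the matching blocks.

```python
import difflib

a, b = "abcd", "bcde"
matcher = difflib.SequenceMatcher(None, a, b)

# M: number of matching characters, i.e. the summed size of all matching
# blocks (the last block is a zero-length sentinel and adds nothing).
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of characters in both strings.
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total, manual_ratio)  # 3 8 0.75
print(matcher.ratio())               # 0.75
```

Here the only matching block is "bcd", so M = 3 and T = 8, giving 2 * 3 / 8 = 0.75.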

Complexity:

  • SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent
    in a complicated way on how many elements the sequences have in common (from the difflib documentation).

  • Levenshtein is O(m*n), where n and m are the lengths of the two input strings.
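
The O(m*n) bound for Levenshtein comes from filling a table with one cell per pair of prefixes. The function below is a minimal pure-Python sketch of that dynamic programming approach; it is not the implementation used by the Levenshtein module, and the name levenshtein_distance is chosen here for illustration.

```python
def levenshtein_distance(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    # previous[j]: distance between the prefix of s handled so far
    # (initially empty) and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, ch_s in enumerate(s, start=1):
        current = [i]  # turning s[:i] into "" takes i deletions
        for j, ch_t in enumerate(t, start=1):
            cost = 0 if ch_s == ch_t else 1
            current.append(min(
                previous[j] + 1,         # delete ch_s
                current[j - 1] + 1,      # insert ch_t
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

The two nested loops visit each of the m*n cells once, and keeping only two rows limits the memory to the length of the second string.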

Performance:

  • According to the source code of the Levenshtein module: Levenshtein has some overlap with difflib (SequenceMatcher).
    It supports only strings, not arbitrary sequence types, but on the other hand it’s much faster.
import difflib


def difflib_similarity_old(words1, words2):
    """
    Return the fraction (a float) of words in `words1` that have a close
    match in `words2`, using `difflib.get_close_matches`.
    The highest possible score is 1.0.
    """
    list_words1 = words1.split()
    list_words2 = words2.split()

    # Avoid dividing by zero when `words1` contains no words.
    if not list_words1:
        return 0.0

    score = 0
    for word in list_words1:
        # A non-empty result means at least one word is similar enough.
        if difflib.get_close_matches(word, list_words2):
            score += 1

    return float(score) / len(list_words1)


def difflib_similarity(words1, words2):
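    """
    Return the similarity ratio of the two strings as a float between
    0 and 1, using `difflib.SequenceMatcher`. The strings are compared
    character by character, not word by word.
    """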
    return difflib.SequenceMatcher(None, words1, words2).ratio()


if __name__ == '__main__':
    words1 = 'saya mencoba untuk lari sekencang mungkin'
    words2 = 'anda lari dari tiang dengan sangat kencang'
    print(difflib_similarity(words1, words2))  # approximately 0.5

References:
- https://docs.python.org/3/library/difflib.html
- https://stackoverflow.com/a/11277680/6396981
- https://stackoverflow.com/q/6690739/6396981
