Python String Matching: Strict Matching with Regex & Fuzzy Matching with fuzzywuzzy

北落师门XY · Published 2021-05-25 16:54:18

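1. Regular Expressions

A brief introduction to regex

Runoob has a beginner tutorial: Python Regular Expressions | Runoob (Python 正则表达式 | 菜鸟教程)

match: matches only at the start of the string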

import re
# match
string = 'hello world,she said.'
pat = r'he\S*'   # 'he' followed by zero or more non-whitespace characters
res = re.match(pat, string)
if res is not None:
    print(res.group())   # hello
else:
    print('empty!')

search: returns the first match found anywhere in the string

string = '-hello world,she said.'
pat = r'he\S*'   # 'he' followed by zero or more non-whitespace characters
res = re.search(pat, string)
if res is not None:
    print(res.group())   # hello
else:
    print('empty!')
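The difference between match and search is easiest to see by running both on the same input. The following is a minimal sketch using the same pattern as above; re.fullmatch is not covered in the original text and is included here only for comparison.

```python
import re

pat = r'he\S*'          # 'he' followed by any run of non-whitespace characters
text = '-hello world,she said.'

# match only succeeds if the pattern matches at index 0
print(re.match(pat, text))               # None: the string starts with '-'

# search scans forward and returns the first match anywhere
found = re.search(pat, text)
print(found.group())                     # hello

# fullmatch requires the pattern to cover the entire string
print(re.fullmatch(pat, 'hello') is not None)   # True
print(re.fullmatch(pat, 'hello world'))         # None: ' world' is left over
```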

findall: returns all non-overlapping matches as a list

string = '-hello world,she said.'
pat = r'he\S*'   # 'he' followed by zero or more non-whitespace characters
res = re.findall(pat, string)
print(res)   # ['hello', 'he']

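compile: compiles a pattern string into a regular expression object

The compiled object provides match, search, and findall methods.

These methods can also search only part of a string; see usage 2.

string = '-hello world,she said.'
pat = r'he\S*'   # 'he' followed by zero or more non-whitespace characters
# Usage 1: equivalent to res = re.findall(pat, string)
res = re.compile(pat).findall(string)
print(res)   # ['hello', 'he']
# Usage 2: search only part of the string instead of all of it; unlike re.findall,
# the compiled pattern's findall also accepts start and end positions (pos, endpos)
res = re.compile(pat).findall(string, 3, 20)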
print(res)   # ['he']
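A common reason to compile is to reuse one pattern across many strings. The sketch below assumes a hypothetical list of input lines (not from the original post) and also shows that pos/endpos keep indices relative to the original string, which slicing would not.

```python
import re

# Compile once, reuse many times
HE_WORD = re.compile(r'he\S*')

lines = ['-hello world,she said.', 'the hen', 'no hit']   # illustrative inputs
results = [HE_WORD.findall(line) for line in lines]
print(results)   # [['hello', 'he'], ['he', 'hen'], []]

# pos/endpos restrict the search window without slicing the string,
# so the reported span still refers to the original string
text = '-hello world,she said.'
m = HE_WORD.search(text, 3, 20)
print(m.group(), m.span())   # he (14, 16)
```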

span, start, end: return the indices of the match

string = '-hello world,she said.'
pat = r'he\S*'   # 'he' followed by zero or more non-whitespace characters
res = re.search(pat, string)
print(res.span())   # (1, 6)
print(res.start())   # 1
print(res.end())   # 6

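sub: replaces whatever the pattern matches

string = '-hello world,she said.'
pat = r's.?e'   # 's', then at most one arbitrary character, then 'e'
res = re.sub(pat, 'Lily', string)   # replace every match of pat with Lily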
print(res)   # -hello world,Lily said.
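Beyond a fixed replacement string, re.sub also accepts a count limit, backreferences, and a callable. The sketch below goes slightly past what the original post shows and uses the same sample string.

```python
import re

text = '-hello world,she said.'

# count limits the number of replacements
once = re.sub(r'l', 'L', text, count=1)
print(once)        # -heLlo world,she said.

# \1 in the replacement refers back to the first captured group
wrapped = re.sub(r'(s.?e)', r'[\1]', text)
print(wrapped)     # -hello world,[she] said.

# A function receives each match object and returns the replacement text
upper = re.sub(r'he\S*', lambda m: m.group().upper(), text)
print(upper)       # -HELLO world,sHE said.

# subn also reports how many replacements were made
print(re.subn(r'o', '0', text))   # ('-hell0 w0rld,she said.', 2)
```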

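split: splits the string at each match and returns a list

string = '-hello world,she said.'
pat = r's.?e'   # 's', then at most one arbitrary character, then 'e'
res = re.split(pat, string)   # split on pat; a more powerful version of str.split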
print(res)   # ['-hello world,', ' said.']

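groups(), group(), group(1), group(2)

groups() returns all captured groups as a tuple (group(1), group(2), ...)
group() is the same as group(0) and returns the entire matched string
group(1) returns what the first pair of parentheses matched
group(2) returns what the second pair of parentheses matched

string = '-hello world,she said.'
pat = r'(he\S*) (w\S*)'   # group 1: token starting with 'he'; group 2: token starting with 'w'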
res = re.search(pat, string)
print(res.groups())   # ('hello', 'world,she')
print(res.group())   # hello world,she
print(res.group(0))   # hello world,she
print(res.group(1))   # hello
print(res.group(2))   # world,she
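When a pattern has several groups, numbered access gets hard to read. A possible alternative, not used in the original post, is named groups via (?P<name>...); the group names below are chosen only for illustration.

```python
import re

text = '-hello world,she said.'
# 'first' and 'second' are illustrative group names
pat = r'(?P<first>he\S*) (?P<second>w\S*)'
m = re.search(pat, text)

print(m.group('first'))    # hello
print(m.groupdict())       # {'first': 'hello', 'second': 'world,she'}
print(m.span('second'))    # (7, 16)
print(m.group(1, 2))       # ('hello', 'world,she'): numbered access still works
```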

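? as a lazy (non-greedy) quantifier

*?	repeat any number of times, but as few times as possible
+?	repeat one or more times, but as few times as possible
??	repeat zero or one time, but as few times as possible
{n,m}?	repeat n to m times, but as few times as possible
{n,}?	repeat n or more times, but as few times as possible

string = '-hello world,she said.'
res = re.search('s.*', string)    # * is greedy
print(res.group())   # she said.
res = re.search('s.*?', string)   # *? is non-greedy (lazy): the shortest possible match
print(res.group())   # s
res = re.search('s.?', string)   # a plain ? means zero or one occurrence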
print(res.group())   # sh
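In the example above, 's.*?' stops at 's' because nothing after the lazy quantifier forces it to expand. Lazy quantifiers become useful when something must follow them. A small sketch, using a made-up HTML-like string:

```python
import re

html = '<b>one</b> and <b>two</b>'   # illustrative input

# Greedy: .* runs to the last </b> in the string
greedy = re.findall(r'<b>(.*)</b>', html)
print(greedy)   # ['one</b> and <b>two']

# Lazy: .*? stops at the nearest </b>
lazy = re.findall(r'<b>(.*?)</b>', html)
print(lazy)     # ['one', 'two']
```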

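Lookaround assertions

(?<=...) positive lookbehind: the match must be preceded by the given pattern
(?<!...) negative lookbehind: the match must not be preceded by the given pattern
(?=...) positive lookahead: the match must be followed by the given pattern
(?!...) negative lookahead: the match must not be followed by the given pattern

string = ['12/13', '112/132', 'a12/13', 'a12/131']
for i in string:
    # positive lookbehind and positive lookahead: a digit is required both before and after
    res = re.search(r'(?<=\d)\d{2}/\d{2}(?=\d)', i)
    if res is not None:
        print(i, res.group())
    else:
        print(i, 'empty')

Output: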

12/13 empty
112/132 12/13
a12/13 empty
a12/131 empty
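The negative forms can be sketched with the same inputs. Here the goal is inverted: accept dd/dd only when no digit sits directly before or after it. The variable names are illustrative.

```python
import re

samples = ['12/13', '112/132', 'a12/13', 'a12/131']

# Negative lookbehind and negative lookahead:
# no digit may appear directly before or directly after the match
standalone = re.compile(r'(?<!\d)\d{2}/\d{2}(?!\d)')

results = {}
for s in samples:
    m = standalone.search(s)
    results[s] = m.group() if m else None
    print(s, results[s])
# 12/13 12/13
# 112/132 None
# a12/13 12/13
# a12/131 None
```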

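Error: look-behind requires fixed-width pattern

This error means that a lookbehind must have a fixed width: every alternative inside it has to match the same number of characters. For example:

'(?<!(母亲|父亲|.亲))(?:姓名|姓名)(?:[:：\S]*)([\u4e00-\u9fa5]){1,2}' is valid: each alternative in the lookbehind is two characters wide

'(?<!(母亲|父亲|亲))(?:姓名|姓名)(?:[:：\S]*)([\u4e00-\u9fa5]){1,2}' is invalid: 亲 is one character wide while the other alternatives are two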
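The error can be reproduced with a shortened pattern, and one possible workaround is to give each alternative its own lookbehind so that every lookbehind is fixed-width on its own. This is a sketch with simplified patterns and made-up test strings, not the full pattern above. (In this particular case (?<!亲) alone would already cover 母亲 and 父亲; the stacked form is shown because it generalizes.)

```python
import re

# Alternatives of different widths (2, 2, and 1 characters) are rejected by re
try:
    re.compile(r'(?<!母亲|父亲|亲)姓名')
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)   # mentions "look-behind requires fixed-width pattern"

# Workaround sketch: one fixed-width lookbehind per alternative
pat = re.compile(r'(?<!母亲)(?<!父亲)(?<!亲)姓名')
blocked = pat.search('母亲姓名')    # preceded by 母亲, so no match
allowed = pat.search('本人姓名')    # not preceded by any excluded word
print(blocked)                      # None
print(allowed.span())               # (2, 4)
```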

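{}: specifies the number of repetitions

{minlen,maxlen} matches between minlen and maxlen repetitions
{minlen,} matches minlen or more repetitions, whereas {minlen} matches exactly minlen repetitions

res = re.search(r'1\d{2,3}', '01234567')   # '1' followed by 2 to 3 digits (greedy, so 3 are taken)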
print(res.group())   # 1234
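The different brace forms can be compared side by side on the same input. A short sketch follows; note that Python's re only treats the braces as a quantifier when they contain no spaces.

```python
import re

digits = '01234567'

exactly = re.search(r'1\d{2}', digits).group()
at_least = re.search(r'1\d{2,}', digits).group()
between = re.search(r'1\d{2,3}', digits).group()
at_most = re.search(r'1\d{,3}', digits).group()

print(exactly)    # 123      exactly 2 digits
print(at_least)   # 1234567  2 or more digits
print(between)    # 1234     2 to 3 digits
print(at_most)    # 1234     0 to 3 digits

# With a space inside, '{2, 3}' is matched as literal text, not as a quantifier
with_space = re.search(r'1\d{2, 3}', digits)
print(with_space)  # None
```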

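finditer: returns all matches as an iterator; can also be used to match overlapping regions

# Method 1: finditer with a capture group inside a lookahead
import re
a = '10013424.....'
b = re.finditer(r'(?=(1.*))', a)
print([i.group(1) for i in b])

# Method 2: the third-party regex library and its overlapped parameter
import regex
c = regex.findall('1.*', a, overlapped=True)
print(c)

# Both print ['10013424.....', '13424.....']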
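The lookahead trick works because a lookahead matches without consuming characters, so the scanner retries at every position. A minimal standard-library sketch with made-up inputs:

```python
import re

text = 'aaaa'

# Ordinary findall consumes each match, so matches cannot overlap
plain = re.findall(r'aa', text)
print(plain)        # ['aa', 'aa']

# A lookahead is zero-width, so every start position is tested
starts = [m.start() for m in re.finditer(r'(?=aa)', text)]
print(starts)       # [0, 1, 2]

# Capturing inside the lookahead recovers the overlapping text itself
pairs = [m.group(1) for m in re.finditer(r'(?=(\w\w))', 'abcd')]
print(pairs)        # ['ab', 'bc', 'cd']
```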

Counting how many target characters a string contains

import re
string = '萍萍喜欢在平安夜吃苹果'
tmp = re.findall('萍|平|苹', string)
print(tmp)   # ['萍', '萍', '平', '苹']
print(len(tmp))   # 4
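If a per-character breakdown is needed rather than just the total, the findall result can be fed to collections.Counter. This is one possible extension of the example above, using a character class in place of the alternation.

```python
import re
from collections import Counter

text = '萍萍喜欢在平安夜吃苹果'

# [萍平苹] matches any one of the three target characters
hits = re.findall(r'[萍平苹]', text)
counts = Counter(hits)

print(counts['萍'], counts['平'], counts['苹'])   # 2 1 1
print(sum(counts.values()))                        # 4
```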

[]: alternatives (matches any one of the listed characters)

txt = '压缩后的图像尺寸不变'
res = re.search(r"[后厚猴].*[吃尺迟]", txt)
print(res)   # <re.Match object; span=(2, 7), match='后的图像尺'>  (shown as _sre.SRE_Match before Python 3.7)
res = re.search(r"[后厚猴][的图像舒服发生阿娥]*[吃尺迟]", txt)
print(res)   # <re.Match object; span=(2, 7), match='后的图像尺'>

2. Fuzzy Matching with fuzzywuzzy

Edit distance

import Levenshtein
dis = Levenshtein.distance('100', '305')
print(dis)   # 2
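To make clear what this number means, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance with unit costs. The function name edit_distance is hypothetical, and this is not the implementation used by the Levenshtein package; it only illustrates the definition.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] is the distance between the already processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]   # turning a[:i] into '' takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]

print(edit_distance('100', '305'))          # 2: replace '1' with '3' and the last '0' with '5'
print(edit_distance('kitten', 'sitting'))   # 3
```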

The process and fuzz modules

from fuzzywuzzy import process
from fuzzywuzzy import fuzz
txt = '10.0'
words = ['asd', '10。00', '10.00', '234']
# Return the top-N best matches from the list
top_2 = process.extract(txt, words, limit=2)   # take the 2 most similar entries
print(top_2)   # [('10。00', 89), ('10.00', 89)]

# Return the single best match; when two scores tie, the first one wins
top_1 = process.extractOne(txt, words)
print(top_1)   # ('10。00', 89)

# Substring (partial) matching
a = [fuzz.partial_ratio(i, txt) for i in words]
print(a)   # [0, 75, 100, 0]

# Whole-string matching
a = [fuzz.ratio(i, txt) for i in words]
print(a)   # [0, 67, 89, 0]

# Order-insensitive matching; it only helps when the text consists of several
# space-separated words, so it is clearly designed for languages like English
a = fuzz.token_sort_ratio('my name is Lily', 'Lily is my name')
print(a)   # 100

# Duplicate-insensitive matching; again this requires several space-separated words
a = fuzz.token_set_ratio('i love summer ！', '！ ！ ！ i love summer ！ ！ ！')
print(a)   # 100
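To give a feel for what a ratio-style score and token sorting do, here is a rough standard-library approximation built on difflib. The helper names simple_ratio and token_sort_similarity are hypothetical; this is not fuzzywuzzy's actual implementation and its scores may differ from the library's in some cases.

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100 based on matching character blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_similarity(a: str, b: str) -> int:
    """Lowercase, split on whitespace, sort the tokens, then compare."""
    def normalize(s: str) -> str:
        return ' '.join(sorted(s.lower().split()))
    return simple_ratio(normalize(a), normalize(b))

print(simple_ratio('10.00', '10.0'))                                   # 89
print(token_sort_similarity('my name is Lily', 'Lily is my name'))     # 100

# Without sorting, the same two sentences score lower because the word order differs
unsorted_score = simple_ratio('my name is Lily', 'Lily is my name')
print(unsorted_score)
```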
