Common Python String Operations

Webmaster

December 6, 2022, 14:02

Preface

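When processing text data, we usually need to perform a variety of operations on it, such as appending new strings to the text, splitting it into several strings, or changing letter case. Beyond these we sometimes need more advanced text parsing or other techniques, but the most common tasks are splitting text into sentences or words and deleting or replacing particular words.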

String operations

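Next, we walk through some basic string operations using examples. We first define a piece of text and split it, then apply some common edits, and finally join the edited strings back together.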

Common string operations

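After defining the input text, split it into individual words. Called with no arguments, the split() method splits on runs of whitespace (spaces, tabs, and newlines), so the resulting words contain no spaces or newlines; if a separator is passed explicitly, the words will not contain that separator either: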

>>> input_text = 'Never regret falling in love with you. The longer you go, the more you cherish it. If time can flow back to the past, I must make a love song with you again, because you are the only one in my life.'
>>> words = input_text.split()
>>> words
['Never', 'regret', 'falling', 'in', 'love', 'with', 'you.', 'The', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.', 'If', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'I', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.']
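The difference between the default whitespace split and an explicit separator is easy to miss, so here is a minimal sketch. The sample strings and variable names are illustrative assumptions and do not come from the article; only the standard str methods split() and splitlines() are used.

```python
# Illustrative sample text (an assumption, not part of the article).
sample = "name: Ada Lovelace\nrole:  mathematician\n"

# With no arguments, split() breaks on any run of whitespace
# and never returns empty strings.
tokens = sample.split()

# With an explicit separator, consecutive separators yield empty strings.
by_space = "a  b".split(" ")

# maxsplit limits the number of splits, counting from the left.
key, value = "name: Ada Lovelace".split(": ", maxsplit=1)

# splitlines() splits on line boundaries and drops the line endings.
rows = sample.splitlines()

print(tokens)      # ['name:', 'Ada', 'Lovelace', 'role:', 'mathematician']
print(by_space)    # ['a', '', 'b']
print(key, value)  # name Ada Lovelace
print(rows)        # ['name: Ada Lovelace', 'role:  mathematician']
```

For free-form prose the no-argument form is usually the more convenient choice, because repeated spaces do not produce empty words.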

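Replace every uppercase letter in the sentence with the character “x”. We iterate over every character of every word and emit “x” whenever the character is an uppercase letter. This is done with two nested comprehensions: an outer list comprehension that runs over the list of words, and an inner generator expression that runs over the characters of each word. A conditional expression replaces a character only when it is uppercase: 'x' if w.isupper() else w for w in word. Finally, the characters of each word are joined back together with the join() method: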

>>> replaced = [''.join('x' if w.isupper() else w for w in word) for word in words]
>>> replaced
['xever', 'regret', 'falling', 'in', 'love', 'with', 'you.', 'xhe', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.', 'xf', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'x', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.']
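The same replacement can also be applied to a whole string at once. The sketch below wraps the conditional expression in a helper and compares it with a regular-expression variant. The function name mask_uppercase and the sample sentences are assumptions made for illustration; they are not from the article.

```python
import re


def mask_uppercase(text, mask="x"):
    """Replace every uppercase character in text with mask (hypothetical helper)."""
    return "".join(mask if ch.isupper() else ch for ch in text)


sentence = "Never regret. If I must, I will."

masked = mask_uppercase(sentence)
# The character class [A-Z] matches ASCII capitals only,
# whereas str.isupper() also recognises non-ASCII uppercase letters.
masked_ascii = re.sub(r"[A-Z]", "x", sentence)

print(masked)        # xever regret. xf x must, x will.
print(masked_ascii)  # xever regret. xf x must, x will.

accented = "\u00c9mile"                   # 'Émile'
print(mask_uppercase(accented))           # xmile
print(re.sub(r"[A-Z]", "x", accented))    # Émile (unchanged)
```

Which variant is appropriate depends on whether non-ASCII capitals should be masked as well.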

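Encode the text, converting it to plain ASCII. This matters in practice, because text that is not encoded appropriately can cause unexpected errors when displayed. Each word is encoded to an ASCII byte sequence and then decoded back to a Python str, with the errors parameter set to 'replace' so that any character that cannot be represented in ASCII is replaced by “?” instead of raising an error. The sample text here is already pure ASCII, so the output is unchanged: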

>>> ascii_text = [word.encode('ascii',errors='replace').decode('ascii') for word in replaced]
>>> ascii_text
['xever', 'regret', 'falling', 'in', 'love', 'with', 'you.', 'xhe', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.', 'xf', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'x', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.']
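Because the sample text contains only ASCII characters, the effect of the errors parameter is not visible in the output above. A small sketch with a non-ASCII word (an illustrative assumption, not from the article) shows how a few of the standard error handlers behave:

```python
word = "caf\u00e9"  # 'café'; the final character is not ASCII

# 'replace' substitutes '?' for each character that cannot be encoded.
with_replace = word.encode("ascii", errors="replace").decode("ascii")

# 'ignore' silently drops such characters.
with_ignore = word.encode("ascii", errors="ignore").decode("ascii")

# 'backslashreplace' keeps the information as an escape sequence.
with_escape = word.encode("ascii", errors="backslashreplace").decode("ascii")

# The default handler, 'strict', raises UnicodeEncodeError.
try:
    word.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(with_replace)   # caf?
print(with_ignore)    # caf
print(with_escape)    # caf\xe9
print(strict_failed)  # True
```

Note that 'replace' and 'ignore' both lose information, so they suit display or logging better than storage.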

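Group the words into lines of at most 80 characters each. First append an extra newline to every word that ends with a period, which marks the end of a group. Then start a new line and add the words one by one: if adding a word would push the line past 80 characters, the line is ended and a new one begins, and a new line also begins whenever the current line ends with a newline. An extra space is added before each word to separate it from the previous one: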

>>> newlines = [word + '\n' if word.endswith('.') else word for word in ascii_text]
>>> newlines
['xever', 'regret', 'falling', 'in', 'love', 'with', 'you.\n', 'xhe', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.\n', 'xf', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'x', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.\n']
>>> line_size = 80
>>> lines = []
>>> line = ''
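>>> for word in newlines:
...     if line.endswith('\n') or len(line) + len(word) + 1 > line_size:
...         lines.append(line)
...         line = ''
...     line = line + ' ' + word
...
>>> lines.append(line)  # the loop leaves the last line unflushed, so add it here

Finally, strip the surrounding whitespace from each line, format it in title case (the first letter of every word capitalized), and join the lines into a single text:

>>> lines = [line.strip().title() for line in lines]
>>> result = '\n'.join(lines)
>>> print(result)
Xever Regret Falling In Love With You.
Xhe Longer You Go, The More You Cherish It.
Xf Time Can Flow Back To The Past, X Must Make A Love Song With You Again,
Because You Are The Only One In My Life.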
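The manual grouping above can also be approximated with the standard library's textwrap module. The following is a sketch of an alternative, not the article's method: it reuses the article's input text, while the sentence-splitting regular expression and the variable names are assumptions chosen for illustration.

```python
import re
import textwrap

text = (
    "Never regret falling in love with you. "
    "The longer you go, the more you cherish it. "
    "If time can flow back to the past, I must make a love song with you again, "
    "because you are the only one in my life."
)

# Plain wrapping: lines of at most 80 characters, ignoring sentence boundaries.
wrapped = textwrap.wrap(text, width=80)

# To start a new line after every sentence, as the loop above does,
# split after each period first and wrap every sentence separately.
sentences = re.split(r"(?<=\.)\s+", text)
by_sentence = [line for s in sentences for line in textwrap.wrap(s, width=80)]

for line in by_sentence:
    print(line)
```

textwrap.fill(text, width=80) returns the same lines as textwrap.wrap already joined with newlines, which is convenient when only the final string is needed.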

Other string operations

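Beyond the operations above, several other string operations are useful. For example, strings can be sliced like any other sequence: 'love'[0:3] returns 'lov'. Similar to the title() method, the upper() and lower() methods return uppercase and lowercase versions of a string, respectively: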

>>> print('unicode'[0:3])
uni
>>> print('unicode'.upper())
UNICODE
>>> print('UNicode'.lower())
unicode
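A few related operations are worth knowing alongside these. The sketch below shows further slice forms and case methods, and one caveat of title(): it treats an apostrophe as a word boundary. The sample strings are illustrative assumptions; string.capwords is a standard library function that splits on whitespace only.

```python
import string

word = "unicode"

first_three = word[:3]      # omitted start defaults to 0
last_four = word[-4:]       # negative indices count from the end
reversed_word = word[::-1]  # a step of -1 walks the string backwards

capitalized = "hello world".capitalize()  # only the first character is upper-cased
swapped = "UNicode".swapcase()            # the case of every letter is inverted

# title() capitalizes after any non-letter, including an apostrophe.
titled = "they're here".title()
# string.capwords() splits on whitespace only, so the apostrophe is left alone.
capworded = string.capwords("they're here")

print(first_three, last_four, reversed_word)  # uni code edocinu
print(capitalized)                            # Hello world
print(swapped)                                # unICODE
print(titled)                                 # They'Re Here
print(capworded)                              # They're Here
```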