How do I find an element by a string that may span multiple child tags?

Question

I am trying to identify a specific element in a document based on a known text string. Normally this is easy to do with:

soup.find(string=re.compile(".*some text string.*"))

However, the known string may be split across one or more child elements. For example, if this is our document:

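# Tag names are assumed for illustration; the text only confirms that "text" sits in a bold child tag.
test_doc = BeautifulSoup("""<h1>Title</h1>
<p>Some <b>text</b></p>
""", "html.parser")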

and I want to find one specific element. The only thing I know about it is that it contains the text "Some text". I do not know that "text" is inside a child bold tag.

test_doc.find(string=re.compile(".*Some text.*"))

This returns None, because "text" is inside a child tag.
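A small sketch of why the search comes back empty, assuming the markup is `<p>Some <b>text</b></p>` (the tag names are illustrative, not confirmed by the post): `string=` is tested against each text node on its own, and here the phrase is split over two text nodes.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup: "text" is wrapped in a bold child tag.
soup = BeautifulSoup("<p>Some <b>text</b></p>", "html.parser")

# Each text node is tested separately, so neither one matches the full phrase.
text_nodes = list(soup.p.strings)
direct_hit = soup.find(string=re.compile("Some text"))

# get_text() joins all descendant text, so the phrase is visible there.
combined = soup.p.get_text()

print(text_nodes, direct_hit, combined)
```

So any approach that works here probably has to compare against the combined text of an element rather than against individual strings.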

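In that case, how can I return the parent tag together with all of its child tags (in my example, the tag enclosing both "Some " and the bold tag) without knowing whether or how the text is split across child tags?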

Marko Topolnik · Answer

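When it is unclear which or how many nested tags may be present, my first thought is a CSS selector with the pseudo-class :-soup-contains("some text"). On its own this can be too broad, because it returns every element whose text contains the string, including all ancestors of the actual match.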
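To make the "too broad" point concrete, here is a minimal sketch, assuming a small document wrapped in an `<html>` element (the markup is illustrative). It lists which elements the selector returns.

```python
from bs4 import BeautifulSoup

# Illustrative document; the tag names are assumptions.
soup = BeautifulSoup(
    "<html><h1>Title</h1><p>Some <b>text</b></p></html>", "html.parser"
)

# :-soup-contains() matches on the combined text of an element,
# so ancestors of the paragraph match as well.
matches = soup.select(':-soup-contains("Some text")')
names = [el.name for el in matches]
print(names)
```

The outer `<html>` element is reported alongside the paragraph, which is why some extra filtering is needed.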

It is not the most elegant or robust approach, but it may lead to a solution: keep only the smallest element groupings that contain the desired text:

from bs4 import BeautifulSoup 

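# Tag names are assumed for illustration; the question only confirms a bold child tag.
test_doc = BeautifulSoup("""
<html>
<h1>Title</h1>
<p>Some <b>text</b></p>
<p>
    Some text different than 
<b>before</b>
</p>
</html>
""", 'html.parser')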

selection = test_doc.select(':-soup-contains("Some text")')

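# Walk the matches in document order: when an element has fewer descendants than
# the entry before it, drop that previous (outer) entry. The list is modified
# while it is being iterated, so this is fragile for more deeply nested documents.
for i, el in enumerate(selection):
    if len(el.find_all()) < len(selection[i - 1].find_all()):
        del selection[i - 1]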

print(selection)

Running the code above gives:

[<p>Some <b>text</b></p>, <p>
    Some text different than 
<b>before</b>
</p>]
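A possibly more robust variant of the same idea, as a sketch only: keep an element when its combined text contains the phrase and none of its direct child elements does. The helper name `innermost_matches` and the sample markup are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def innermost_matches(soup, text):
    """Return elements whose combined text contains `text` while no child element's text does."""
    candidates = soup.find_all(lambda tag: text in tag.get_text())
    return [
        tag
        for tag in candidates
        # If any descendant matched, the direct child leading to it matches too,
        # so checking direct children is enough to detect an inner match.
        if not any(
            text in child.get_text() for child in tag.find_all(recursive=False)
        )
    ]


# Illustrative document with an extra nesting level around the second paragraph.
doc = BeautifulSoup(
    "<html><h1>Title</h1><p>Some <b>text</b></p>"
    "<div><p>Some text different than <b>before</b></p></div></html>",
    "html.parser",
)
result = innermost_matches(doc, "Some text")
print(result)
```

This avoids modifying the list during iteration. One caveat: an element whose own direct text contains the phrase is still skipped when one of its children also contains it.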

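Alternatively, if you know that specific tags are getting in the way of the original approach, you can remove them first with unwrap(). That is also one of the reasons why @Andrej Kesely asked about specific tags.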
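A minimal sketch of the unwrap() route, assuming the interfering tags are inline formatting tags such as `<b>` or `<i>` (the tag list and sample markup are assumptions). After unwrapping, the neighbouring text nodes are still separate, so the sketch also calls smooth(), which is available in recent Beautiful Soup 4 releases, to merge them before searching.

```python
import re
from bs4 import BeautifulSoup

# Illustrative document; the tag names are assumptions.
soup = BeautifulSoup("<h1>Title</h1><p>Some <b>text</b></p>", "html.parser")

# Replace each formatting tag with its own contents.
for tag in soup.find_all(["b", "i", "em", "strong"]):
    tag.unwrap()

# Merge adjacent text nodes so "Some " and "text" become one string.
soup.smooth()

match = soup.find(string=re.compile("Some text"))
parent = match.parent if match is not None else None
print(parent)
```

Note that this changes the parsed tree, so work on a copy if the original markup is still needed afterwards.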

leppie · Answer

Another solution, inspired by @HedgeHog's answer:

from bs4 import BeautifulSoup

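# Tag names are assumed for illustration; the question only confirms a bold child tag.
test_doc = BeautifulSoup(
    """<h1>Title</h1>
<p>Some <b>text</b></p>
<p>Some text different than 
<b>before</b></p>
""",
    "html.parser",
)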


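# find_all() returns matches in document order, so ancestors come before their descendants.
tags = test_doc.find_all(lambda tag: "Some text" in tag.text)
out = []
while tags and (t := tags.pop()):
    # Discard remaining matches that are ancestors of t (at any depth, not only direct parents).
    while tags and tags[-1] in t.parents:
        tags.pop()
    out.append(t)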

print(out)

Output:

[<p>Some text different than 
<b>before</b></p>, <p>Some <b>text</b></p>]

Bergi · Answer

Here is an approach using lxml and XPath that also handles the case where the expected text is contained in a single node.

from lxml import etree
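# Element names other than the div with id="target" are assumed for illustration.
xml = """
<html>
    <h1>Title</h1>
    <div id="target">
        <p>Some <b>text</b></p>
        <p>Some another text</p>
        <p>Some <b>text</b> different than <b>before</b></p>
        <p>Some text</p>
    </div>
</html>
"""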
root = etree.fromstring(xml)

# Use an XPath expression to find the matching elements
ele = root.xpath('//div[@id="target"]//*[(./text()="Some " and .//*[1]/text()="text") or ./text()="Some text"]')
print(ele)

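In the XPath expression, .//*[1]/text()="text" tests whether an element that is the first child of its parent, somewhere below the current context node, has text equal to the expected string. The comparison is case-sensitive, so ./text()="some " would not match anything.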

For the given example, the output is:

[<Element p at 0x...>, <Element p at 0x...>, <Element p at 0x...>]

Extracting the content of the matched elements

print([[t for t in e.xpath('descendant-or-self::text()')] for e in ele])

Output:

[['Some ', 'text'], ['Some ', 'text', ' different than ', 'before'], ['Some text']]
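If the position of the split is not known in advance, a variation on this idea is to compare against the string value of each element, which in XPath 1.0 is the concatenation of all its descendant text. The sketch below is an assumption-laden illustration rather than part of the original answer: the markup is made up, and the expression keeps an element only when no child element also contains the phrase.

```python
from lxml import etree

# Illustrative markup; element names other than div#target are assumptions.
xml = (
    '<html><h1>Title</h1><div id="target">'
    "<p>Some <b>text</b></p>"
    "<p>Some another text</p>"
    "<p>Some <b>text</b> different than <b>before</b></p>"
    "<p>Some text</p>"
    "</div></html>"
)
root = etree.fromstring(xml)

# contains(., ...) tests the element's full string value;
# the not(...) part drops ancestors that merely wrap a matching child.
expr = '//*[contains(., "Some text") and not(*[contains(., "Some text")])]'
matches = root.xpath(expr)
texts = ["".join(m.itertext()) for m in matches]
print(texts)
```

Unlike the exact text() comparisons above, contains() is a substring test, so it also matches longer paragraphs that merely include the phrase.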