HF NLP Learning Notes

Transformers are a set of techniques widely used in deep learning; these notes work through a few tutorials to learn the basics.

First, it helps to understand a few basic concepts; see, for example, the Zhihu article "深度学习推荐系统 | Embedding,从哪里来,到哪里去" (zhihu.com). These notes use the Hugging Face libraries. I also recommend spending time with this community and its tools: the tools are excellent, and so is the open-source community behind them. There are also related tools such as LangChain.

Transformer models

The transformers library is a library for handling tasks with Transformer-based models. Many large models have been built on the Transformer architecture.

Architecture of a Transformer model


Within Hugging Face, the transformers library helps us handle a whole range of NLP tasks. The following is an introduction to this library.

The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer. By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object; if you rerun the command, the cached model is used and nothing needs to be downloaded again.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Output:

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
{'label': 'NEGATIVE', 'score': 0.9994558691978455}]

When some text is passed to a pipeline, three main steps are involved:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The model's predictions are post-processed so that you can make sense of them.

The task arguments available to the pipeline() function include:

  • feature-extraction (get the vector representation of a text)

  • fill-mask

  • ner (named entity recognition)

  • question-answering

  • sentiment-analysis

  • summarization

  • text-generation

  • translation

  • zero-shot-classification


Since I am not very familiar with some of these NLP tasks, it is worth getting a rough idea of what each one does.

Zero-shot classification

We often need to classify text that has not been labelled. This is a common scenario in real projects, because annotating text is usually time-consuming and requires domain expertise. For this task the zero-shot-classification pipeline is very powerful: it lets you specify the labels to use for classification directly, so you do not have to rely on the labels of the pretrained model.

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Since no model or tokenizer is specified for this pipeline, a default model and tokenizer are downloaded. The checkpoint ships with a matching tokenizer, which converts the text into token IDs from which the model then produces embeddings.
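
If you want to control which checkpoint is used rather than relying on the default, you can pass a model and tokenizer to pipeline() explicitly. A minimal sketch, assuming the facebook/bart-large-mnli checkpoint (a common choice for zero-shot classification; treat the exact name as an assumption here):

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the checkpoint and its matching tokenizer explicitly,
# then hand both to the pipeline instead of using the defaults
checkpoint = "facebook/bart-large-mnli"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)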

Text generation

The model auto-completes a passage by generating the rest of the text, similar to the predictive-text feature on many phones. Text generation involves randomness, so results vary from run to run.

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
from transformers import pipeline

# Specify a particular checkpoint instead of the default model
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Mask filling

The goal of this task is to fill in the blanks in a given text:

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

Named entity recognition

Named entity recognition (NER) is a task in which the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

Question answering

Answer questions using information from a given context:

from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

Summarization

from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
"""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.

Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
)

Translation

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Pipeline

The full NLP pipeline

Preprocessing with a tokenizer

Like other neural networks, Transformer models cannot process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers the model can make sense of. For this we use a tokenizer, which is responsible for:

  • Splitting the input into words, subwords, or symbols (such as punctuation), called tokens
  • Mapping each token to an integer
  • Adding additional inputs that may be useful to the model
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Going through the model

We can download a pretrained model the same way we downloaded the tokenizer. Transformers provides an AutoModel class, which also has a from_pretrained() method.

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
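
The plain AutoModel only returns hidden states (context-aware representations), not task-specific predictions. To complete the three pipeline steps described earlier, here is a minimal sketch of the model-plus-postprocessing part, assuming the same sentiment checkpoint and reusing the inputs dict produced by the tokenizer above; the label mapping comes from the model config.

import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# A model with a sequence-classification head returns logits instead of raw hidden states
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)  # `inputs` comes from the tokenizer step above

# Postprocessing: turn logits into probabilities, then map them to label names
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
print(model.config.id2label)  # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}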

Here I also want to note how embeddings are commonly generated these days, which is closely tied to the Transformer. Notes on using Hugging Face and its tutorials will continue to be updated later.

Embedding

An embedding represents meaning as numbers, in a form the model can understand.


How to embed a sentence

A simple approach: embed each word separately, then take the mean (or sum) of all the word embeddings. However, this cannot distinguish word order, as the sketch below shows.
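
To make the order problem concrete, here is a tiny sketch with made-up (hypothetical) word vectors: averaging them gives exactly the same sentence embedding no matter how the words are ordered.

import numpy as np

# Hypothetical toy word vectors, only for illustration
word_vectors = {
    "the": np.array([0.1, 0.0]),
    "dog": np.array([0.9, 0.3]),
    "bites": np.array([0.2, 0.8]),
    "man": np.array([0.7, 0.5]),
}

def embed(sentence):
    # Average the vectors of the words in the sentence
    return np.mean([word_vectors[w] for w in sentence.split()], axis=0)

# Averaging is order-invariant: two sentences with opposite meanings get the same embedding
print(embed("the dog bites the man"))
print(embed("the man bites the dog"))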

The approach commonly used today: use a Transformer network to compute a context-aware representation of each word (or, more often, of each token rather than each word), then take the mean of these representations; a mean-pooling sketch follows below. One can also continue training the Transformer on one's own data so that similar sentences end up closer to each other.
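
A minimal sketch of the mean-pooling approach with the transformers library, assuming a small sentence-encoder checkpoint (sentence-transformers/all-MiniLM-L6-v2 here, but any BERT-like encoder works the same way); padding positions are masked out before averaging.

import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

sentences = ["I love machine learning.", "Machine learning is great."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean pooling: average the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (2, hidden_size)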

Multimodal embeddings are also a relatively new development: images and text are embedded into the same space, as in the CLIP model, sketched below.
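
A minimal sketch of scoring an image against candidate texts with CLIP via the transformers library; the openai/clip-vit-base-patch32 checkpoint and the example image URL are assumptions here.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image; any local image would work as well
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into probabilities over the texts
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)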

References

  1. https://huggingface.co/learn/nlp-course
  2. https://learn.deeplearning.ai/google-cloud-vertex-ai