HF NLP Learning Notes

Transformers are a set of techniques widely used in deep learning; these notes work through a few tutorials to learn the basics.

First, it helps to understand a few basic concepts; see, for example, the Zhihu article "深度学习推荐系统 | Embedding,从哪里来,到哪里去" (zhihu.com). These notes use the Hugging Face libraries. I also recommend spending time with this community and its tools: the tools are excellent, and so is the open-source community behind them. There are also related tools such as LangChain.

Transformer models

The transformers library is a library for handling tasks with Transformer-based models. Many large models have been built on the Transformer architecture.

Architecture of a Transformer model


Within Hugging Face, the transformers library helps us handle a whole range of NLP tasks. The following is an introduction to this library.

The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer. By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object; if you rerun the command, the cached model is used and nothing needs to be downloaded again.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Output:

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
{'label': 'NEGATIVE', 'score': 0.9994558691978455}]

When some text is passed to a pipeline, three main steps are involved:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The model's predictions are post-processed so that you can make sense of them.

The task arguments available to the pipeline() function include:

  • feature-extraction (get the vector representation of a text)

  • fill-mask

  • ner (named entity recognition)

  • question-answering

  • sentiment-analysis

  • summarization

  • text-generation

  • translation

  • zero-shot-classification


Since I am not very familiar with some of these NLP tasks, it is worth getting a rough idea of what each one does.

Zero-shot classification

We often need to classify text that has not been labelled. This is a common scenario in real projects, because annotating text is usually time-consuming and requires domain expertise. For this task the zero-shot-classification pipeline is very powerful: it lets you specify the labels to use for classification directly, so you do not have to rely on the labels of the pretrained model.

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Since no model or tokenizer is specified for this pipeline, a default model and tokenizer are downloaded. The checkpoint ships with a matching tokenizer, which converts the text into token IDs from which the model then produces embeddings.
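
If you want to control which checkpoint is used rather than relying on the default, you can pass a model and tokenizer to pipeline() explicitly. A minimal sketch, assuming the facebook/bart-large-mnli checkpoint (a common choice for zero-shot classification; treat the exact name as an assumption here):

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the checkpoint and its matching tokenizer explicitly,
# then hand both to the pipeline instead of using the defaults
checkpoint = "facebook/bart-large-mnli"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)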

Text generation

The model auto-completes a passage by generating the rest of the text, similar to the predictive-text feature on many phones. Text generation involves randomness, so results vary from run to run.

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
from transformers import pipeline

# Specify a particular checkpoint instead of the default model
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Mask filling

The goal of this task is to fill in the blanks in a given text:

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

Named entity recognition

Named entity recognition (NER) is a task in which the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

Question answering

Answer questions using information from a given context:

from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

Summarization

from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
"""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.

Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
)

Translation

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Pipeline

The full NLP pipeline

Preprocessing with a tokenizer

Like other neural networks, Transformer models cannot process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers the model can make sense of. For this we use a tokenizer, which is responsible for:

  • Splitting the input into words, subwords, or symbols (such as punctuation), called tokens
  • Mapping each token to an integer
  • Adding additional inputs that may be useful to the model
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Going through the model

We can download a pretrained model the same way we downloaded the tokenizer. Transformers provides an AutoModel class, which also has a from_pretrained() method.

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
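
The plain AutoModel only returns hidden states (context-aware representations), not task-specific predictions. To complete the three pipeline steps described earlier, here is a minimal sketch of the model-plus-postprocessing part, assuming the same sentiment checkpoint and reusing the inputs dict produced by the tokenizer above; the label mapping comes from the model config.

import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# A model with a sequence-classification head returns logits instead of raw hidden states
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)  # `inputs` comes from the tokenizer step above

# Postprocessing: turn logits into probabilities, then map them to label names
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
print(model.config.id2label)  # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}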

Here I also want to note how embeddings are commonly generated these days, which is closely tied to the Transformer. Notes on using Hugging Face and its tutorials will continue to be updated later.

Embedding

An embedding represents meaning as numbers, in a form the model can understand.


How to embed a sentence

A simple approach: embed each word separately, then take the mean (or sum) of all the word embeddings. However, this cannot distinguish word order, as the sketch below shows.
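
To make the order problem concrete, here is a tiny sketch with made-up (hypothetical) word vectors: averaging them gives exactly the same sentence embedding no matter how the words are ordered.

import numpy as np

# Hypothetical toy word vectors, only for illustration
word_vectors = {
    "the": np.array([0.1, 0.0]),
    "dog": np.array([0.9, 0.3]),
    "bites": np.array([0.2, 0.8]),
    "man": np.array([0.7, 0.5]),
}

def embed(sentence):
    # Average the vectors of the words in the sentence
    return np.mean([word_vectors[w] for w in sentence.split()], axis=0)

# Averaging is order-invariant: two sentences with opposite meanings get the same embedding
print(embed("the dog bites the man"))
print(embed("the man bites the dog"))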

The approach commonly used today: use a Transformer network to compute a context-aware representation of each word (or, more often, of each token rather than each word), then take the mean of these representations; a mean-pooling sketch follows below. One can also continue training the Transformer on one's own data so that similar sentences end up closer to each other.
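
A minimal sketch of the mean-pooling approach with the transformers library, assuming a small sentence-encoder checkpoint (sentence-transformers/all-MiniLM-L6-v2 here, but any BERT-like encoder works the same way); padding positions are masked out before averaging.

import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

sentences = ["I love machine learning.", "Machine learning is great."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean pooling: average the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (2, hidden_size)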

Multimodal embeddings are also a relatively new development: images and text are embedded into the same space, as in the CLIP model, sketched below.
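
A minimal sketch of scoring an image against candidate texts with CLIP via the transformers library; the openai/clip-vit-base-patch32 checkpoint and the example image URL are assumptions here.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image; any local image would work as well
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into probabilities over the texts
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)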

References

  1. https://huggingface.co/learn/nlp-course
  2. https://learn.deeplearning.ai/google-cloud-vertex-ai