M3E/OpenAi+vearch内容查重实践 | 京东云技术团队

站长

2023年09月20日 14:23 · 阅读数 83

一、实践背景介绍

1、业务背景

京东健康内容中台H2有一个目标就是需要替换两家CP内容（总体内容体量百万级），我们现在的逻辑是想按照PV热度优先高热去新生产和替换。替换后可以极大的节省cp内容引入的成本。

第一步：这么多内容，我们的生产逻辑需要按照学科和索引归类和分配，进而批量生产，靠人工一篇篇补索引，效率会很低。希望借助算法的能力，如果现在还不是非常准确，也可以算法+人工修正，

第二步：按索引归类好之后，我们和库内非CP但主题相似内容进行比对，已经有的就不做重复生产。最后剩下来的进行批量生产和替换。

2、技术背景

M3E（M3E（Multimodal Multitask Meta-Embedding）是一个开源的中文嵌入模型

Vearch 是对大规模深度学习向量进行高性能相似搜索的弹性分布式系统。也是京东自研开源的项目，具有强大的相似搜索的弹性分布式能力。

OpenAI的迅速发展对算法成本产生了重大影响。随着技术的进步和研究的不断推进，OpenAI已经取得了许多突破，使得算法的开发和部署成本大大降低。OpenAI的Chat模式和Embedding模式是OpenAI API中的两种不同的使用方式。

1、Chat模式： Chat模式是OpenAI API的一种使用方式，旨在支持对话式的人机交互。在Chat模式下，您可以通过向API发送一系列的用户消息来与模型进行交互，模型将逐条回复每个消息。这种交互式的方式使得您可以与模型进行对话，提出问题、请求解释、寻求建议等。

import openai

response = openai.Completion.create(
  engine="davinci",
  prompt="What is the capital of France?",
  max_tokens=100,
  n=1,
  stop=None,
  temperature=0.7
)

print(response.choices[0].text.strip())

2、Embedding模式： Embedding模式是OpenAI API的另一种使用方式，旨在获取文本的嵌入表示。在Embedding模式下，您可以将一段文本传递给API，并获取该文本的高维向量表示，也称为嵌入向量。这些嵌入向量可以用于计算文本之间的相似度、聚类、分类等任务。

import openai

response = openai.Embed.create(
  model="text-embedding-ada-002",
  documents=["Once upon a time", "In a land far, far away"],
)
embedding1 = response.embeddings[0]
embedding2 = response.embeddings[1]

# 进行嵌入向量的相似度计算等其它操作

本次实践主要使用了Embedding，具体实践如下文。

二、实践流程

1、总体流程

(1)、总体流程图

M3E/OpenAi+vearch内容查重实践 | 京东云技术团队

(2)、OpenAi/M3E向量生成部分代码实践

async def embed_and_store_with_limit_and_check(
        self, semaphore, id, vector_store,  text_future_func = None, text: Union[str, list[str]] = "", **additional_properties
    ):
        async with semaphore:
            retry_count = (
                3  # Task failed with exception Response payload is not completed
            )
            retry_count_doubled = False
            retry = 1
            last_error = None
            while retry <= retry_count:  # Retry up to 3 times.
                try:
                    try:
                        data = await vector_store.get(vector_id=id)
                        id = data.id
                        embedding = data.result.embedding.feature
                        return (id, embedding)
                    except VearchRouterGetNotFoundError:
                        try:
                            return await self.embed_and_store(
                                text=text,
                                id=id,
                                vector_store=vector_store,
                                text_future_func=text_future_func,
                                **additional_properties,
                            )
                        except asyncio.TimeoutError:
                            logger.error(
                                f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - Timeout during embed_and_store()"
                            )
                            raise
                except Exception as error:
                    error_message = f"{error}" or f"{error.__class__} {error.__doc__}"
                    logger.error(
                        f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - failed with exception {error_message}, retry {retry}"
                    )
                    if isinstance(error, VearchRouterStatusError):
                        if error.reason == "partition_not_leader":
                          logger.info(
                              f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - {error_message}, retry {retry} asyncio.sleep(10) doubled"
                          )
                          await asyncio.sleep(10)  # Response payload is not completed
                          if not retry_count_doubled:
                              retry_count = retry_count * 2
                              retry_count_doubled = True
                    if isinstance(error, aiohttp.client_exceptions.ClientPayloadError):
                        await asyncio.sleep(5)  # Response payload is not completed
                        if not retry_count_doubled:
                            retry_count = retry_count * 2
                            retry_count_doubled = True
                    else:
                        await asyncio.sleep(1)  # Wait for 1 second before retrying
                    retry = retry + 1
                    last_error = error

            raise VearchRouterClientRetryError(
                retry_count,
                f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - completely failed with exception {last_error} - retried {retry_count} times",
                error=last_error,
            )

(3)、vearch向量存储及相似度搜索部分代码实

async def score_similarity(
        self, vector_store, embedding=None, id=None, **search_properties
    ):
        """Find the most similar word and the similarity score for a given word in the document"""
        if not isinstance(embedding, list):
            try:
                results_with_scores = await vector_store.search_by_ids(ids=[id])
                # embedding = response.result.embedding.feature
                return results_with_scores.results[0].hits.hits
            except VearchRouterStatusError as error:
                raise error
                # if error.found == False:
                #   query_result = await embeddings.embed_query(word)

        results_with_scores = await vector_store.search(
            feature=embedding, **search_properties
        )

        return results_with_scores.hits.hits

2、OpenAi实现查重的局限性

(1)、成本

以目前100万数据量为例，如果使用目前OpenAi的开放接口实现，每篇内容由于token等限制进出一次需要0.007美元，100万篇内容需要7000美元才可以完成数据特征提取和向量生成，依照目前的内容体量和运用，这个成本还是高于预期，在成本方面没有比其他方案有优势。

(2)、效率

同样以100万数据为例，一篇内容特征提取和向量生成的时间由于国内各种限制，时间最快也在6-9s，即便是在并发以及多token的情况下，那100万内容执行完成最少也大于30天，这在实效性方面相比于其他方案也不占优势。

3、M3E模型引入

(1)、模型调研介绍

M3E（Moka Massive Mixed Embedding）是一个开源的中文嵌入模型，具有以下优势：

多模态支持：M3E模型能够同时处理多种模态的数据，如文本、图像、语音等。这种多模态的支持使得模型能够更好地处理复杂的现实场景，提供更全面的语义理解。

多任务学习：M3E模型支持同时学习多个任务，而不需要针对每个任务单独训练一个模型。通过共享模型的参数和特征表示，M3E能够将不同任务之间的知识相互传递和共享，提高学习效率和泛化能力。

元嵌入学习：M3E模型采用元学习的思想，通过在训练过程中模拟快速学习新任务的过程，使模型能够更好地适应新任务。这种元学习的能力使得M3E模型在面对新任务时能够从少量样本中快速学习并取得良好的性能。

中文语义理解：M3E模型专注于中文语义理解任务，具有针对中文语言特点的优化。这使得M3E模型在处理中文文本时能够更好地捕捉语义信息，提供更准确的嵌入表示。

开源和可定制性：M3E模型是开源的，可以根据具体需求进行定制和扩展。开放源代码使得用户可以自由地修改和优化模型，以适应不同的应用场景。

模型对比：

	参数数量	维度	中文	英文	s2s	s2p	s2c	开源	兼容性	s2s Acc	s2p ndcg@10
m3e-small	24M	512	是	否	是	否	否	是	优	0.5834	0.7262
m3e-base	110M	768	是	是	是	是	否	是	优	0.6157	0.8004
text2vec	110M	768	是	否	是	否	否	是	优	0.5755	0.6346
openai-ada-002	未知	1536	是	是	是	是	是	否	优	0.5956	0.7786

(2)、M3E选择的必要

a、实践过程中在不牺牲准确度的情况下向量维度长度短，节省存储空间和带宽，且在和vearch向量库结合使用的过程中发现768维度的向量生在查询和存储时表现的更优越。

b、模型非商业开源并且可以本地微调模型，有效结合业务场景进行

c、可以有针对性的根据数据规模和场景优化和分配资源，定时高效的达到业务预期效果目标。

d、兼容性，代表了模型在开源社区中各种项目被支持的程度，由于 m3e 和 text2vec 都可以直接通过 sentence-transformers 直接使用，所以和 openai 在社区的支持度上相当

e、使用场景主要是中文，少量英文的情况，建议使用 m3e 系列的模型，M3E 在大规模句对数据集上的训练，包含中文百科，金融，医疗，法律，新闻，学术等多个领域共计 2200W 句对样本，数据集详见M3E 数据集

f、模型持续优化中，开发过程中可以持续提高数据质量，后续可期待更加优秀的模型。

(3)、运用

pip3 install -i https://mirrors.jd.com/pypi/simple sentence-transformers==2.2.2

#### Download m3e-base
python3 -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('moka-ai/m3e-base'); print(model.encode(['Hello World!', '你好,世界!']))"

#### Save m3e-base to local path
python3 -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('moka-ai/m3e-base'); model.save('m3e-base-model/')"

代码示例：
async def embed (self, text_or_documents):
      if isinstance(text_or_documents, list):
          documents = text_or_documents
      else:
          """Split the text_or_documents, embed the documents and insert the embedding into the online vector storage"""
          text_splitter = LocalTextSplitter().get_instance() # self._tokenizer = spacy.load(pipeline)
          documents = text_splitter.split_text(text_or_documents)

      
      embedding_return = await self._async_***_with_****(documents=documents)
      if len(embedding_return) > 1:
          # Compute the mean vector
           **************

          # Normalizing the mean vector
           *************
      
      return embedding

向量生成示例：
{"_index":"content_gpt_db","_type":"content_space_m3e","_id":867602,"found":true,"_source":{"content_type":1,"embedding":{"feature":[0.04050827,0.021327972,-0.0051002502,0.017009735,-0.016672134,-0.01061821,0.026807785,-0.018224716,-0.03107071,-0.0053977966,0.043376923,0.028705597,0.004207611,-0.020687103,-0.0447731,-0.009578705,0.05571747,0.06632233,-0.051948547,-0.013450623,-0.032985687,-0.008350372,-0.043361664,-0.02400589,-0.019294739,-0.023269653,0.005455017,0.0059661865,0.008682251,-0.023887634,0.046310425,-0.036338806,-0.0020313263,0.0062503815,0.05295372,0.026079178,0.011068344,-0.028791428,0.029096603,0.030740738,0.026367188,0.052009583,-0.009216309,-0.004173279,0.0009822845,0.018190384,0.033262253,0.05126381,0.012481689,0.005584717,-0.011810303,0.35385132,-0.043067932,0.0099105835,-0.014457703,0.038978577,0.022174835,-0.039844513,-0.012966156,-0.011081696,0.009370804,-0.024477005,-0.01061058,0.0028133392,-0.009471893,-0.027820587,-0.041484833,0.011547089,0.009700775,-0.05132675,0.06669235,-0.06849289,-0.0129470825,0.004447937,0.074913025,0.008506775,-0.033031464,0.017101288,0.045627594,-0.009830475,0.02917099,0.030750275,-0.017490387,-0.016429901,-0.042669296,-0.014154434,0.0004749298,0.049741745,0.07151413,-0.012218475,-0.013538361,-0.016918182,0.016963959,-0.015842438,-0.03572464,-0.034015656,0.046806335,-0.001625061,-0.006690979,0.040275574,-0.035312653,0.008182526,-0.024295807,-0.047908783,0.023643494,0.054634094,-0.07056427,0.04160309,-0.014863968,0.00399971,0.025701523,-0.0082912445,-0.022632599,0.0016212463,-0.059513092,-0.022808075,-0.008533478,-0.052440643,0.037700653,-0.045360565,0.0012359619,0.06803894,-0.04005432,-0.02885437,-0.032421112,0.010250092,-0.0092430115,0.055828094,-0.05140686,-0.0019073486,0.012435913,-0.04206848,-0.08063507,-0.016105652,-0.00031280518,-0.005180359,0.002243042,-0.009155273,-0.044174194,-0.007598877,0.015665054,0.015577316,0.006883621,-0.031778336,-0.017795563,0.016918182,0.019405365,0.0077323914,-0.012916565,-0.007698059,0.031211853,-0.048286438,0.017166138,0.0033416748,-0.02381897,0.03614807,-0.014591217,0.06523514,-0.04491043,-0.05462265,0.029396057,0.03844452,0.011238098,-0.051124573,-0.024749756,0.0068511963,0.0137786865,-0.033081055,-0.0028033257,0.0011496544,-0.012090206,-0.013271809,-0.018554688,-0.019104004,-0.004699707,-0.11206055,0.007501602,0.0144023895,-0.019788742,0.028829575,-0.03552246,0.028182983,-0.027923584,0.014785767,-0.032590866,-0.0011997223,0.003458023,0.036985397,-0.012435913,-0.040542603,-0.034469604,-0.0028839111,-0.014625549,0.014442444,0.06880951,0.01688385,-0.044792175,-0.014442444,-0.01712799,0.024909973,0.036842346,-0.015365601,0.032600403,-0.023117065,-0.017802238,-0.011162758,0.021027565,-0.0071382523,0.0023880005,0.016410828,-0.07878876,-0.033210754,0.029317856,0.037729263,-0.013490677,0.01420784,-0.076553345,0.03074646,0.020904541,-0.016113281,-0.008716583,-0.058559418,-0.03612137,-0.029781342,-0.03557396,-0.026613235,-0.0034923553,0.033971786,0.01530838,0.019039154,0.05249405,-0.06877518,-0.05325699,-0.054332733,0.022380829,0.0017127991,-0.00060653687,0.003200531,-0.05033493,0.031169891,-0.027420044,0.07209778,0.03919983,0.023788452,-0.03340912,0.038368225,-0.011619568,-0.049583435,0.023187637,-0.031404495,0.001543045,0.011007309,0.03263092,0.0027999878,-0.029151917,-0.03868866,-0.01224041,-0.006829262,-0.014925957,-0.008881569,-0.0025873184,0.012497902,0.018328667,0.0066041946,-0.03035736,-0.0110321045,-0.03830719,-0.026245117,-0.03142929,-0.007991791,0.019321442,-0.021755219,-0.008829117,-0.050519943,0.010892868,-0.015569687,0.0134391785,0.02917862,0.00075912476,-0.09794235,0.011421204,0.04624176,0.066841125,-0.0044174194,-0.019325256,-0.0010528564,-0.03643036,-0.025726318,-0.014377594,-0.024211884,-0.03343582,0.020572662,0.027690887,0.0475502,0.03835678,-0.043956757,-0.00034713745,0.048107147,0.025608063,-0.014255524,0.028633118,-0.07511139,-0.048667908,0.0210495,0.06496048,0.013729095,-0.0051841736,0.016643524,-0.022533417,0.0012626648,0.034671783,-0.029605865,-0.011131287,0.0044937134,-0.065330505,0.019874573,-0.05259323,0.00045394897,-0.008098602,0.01354599,0.05250168,0.07034683,-0.0058631897,0.07423782,0.011419296,-0.037618637,0.01867485,0.000062942505,0.004085541,0.038211823,0.019878387,-0.0754509,0.0065402985,0.0045223236,0.030115128,0.0017757416,-0.014886856,-0.011007309,0.026533127,0.033769608,-0.051013947,0.035007477,0.05788803,-0.049877167,-0.037107468,0.0016613007,0.015481949,-0.02353859,-0.039718628,-0.04598999,-0.044052124,0.010528564,-0.028961182,-0.016166687,0.0015945435,-0.013336182,0.032533646,0.018568039,0.03763771,0.025045395,-0.052635193,-0.051948547,-0.062217712,0.08403778,0.0012397766,-0.0012321472,0.056552887,-0.027065277,0.04188156,-0.03208542,0.06875229,0.0647316,-0.013954163,-0.022972107,0.11660004,0.032203674,-0.031936646,0.0020599365,-0.020370483,-0.06651306,0.0062942505,-0.049430847,0.04660797,0.020118713,-0.031578064,-0.005180359,-0.053260803,-0.027565002,-0.031951904,-0.041366577,-0.0025939941,-0.008529663,0.012207031,-0.06890869,0.01940918,0.039123535,-0.008434296,0.033107758,0.0352211,0.020793915,0.0071353912,-0.028520584,-0.030920029,-0.008180618,0.070114136,-0.014175415,-0.0012359619,0.000045776367,0.08629227,-0.051700592,-0.07754135,-0.016498566,-0.015331268,-0.044864655,-0.04217148,-0.005420685,-0.008460999,-0.038154602,0.05747223,0.020240784,0.007413864,0.009027481,0.026922226,-0.018918991,0.012096405,0.04254532,-0.05728531,-0.010662079,0.02876091,-0.019536972,0.01614952,-0.0005931854,0.044952393,-0.00390625,0.02508545,0.03439331,0.008852005,0.022172928,-0.00008201599,-0.0032863617,-0.05140686,0.005859375,0.053024292,0.025146484,-0.019942284,-0.011334419,0.01258564,0.015990257,-0.02166748,0.036453247,0.039978027,-0.033798218,0.00076675415,-0.005138397,0.004749298,0.029026031,0.0323925,-0.025564194,0.025335312,-0.030546188,-0.04391861,0.018421173,-0.011249542,0.04883194,0.01543808,0.02312851,-0.032764435,-0.026203156,0.019647598,0.018751144,-0.009168625,0.048986435,0.015720367,0.021831512,-0.03219223,-0.026844025,0.0060043335,-0.026107788,-0.046318054,-0.04046631,0.035526276,0.0024375916,-0.05537033,-0.02425003,-0.04340744,-0.0066947937,0.0019111633,-0.019908905,0.0008430481,-0.038669586,-0.034023285,-0.0014533997,0.00793457,-0.045150757,-0.03302002,-0.020614624,-0.005558014,0.069065094,-0.039173126,-0.00825119,0.03167534,0.018571854,-0.006723404,0.015237808,-0.021053314,-0.016643524,-0.02035141,0.009143829,0.00017166138,0.04996872,0.08148575,-0.008792877,0.018224716,0.01874733,0.008649826,-0.026594162,-0.032094955,0.039243698,0.03283882,0.027730942,0.030176163,-0.04026985,0.015901566,0.033468246,0.013085365,-0.0065927505,0.011677742,-0.013127327,-0.02519226,0.04988098,-0.013015747,0.015609741,0.014896393,0.023586273,0.016117096,0.040584564,0.01984787,0.004398346,-0.0089530945,-0.03900528,-0.0024147034,0.037326813,-0.008106232,-0.052898407,-0.0038452148,-0.05821228,-0.02015686,-0.001739502,-0.013622284,-0.017688751,-0.05283737,0.020702362,-0.050605774,0.027381897,0.0316391,0.0024490356,-0.055805206,-0.056484222,0.023387909,-0.02993393,0.019495964,-0.012732506,-0.008210182,0.01850605,-0.04762268,0.081466675,0.005874634,-0.010238647,0.019134521,-0.004508972,-0.012359619,0.025794983,0.04028511,0.025411606,-0.03328514,0.0031719208,-0.01725769,-0.051498413,-0.035949707,0.010955811,0.008583069,0.06630707,-0.005821228,-0.0024795532,0.03709793,0.013637543,0.022525787,-0.06563187,0.053359985,0.0039367676,-0.060836792,0.04824829,0.027780533,0.03645134,0.013780594,0.02977562,0.017705917,-0.00057029724,-0.034914017,-0.019468307,-0.026908875,0.067222595,0.05558014,-0.021064758,0.031835556,-0.04665947,0.051054,-0.00028038025,0.029193878,0.003993988,-0.07110214,0.06306076,0.014007568,-0.01714325,0.035003662,-0.004722595,0.014993668,0.03897667,-0.023054123,-0.006303787,-0.017751694,0.002111435,-0.008413315,0.017080307,-0.06581879,-0.008491516,0.12903595,-0.006996155,0.05880356,-0.02943039,0.020183563,-0.018550873,0.06975937,0.03355789,0.03824997,0.04037857,-0.046398163,0.006954193,-0.029689789,0.029582977,0.07313156,-0.005428314,-0.045841217,-0.025279999,0.0048294067,0.013130188,0.059028625,0.022529602,0.031074524,-0.011817932,-0.0047683716,-0.014060974,0.031232834,-0.0031795502,-0.018915176,-0.015424728,0.04899597,-0.0131073,-0.023361206,-0.046707153,-0.012523651,-0.0008125305,0.08478165,-0.062747955,-0.026260376,-0.060684204,0.011657715,0.013763428,-0.009056091,0.05002594,-0.004814148,0.0046463013,-0.0072250366,-0.015556335,-0.037773132,0.0308609,0.012107849,0.032539368,0.03591156,-0.0512619,-0.048412323,-0.012073517,-0.005519867,-0.072574615,-0.041452408,-0.040891647,-0.017946243,0.019388199,0.018611908,0.028507233,0.041683197,0.019443512,-0.019191742,0.035518646,-0.017742157,0.07847214,-0.040740967,0.031051636,-0.035736084,0.010360718,0.03430748,0.008317947,0.044736862,-0.0071315765,-0.01648426,-0.008883476,-0.020913124,-0.005423546,-0.009973526,-0.02460289,-0.044252396,-0.032361984,0.054714203,0.00091934204,0.059459686,0.0034065247,0.06443405,-0.027736664,0.003993988,0.036701202,-0.035736084,0.018554688,0.029144287,-0.019836426,0.069698334,0.021060944,0.012462616,0.023517609,0.0021858215,0.02639389,0.031742096,-0.033161163,-0.034664154,-0.084918976,0.027759552,0.030056,0.00016021729,0.008415222,-0.02822113,0.084098816,-0.034959793,-0.024831772,0.020299911,-0.029752731,-0.044506073,0.004787445,0.017642975,0.01127243,0.055496216,0.01977539,-0.038375854,0.013122559,0.035747528,-0.003780365,-0.0005226135,-0.016674042,-0.045539856,-0.039131165,-0.024177551,0.0366745,-0.049545288,0.010528564,0.033737183,-0.04852295,-0.03115654,-0.049951553,-0.017721176,-0.00032234192],"source":""}}}

4、vearch数据库向量存储

(1)、vearch详细介绍

Vearch 是对大规模深度学习向量进行高性能相似搜索的弹性分布式系统。具有以下功能：

1、支持CPU与GPU两种版本。

2、支持实时添加数据到索引。

3、支持单个文档定义多个向量字段, 添加、搜索批量操作。

4、支持数值字段范围过滤与string字段标签过滤。

5、支持IVFPQ、HNSW、二进制等索引方式(HNSW、二进制方式4月下旬发布)。

6、支持Python SDK本地快速开发验证。

7、支持机器学习算法插件方便系统部署使用。

Vearch京东自研开源的项目，具有强大的相似搜索的弹性分布式能力。

(2)、向量存储

vearch_instance = VearchInstance(vearch_llm_instance=vearch_llm_instance)

import random
async def embed_content (      
      content_generator,
      concurrent_task_limit,
      vearch_instance,
      pbar
      ):
    semaphore = asyncio.Semaphore(concurrent_task_limit)

@handle_error_and_log
    @handle_client_response_type_check
    async def insert(
        self, db_name, space_name, vector_id, **vector_properties
    ) -> VearchRouterOperationResponse:
        if "feature" in vector_properties:
            properties = {**vector_properties}
            del properties["feature"]
            return await self.router.insert(
                db_name,
                space_name,
                vector_id,
                embedding={  # NOTE/FUTURE hard coded
                    "feature": vector_properties["feature"]
                },
                **properties,
            )
        return await self.router.insert(
            db_name, space_name, vector_id, **vector_properties
        )

(3)、相似度查询

查询语句：
http://jdh-content-gpt-vector-router.vectorbase.svc.ht09.n.jd.local/content_gpt_db/content_space_m3e/_search
{
    "query":{
        "ids":[
            580670
        ],
        "sum":[
            {
                "field":"embedding",
                "feature":[

                ]
            }
        ]
    },
    "retrieval_param":{
        "parallel_on_queries":1,
        "recall_num":100,
        "nprobe":80,
        "metric_type":"InnerProduct"
    },
    "is_brute_search":0,
    "online_log_level":"debug",
    "quick":false,
    "vector_value":false,
    "client_type":"leader",
    "l2_sqrt":true,
    "size":10
}

三、查重结果及M3E、OpenAi查重相似度效果比较

1、查重相似度验证结果展示


import asyncio
import aiofiles
import os
import openpyxl
import json
import sys
import re

# from langchain.document_loaders import TextLoader
# from langchain.schema import Document
import numpy as np
import aiohttp
import logging
import asyncio

# Get the directory containing the current file
current_dir = os.path.dirname(os.path.abspath(__file__))

# Get the parent directory (project root directory)
project_root_dir = os.path.dirname(current_dir)

# Add it to sys.path
sys.path.append(project_root_dir)

from shared.VearchInstance import VearchInstance

logger = logging.getLogger(__name__)

async def async_os_walk(root_dir):
    """A simple, async version of os.walk."""
    for root, dirs, files in os.walk(root_dir):
        for filename in files:
            yield root, filename


"""Main execution function"""
from shared.TerminalColor import bcolors

async def main():
    # from shared.VearchOpenAI import VearchOpenAI
    from shared.VearchM3e import VearchM3e
    vearch_instance = VearchInstance(VearchM3e)
    content_vector_store = vearch_instance.content_vector_store

    root_logger = logging.getLogger("")
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    force_recreate_spaces = True
    await vearch_instance.client.prewarm()
    if force_recreate_spaces:
      await vearch_instance.ensure_empty()
    else:
      await vearch_instance.ensure()

    # Limit to 10 concurrent tasks.
    concurrent_task_limit = 16
    semaphore = asyncio.Semaphore(concurrent_task_limit)

    vearch_instance.log_configurations(
        "===Concurrency===",
        f"max concurrent requests: {concurrent_task_limit}",
        f"semaphore: {semaphore}",
        root_logger=root_logger,
    )

    # Define paths to the index and content data, and the label file
    new_content_to_analyze_dir = os.path.join(current_dir, "./data/content/")

    # Load the label data from the Excel file
    wb = openpyxl.load_workbook(os.path.join(current_dir, "./data/content.xlsx"))
    sheet = wb.active

    content_id_dict = {}

    async def process_row(row):
        preowned_article_body = row[0].value
        preowned_article_id = vearch_instance.generate_id(preowned_article_body)

        root_logger.info(
            "preowned_article_body: {}, preowned_article_id: {}".format(
                preowned_article_body[:20], preowned_article_id
            )
        )

        content_id_dict[preowned_article_id] = preowned_article_body[:50]

        return await vearch_instance.llm.embed_and_store_with_limit_and_check(
            semaphore=semaphore,
            text=preowned_article_body,
            id=preowned_article_id,
            vector_store=content_vector_store,
            content_type=vearch_instance.content_type_look_up("preowned_article"),
        )

    await asyncio.gather(*[process_row(row) for row in sheet.iter_rows()])

    # Asynchronously walk through every file in the root directory
    async for dirpath, filename in async_os_walk(new_content_to_analyze_dir):
        # Asynchronously build the search index for the document with filename in the dirpath
        content_file_path = os.path.join(dirpath, filename)
        match = re.search(r"(\d+)", filename)
        if match:
            content_id = int(match.group(1))
        else:
            content_id = vearch_instance.generate_id(content_file_path)

        root_logger.info("filename: {}, content_id: {}".format(filename, content_id))
        
        text = await vearch_instance.llm.load_file(file_path=content_file_path)
        embedding = await vearch_instance.llm.embed(text)

        # Asynchronously get the most similar texts and their similarity score for the label
        search_result = await vearch_instance.llm.score_similarity(
            embedding=embedding, vector_store=content_vector_store, min_score=-0.1
        )

        sorted_search_result = sorted(
            search_result, key=lambda hit: hit.score, reverse=True
        )

        for preowned_article in sorted_search_result:
            if preowned_article.id in content_id_dict:
              text = content_id_dict[preowned_article.id]
              root_logger.info(
                  f"{filename}: {bcolors.OKBLUE} score {preowned_article.score}{bcolors.ENDC}: ({bcolors.UNDERLINE}{text[0:100]}{bcolors.ENDC})"
              )

from shared.AsyncThread import start_asyncio_in_new_thread

# Running the main function using asyncio
if __name__ == "__main__":
    async_thread = start_asyncio_in_new_thread()
    async_thread.run(main())

2、M3E、OpenAi查重相似度效果比较

利用M3E和OpenAi不同模型提取的特征生成向量后计算的相似度基本上一致，且M3E提取的特征对中文的支持更好，更细化，导致最终计算分值以后也更加直观，能够快速的验证定位出相似度界限，对于内容查重业务更加友好，且在成本和效率上更具有优势。

四、总结

经过实践，本次处理47万篇内容，经过多轮优化，最终达到向量生成、验证及插入在使用规格配置32c50g的机器同时启用三个线程派发任务，32个进程共享内存的情况下，可在5小时内完成的。相似度搜索及存储到mysql可在20分钟内完成30万数据的处理。

OpenAI在算法研究方面的创新推动了成本的降低。通过引入更高效的算法和模型架构，OpenAI能够在相同的计算资源下取得更好的性能。这意味着开发者可以更快地训练和部署模型，减少了算法开发的时间和成本。但介于目前的技术环境及规则限制，选择一些开源的像M3E之类的模型才是更贴近我们目前的业务需求和日常使用。

利用M3E模型提取的特征对中文的支持也挺好，也更加细化，尤其除了基本的服务器和开发成本外在不需要额外的支出，效率也可以通过并发和增加资源的手段优化，成本和效率方面具有明显优势。768纬度的向量和vearch结合的也更优越。

作者：京东健康刘继帅

来源：京东云开发者社区转载请注明来源

转载自:https://juejin.cn/post/7280463767691051064