PostgreSQL k-nearest-neighbor (KNN) search on a multidimensional cube

Published: 2020-12-13 18:07:17 | Category: Encyclopedia | Source: compiled from the web


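I have a cube with 8 dimensions and I want to do nearest-neighbor matching. I am completely new to PostgreSQL. I read that 9.1 supports nearest-neighbor matching on multiple dimensions. I would really appreciate it if someone could give a complete example: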

- How to create a table with the 8D cube?
- Sample insert
- Lookup: exact match
- Lookup: nearest-neighbor match

Sample data:

For simplicity, we can assume that all values are in the range 0-100.

Point 1: (1,1,1)

Point 2: (2,2,2)

Lookup value: (1,2)

This should match Point 1, not Point 2.

References:

What’s_new_in_PostgreSQL_9.1

https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search

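PostgreSQL supports the distance operator <->. As I understand it, it can be used for analyzing text (with the pg_trgm module) and for the geometry data types.

I don't know how to use it with more than one dimension. Maybe you will have to define your own distance function, or somehow convert your data into a single column of text or geometry type. For example, if you have a table with 8 columns (an 8-dimensional cube):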

c1 c2 c3 c4 c5 c6 c7 c8
 1  0  1  0  1  0  1  2

you can convert it to:

c1 c2 c3 c4 c5 c6 c7 c8
 a  b  a  b  a  b  a  c

and then to a table with a single column:

c1
abababac

Then you can use (after creating a GiST index):

SELECT c1,c1 <-> 'ababab'
 FROM test_trgm 
 ORDER BY c1 <-> 'ababab';
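
The query above presupposes the pg_trgm extension, a table named test_trgm, and a GiST index on its column. A minimal setup sketch might look like the following. The sample rows and the index name are hypothetical, chosen only for illustration.

```sql
-- Assumption: the pg_trgm contrib extension is available on the server.
create extension if not exists pg_trgm;

-- Hypothetical one-column table matching the query above.
create table test_trgm (c1 text);

insert into test_trgm (c1) values
   ('abababac'),
   ('ababab'),
   ('cccccccc');

-- text has no default GiST operator class, so the trigram
-- operator class gist_trgm_ops has to be named explicitly.
create index test_trgm_gist_index on test_trgm using gist (c1 gist_trgm_ops);
```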

Example

Create sample data

-- Create some temporary data
-- ! Note that tables are created in the tmp schema (change the SQL to your schema) and dropped if they exist !
drop table if exists tmp.test_data;

-- Random integer matrix, 100 rows x 8 columns, values 0-99
create table tmp.test_data as (
   select 
      trunc(random()*100)::int as input_variable_1,
      trunc(random()*100)::int as input_variable_2,
      trunc(random()*100)::int as input_variable_3,
      trunc(random()*100)::int as input_variable_4,
      trunc(random()*100)::int as input_variable_5,
      trunc(random()*100)::int as input_variable_6,
      trunc(random()*100)::int as input_variable_7,
      trunc(random()*100)::int as input_variable_8
   from 
      generate_series(1,100,1)
);

Transform the input data to text

drop table if exists tmp.test_data_trans;

create table tmp.test_data_trans as (
select 
   input_variable_1 || ';' ||
   input_variable_2 || ';' ||
   input_variable_3 || ';' ||
   input_variable_4 || ';' ||
   input_variable_5 || ';' ||
   input_variable_6 || ';' ||
   input_variable_7 || ';' ||
   input_variable_8 as trans_variable
from 
   tmp.test_data
);

This gives you a single variable, trans_variable, in which all 8 dimensions are stored:

trans_variable
40;88;68;29;19;54;40;90
80;49;56;57;42;36;50;68
29;13;63;33;0;18;52;77
44;68;18;81;28;24;20;89
80;62;20;49;4;87;54;18
35;37;32;25;8;13;42;54
8;58;3;42;37;1;41;49
70;1;28;18;47;78;8;17

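Instead of the || operator you can also use the following syntax (shorter, but more cryptic):

select 
   -- t.*::text renders the row as '(40,88,...)'; strip the parentheses, then swap ',' for ';'
   array_to_string(string_to_array(trim(both '()' from t.*::text),','),';') as trans_variable
from 
   tmp.test_data t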

Add an index

create extension if not exists pg_trgm;
create index test_data_gist_index on tmp.test_data_trans using gist(trans_variable gist_trgm_ops);

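Test distance
Note: I picked one row from the table (52;42;18;50;68;29;8;55) and tested the distance using a slightly changed value (42;42;18;52;98;29;8;55). Of course, the values in your test data will be completely different, because it is a RANDOM matrix.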

select 
   *,
   trans_variable <-> '42;42;18;52;98;29;8;55' as distance,
   similarity(trans_variable,'42;42;18;52;98;29;8;55') as similarity
from 
   tmp.test_data_trans 
order by
   trans_variable <-> '42;42;18;52;98;29;8;55';
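
For a k-nearest-neighbor lookup one would normally cap the result with LIMIT. With an ORDER BY on the <-> expression and a LIMIT, the planner can walk the GiST index in distance order instead of sorting every row, although on a 100-row table it may well choose a sequential scan anyway. A sketch, assuming k = 5 and the same lookup string as above:

```sql
-- Return only the 5 rows closest to the lookup string (k = 5 is an arbitrary choice).
select
   trans_variable,
   trans_variable <-> '42;42;18;52;98;29;8;55' as distance
from
   tmp.test_data_trans
order by
   trans_variable <-> '42;42;18;52;98;29;8;55'
limit 5;
```

Prefixing the statement with explain shows whether the index is actually used.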

You can use either the distance operator <-> or the similarity function. Distance = 1 - similarity.
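
Note that trigram distance compares character sequences, not numeric coordinates, so it may not rank rows by geometric closeness. As an alternative that is not part of the answer above, here is a minimal sketch using the contrib cube extension. It assumes PostgreSQL 9.6 or later, where cube provides a <-> Euclidean distance operator that a GiST index can serve. The table name tmp.points_8d, the index name, and the 8-dimensional sample values are hypothetical.

```sql
-- Assumption: PostgreSQL 9.6+ with the contrib "cube" extension available.
create extension if not exists cube;

-- Hypothetical table: one 8-dimensional point per row (tmp schema must exist).
drop table if exists tmp.points_8d;
create table tmp.points_8d (
   id serial primary key,
   p  cube
);

-- Sample insert: two points written in cube's text syntax.
insert into tmp.points_8d (p) values
   ('(1,1,1,1,1,1,1,1)'),
   ('(2,2,2,2,2,2,2,2)');

-- GiST index; it can serve ORDER BY p <-> ... LIMIT k queries.
create index points_8d_gist_index on tmp.points_8d using gist (p);

-- Lookup: exact match.
select id, p
from tmp.points_8d
where p = '(1,1,1,1,1,1,1,1)'::cube;

-- Lookup: nearest neighbor by Euclidean distance.
select id, p, p <-> '(1,1,1,1,1,1,1,2)'::cube as distance
from tmp.points_8d
order by p <-> '(1,1,1,1,1,1,1,2)'::cube
limit 1;
```

With these two rows, the lookup point differs from the first row in one coordinate (distance 1) and from the second row in seven coordinates (distance sqrt(7)), so the first row is returned.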

(Editor: 李大同)
