Hands-on | Importing LDBC Data and Practicing nGQL

pczhy
Published 2022-9-22 11:03

I recently imported the LDBC dataset into a single-node Nebula Graph cluster that I set up myself, and tried writing a few of the basic LDBC SNB queries (Short Reads) in nGQL.

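LDBC is an industry consortium dedicated to advancing graph data management. It has developed a series of benchmarks that make it easier for companies to compare graph database products. SNB (Social Network Benchmark) is a set of benchmarks built on social network scenarios and data; it currently consists of an Interactive workload and a BI workload.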

>>>>Data import

The Nebula bench repo (https://github.com/vesoft-inc/nebula-bench/) wraps the whole process of generating LDBC data and importing it into Nebula in Python; basically you just follow the steps in its documentation.

A few small pitfalls I ran into:

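  • The yaml generated by default after running python3 run.py importer sets the space's replica to 3, which does not work on my single-node cluster. You can either modify the Python code yourself, or do only a dry run and then run nebula importer yourself to import the data. I wisely chose the latter;
  • Some vertices of completely different types turned out to share the same vid, for example Person and Organisation, which feels a bit odd later when running nGQL. The Nebula bench documentation mentions this too: since it does not affect stress testing, it is left unhandled. Fair enough.

After a successful import, admire the result with SHOW STATS: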

(root@nebula) [ldbc1]> show stats;
+---------+------------------+----------+
| Type    | Name             | Count    |
+---------+------------------+----------+
| "Tag"   | "Comment"        | 2052169  |
+---------+------------------+----------+
| "Tag"   | "Forum"          | 90492    |
+---------+------------------+----------+
| "Tag"   | "Organisation"   | 7955     |
+---------+------------------+----------+
| "Tag"   | "Person"         | 9892     |
+---------+------------------+----------+
| "Tag"   | "Place"          | 1460     |
+---------+------------------+----------+
| "Tag"   | "Post"           | 1003605  |
+---------+------------------+----------+
| "Tag"   | "Tag"            | 16080    |
+---------+------------------+----------+
| "Tag"   | "Tagclass"       | 71       |
+---------+------------------+----------+
| "Edge"  | "CONTAINER_OF"   | 1003605  |
+---------+------------------+----------+
| "Edge"  | "HAS_CREATOR"    | 3055774  |
+---------+------------------+----------+
| "Edge"  | "HAS_INTEREST"   | 229166   |
+---------+------------------+----------+
| "Edge"  | "HAS_MEMBER"     | 1611869  |
+---------+------------------+----------+
| "Edge"  | "HAS_MODERATOR"  | 90492    |
+---------+------------------+----------+
| "Edge"  | "HAS_TAG"        | 3721409  |
+---------+------------------+----------+
| "Edge"  | "HAS_TYPE"       | 16080    |
+---------+------------------+----------+
| "Edge"  | "IS_LOCATED_IN"  | 3073620  |
+---------+------------------+----------+
| "Edge"  | "IS_PART_OF"     | 1454     |
+---------+------------------+----------+
| "Edge"  | "IS_SUBCLASS_OF" | 70       |
+---------+------------------+----------+
| "Edge"  | "KNOWS"          | 180623   |
+---------+------------------+----------+
| "Edge"  | "LIKES"          | 2190095  |
+---------+------------------+----------+
| "Edge"  | "REPLY_OF"       | 2052169  |
+---------+------------------+----------+
| "Edge"  | "STUDY_AT"       | 7949     |
+---------+------------------+----------+
| "Edge"  | "WORK_AT"        | 21654    |
+---------+------------------+----------+
| "Space" | "vertices"       | 3165488  |
+---------+------------------+----------+
| "Space" | "edges"          | 17256029 |
+---------+------------------+----------+
Got 25 rows (time spent 1344/16017 us)
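
The shared-vid observation can be cross-checked against the counts above with a little arithmetic. The sketch below (Python, numbers copied verbatim from the SHOW STATS output) sums the per-Tag and per-Edge rows and compares them with the two Space rows. The interpretation in the comments is an assumption on my part: that the vertices row counts distinct vids, while each Tag row counts every vertex carrying that tag.

```python
# Counts copied from the SHOW STATS output above.
tag_counts = {
    "Comment": 2052169, "Forum": 90492, "Organisation": 7955, "Person": 9892,
    "Place": 1460, "Post": 1003605, "Tag": 16080, "Tagclass": 71,
}
edge_counts = {
    "CONTAINER_OF": 1003605, "HAS_CREATOR": 3055774, "HAS_INTEREST": 229166,
    "HAS_MEMBER": 1611869, "HAS_MODERATOR": 90492, "HAS_TAG": 3721409,
    "HAS_TYPE": 16080, "IS_LOCATED_IN": 3073620, "IS_PART_OF": 1454,
    "IS_SUBCLASS_OF": 70, "KNOWS": 180623, "LIKES": 2190095,
    "REPLY_OF": 2052169, "STUDY_AT": 7949, "WORK_AT": 21654,
}
space_vertices = 3165488
space_edges = 17256029

tag_total = sum(tag_counts.values())
edge_total = sum(edge_counts.values())

# Assumption: a vid carrying two tags is counted once under "vertices"
# but once per tag in the Tag rows, so the surplus hints at shared vids.
surplus_tag_rows = tag_total - space_vertices

print("sum of Tag rows :", tag_total)
print("sum of Edge rows:", edge_total, "(Space edges:", space_edges, ")")
print("Tag rows minus Space vertices:", surplus_tag_rows)
```

The Edge rows add up exactly to the Space total, while the Tag rows exceed the vertex total by 16236, which is what one would expect if some vids carry more than one tag.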

>>>>nGQL queries

Below I try to tackle the more basic query scenarios in the LDBC SNB Interactive workload, the Short Reads. See the spec for the detailed requirements of each scenario.


Short Reads #1 - Profile of a person

match (v1:Person)-[:IS_LOCATED_IN]->(v2:Place) where id(v1)==$person_id
return v1.firstName, v1.lastName, v1.birthday, v1.locationIP, v1.browserUsed, id(v2), v1.gender, v1.creationDate


Short Reads #2 - Recent messages of a person

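Going from a comment to its post requires an unbounded number of hops, which Nebula does not support yet, so the only option is to specify a sufficiently large upper bound. I arbitrarily picked 5.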

match(p1:Person)<-[:HAS_CREATOR]-(m:`Comment`)-[:REPLY_OF*..5]->(p:Post)-[:HAS_CREATOR]->(p2:Person) 
where id(p1)==$person_id return id(m) as messageId, 
(case m.content is null when false then m.content when true then m.imageFile end) as content,
id(p),id(p2),p2.firstName,p2.lastName,
m.creationDate as creationDate order by creationDate desc, messageId desc limit 10;
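
To make the trade-off of a fixed hop bound concrete, here is a minimal Python sketch of the same idea on toy data: walk REPLY_OF edges upward for at most 5 hops and stop at the first Post. The function name, the dictionaries, and the ids are all hypothetical and only mimic the shape of the graph; this is not how Nebula evaluates the pattern.

```python
MAX_HOPS = 5  # same arbitrary bound as in the query above


def find_root_post(message_id, reply_of, posts, max_hops=MAX_HOPS):
    """Follow REPLY_OF edges for at most max_hops; return the Post id or None."""
    current = message_id
    for _ in range(max_hops):
        current = reply_of.get(current)
        if current is None:      # dangling chain: no parent message recorded
            return None
        if current in posts:     # reached the original Post
            return current
    return None                  # bound exhausted before reaching a Post


# Hypothetical toy data: p1 is a Post, c1..c6 form one reply chain under it.
posts = {"p1"}
reply_of = {"c1": "p1", "c2": "c1", "c3": "c2", "c4": "c3", "c5": "c4", "c6": "c5"}

print(find_root_post("c1", reply_of, posts))  # 1 hop away from p1
print(find_root_post("c5", reply_of, posts))  # exactly 5 hops away from p1
print(find_root_post("c6", reply_of, posts))  # 6 hops away: missed with bound 5
```

In this toy chain, c6 sits six hops below the post, so a bound of 5 silently drops it. Whether 5 is enough for a given LDBC scale factor depends on the deepest reply thread in the data, which I have not checked.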


Short Reads #3 - Friends of a person

match (p1:Person)-[k:KNOWS]-(p2:Person) where id(p1)==$person_id 
return id(p2) as friendId,p2.firstName,p2.lastName,k.creationDate as creationDate 
order by creationDate desc, friendId;


Short Reads #4 - Content of a message

Finally, no MATCH needed: this simple query can be handled directly with FETCH.

fetch prop on Post $message_id 
yield Post.creationDate, Post.content, Post.imageFile


Short Reads #5 - Creator of a message

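No MATCH needed here either; GO does the job. The 6605817 below is a literal message ID used in place of $message_id.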

go from 6605817 over HAS_CREATOR yield HAS_CREATOR._dst as personId, $$.Person.firstName, $$.Person.lastName;


Short Reads #6 - Forum of a message

GO again. Unbounded hops come up here as well, and GO does not support them either, so I set the maximum number of hops to 5.

go 0 to 5 steps from $message_id over REPLY_OF yield REPLY_OF._dst as postId 
| go from $-.postId over CONTAINER_OF REVERSELY yield CONTAINER_OF._dst as forumId, $$.Forum.title as title
| go from $-.forumId over HAS_MODERATOR yield $-.forumId, $-.title, HAS_MODERATOR._dst as moderatorId, $$.Person.firstName, $$.Person.lastName
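
The queries in this post use $person_id and $message_id as placeholders, while nGQL itself also uses $-, $$ and similar references. If the placeholders are filled in on the client side before a statement is sent, a naive string template can trip over those references. The sketch below is one possible way to do the substitution in Python; fill_placeholders is a hypothetical helper, and it assumes the ids are integers (as the sample ID 6605817 suggests).

```python
import re

# Matches $name, but not nGQL references such as $-.postId or $$.Forum.title,
# because those have "-" or "$" (not a letter/underscore) right after the "$".
_PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def fill_placeholders(query, params):
    """Replace $name with an integer value from params; leave unknown names alone."""
    def replace(match):
        name = match.group(1)
        if name not in params:
            return match.group(0)
        # int() rejects anything that is not a plain integer id (assumption:
        # the space uses integer vids).
        return str(int(params[name]))
    return _PLACEHOLDER.sub(replace, query)


query = (
    "go 0 to 5 steps from $message_id over REPLY_OF yield REPLY_OF._dst as postId "
    "| go from $-.postId over CONTAINER_OF REVERSELY "
    "yield CONTAINER_OF._dst as forumId, $$.Forum.title as title"
)

filled = fill_placeholders(query, {"message_id": 6605817})
print(filled)
```

Only $message_id is rewritten; $-.postId and $$.Forum.title pass through unchanged.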


Short Reads #7 - Replies of a message

As far as I can tell, this scenario needs openCypher's OPTIONAL MATCH, which Nebula does not support yet. Hopefully a later version will add it.
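
Until then, one conceivable workaround is to split the work into two ordinary queries and do the "optional" part on the client. The Python sketch below only illustrates that join step on made-up rows. It assumes, based on my reading of the spec, that the scenario wants each reply to a message plus a flag saying whether the reply's author knows the original message's author; the function name, row layout, and ids are hypothetical.

```python
def replies_with_knows_flag(replies, friends_of_author):
    """Emulate the optional part client-side.

    replies: rows of (comment_id, reply_author_id), e.g. from a query over REPLY_OF
    friends_of_author: set of person ids connected to the original author via KNOWS
    Every reply is kept; the flag is simply False when no KNOWS edge exists.
    """
    return [
        {
            "commentId": comment_id,
            "replyAuthorId": author_id,
            "knows": author_id in friends_of_author,
        }
        for comment_id, author_id in replies
    ]


# Hypothetical results of two separate queries.
replies = [(101, 1), (102, 2), (103, 3)]
friends_of_author = {2}

rows = replies_with_knows_flag(replies, friends_of_author)
for row in rows:
    print(row)
```

The point of the sketch is that no reply is dropped when the KNOWS edge is missing, which is the behaviour a plain MATCH with both patterns would not give.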

Last modified 2022-9-22 11:03:35