index partition

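Xin · March 14, 2021 22:22

Hi, in the index design for Google Search you mentioned that the index partition strategy is hybrid.
My understanding is that the index uses document partitioning and is kept in memory; whatever does not fit in memory is stored on disk, using term partitioning. Is that understanding correct?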

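logic · March 15, 2021 07:19

Basically correct, with one note.
The in-memory part uses a distributed cache, so once you use document partitioning there is no hard "does not fit" problem. The reason we don't keep all of the inverted index data in memory is that rarely used long-tail index entries gain little from being in memory and cost too much. That data can simply live on disk only.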
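
To make the two partitioning schemes concrete, here is a minimal sketch, assuming a toy corpus keyed by integer doc id and whitespace tokenization. The function names and the choice of `zlib.crc32` as a stable hash are illustrative assumptions, not something stated in the thread.

```python
import zlib


def stable_hash(key):
    # Stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8"))


def build_document_partitioned(docs, num_shards):
    """Each shard indexes only the documents it owns (hypothetical helper)."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]
        for term in set(text.split()):
            shard.setdefault(term, []).append(doc_id)
    return shards


def build_term_partitioned(docs, num_shards):
    """Each shard owns the full posting list of the terms routed to it."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        for term in set(text.split()):
            shard = shards[stable_hash(term) % num_shards]
            shard.setdefault(term, []).append(doc_id)
    return shards


docs = {1: "red fox", 2: "red dog", 3: "lazy dog"}
doc_shards = build_document_partitioned(docs, 2)
term_shards = build_term_partitioned(docs, 2)
```

With document partitioning a query for one term has to fan out to every shard, since any shard may hold matching documents. With term partitioning the query goes to the single shard that owns the term.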

m2d · November 24, 2021 07:37

I have a similar question.

If we don't keep all of the inverted index data in memory and a user searches for a long-tail term, how does the search system return the relevant documents?

Does it follow a flow like this:

The system receives a search request with a long-tail term.
The system fans the term out to all the distributed cache machines (document-based partition).
Because the term is not in the distributed cache machines, the system examines the long-tail term and finds out which disk machine to send the request to (according to the term-based partition).
The system finds the related docs. It does NOT update the content in the distributed cache.

In other words, for each request, in the worst case the search system has to query both the distributed cache and the disk cluster, right?

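logic · November 25, 2021 01:03

For uncommon terms, keeping them on disk is quite reasonable. The flow you describe is correct. One thing to note: the cache is used in two ways here, as a general purpose cache and as a query cache. The general purpose cache, which holds the important, frequently used terms, is not updated after a search completes. The query cache, which holds recently accessed entries, can be updated.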
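
The flow and the two cache roles above can be sketched as follows. This is only a toy model under stated assumptions: the class and method names are hypothetical, the query cache is modeled as a small LRU, the disk tier is assumed to hold every term, and checking the query cache before the fan-out is one possible ordering that the thread does not specify.

```python
from collections import OrderedDict
import zlib


class HybridIndex:
    """Toy model of the hybrid layout (all names are hypothetical)."""

    def __init__(self, num_mem_shards, num_disk_shards, query_cache_size=2):
        # General purpose cache: document-partitioned, holds hot terms only.
        self.mem_shards = [dict() for _ in range(num_mem_shards)]
        # Disk tier: term-partitioned, assumed here to hold every term.
        self.disk_shards = [dict() for _ in range(num_disk_shards)]
        # Query cache: recently requested terms, updated after disk lookups.
        self.query_cache = OrderedDict()
        self.query_cache_size = query_cache_size

    def _disk_shard(self, term):
        # Stable hash so a term always routes to the same disk shard.
        return zlib.crc32(term.encode("utf-8")) % len(self.disk_shards)

    def add(self, term, doc_id, hot):
        disk = self.disk_shards[self._disk_shard(term)]
        disk.setdefault(term, []).append(doc_id)
        if hot:
            mem = self.mem_shards[doc_id % len(self.mem_shards)]
            mem.setdefault(term, []).append(doc_id)

    def search(self, term):
        # 1. Query cache hit: refresh its LRU position and return.
        if term in self.query_cache:
            self.query_cache.move_to_end(term)
            return self.query_cache[term], "query_cache"
        # 2. Fan out to every document-partitioned memory shard.
        hits = []
        for shard in self.mem_shards:
            hits.extend(shard.get(term, []))
        if hits:
            return sorted(hits), "memory"
        # 3. Miss: route to the single disk shard that owns the term.
        docs = sorted(self.disk_shards[self._disk_shard(term)].get(term, []))
        # 4. Update only the query cache; the general purpose cache is untouched.
        self.query_cache[term] = docs
        if len(self.query_cache) > self.query_cache_size:
            self.query_cache.popitem(last=False)  # evict least recently used
        return docs, "disk"


idx = HybridIndex(num_mem_shards=2, num_disk_shards=3)
idx.add("cat", 1, hot=True)
idx.add("cat", 2, hot=True)
idx.add("axolotl", 7, hot=False)
first = idx.search("axolotl")   # served from disk
second = idx.search("axolotl")  # served from the query cache
```

The worst case is exactly the one described above: a long-tail term costs one fan-out across the memory shards plus one request to a disk shard, and only a repeat of the same query is absorbed by the query cache.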

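chun · May 17, 2022 04:59

How do we know which terms are uncommon? Does it depend on whether anyone searches for them? If so, do we also need a job that moves uncommon terms from memory to disk?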

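logic · May 20, 2022 04:27

It depends on search traffic. The disk holds all the data, while memory holds only the common terms. No job is needed to move data from memory to disk.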
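
A minimal sketch of why no memory-to-disk job is needed: if the disk copy is complete, the memory tier is just a subset copied from it. The thread does not say how the common terms are actually chosen, so ranking by frequency in a query log and the `memory_budget` parameter are assumptions made purely for illustration.

```python
from collections import Counter


def pick_hot_terms(query_log, memory_budget):
    """Return the most frequently searched terms (hypothetical policy)."""
    counts = Counter(query_log)
    return {term for term, _ in counts.most_common(memory_budget)}


def build_memory_tier(disk_index, hot_terms):
    # Memory is a copy of a subset of the disk index; nothing is moved
    # off disk, and dropping a term from memory loses no data.
    return {t: disk_index[t] for t in hot_terms if t in disk_index}


disk_index = {"cat": [1, 2], "dog": [2, 3], "axolotl": [7]}
query_log = ["cat", "dog", "cat", "axolotl", "dog", "cat"]

hot = pick_hot_terms(query_log, memory_budget=2)
memory_index = build_memory_tier(disk_index, hot)
```

When a term stops being popular, it is simply dropped from the memory tier on the next rebuild; the disk index still has its full posting list.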

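chun · May 20, 2022 16:55

That differs a bit from my understanding; from the discussion above it also sounded like the disk stores only the long-tail terms. But having the disk store everything seems fine too, so let's go with that!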