Editorial for JOI '16 Open P2 - Selling RNA Strands

Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

Subtask 1 [10 points]

$N$, $M$, $|S_i|$, $|P_j|$, $|Q_j|$ are small. We can solve this subtask by checking all RNA sequences for each order.
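
As a rough illustration of this brute force, here is a minimal Python sketch. The function name `count_matches_bruteforce` and the representation of an order as a pair `(p, q)` are assumptions made for the example, not part of the problem statement.

```python
def count_matches_bruteforce(strands, orders):
    """For each order (p, q), count the strands that start with p and end with q."""
    answers = []
    for p, q in orders:
        # Check every strand against this order directly.
        answers.append(sum(1 for s in strands if s.startswith(p) and s.endswith(q)))
    return answers


strands = ["AUGC", "AGC", "AUC", "GC"]
orders = [("A", "C"), ("AU", "GC"), ("G", "G")]
print(count_matches_bruteforce(strands, orders))  # [3, 1, 0]
```

Each order scans every strand, so the running time is roughly proportional to $M$ times the total length of the strands, which is only acceptable for the small limits of this subtask.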

Subtask 2 [25 points]

Let's consider a simpler version of this problem:

There are $N$ RNA sequences $S_1, \dots, S_N$ and $M$ orders $P_1, \dots, P_M$. For each $j$ $(1 \le j \le M)$, compute the number of $i$ such that the first $|P_j|$ characters of $S_i$ are $P_j$.

We can store all of $S_1, \dots, S_N$ in a trie. From now on we identify a string $X$ with the node of the trie which corresponds to $X$. For any two strings $X, Y$, the following are equivalent:

- $X$ is a prefix of $Y$, that is, the first $|X|$ characters of $Y$ are $X$.
- Node $X$ is an ancestor of node $Y$.

Let $C(S)$ be the number of occurrences of $S$ in $S_1, \dots, S_N$. Then the problem is to compute $\sum_{X:\text{ descendant of }P_j} C(X)$ for each $j$, where $P_j$ counts as a descendant of itself.

This can be done by precalculation with DFS, but here we use a different approach. Using the Euler-Tour technique, checking whether node $x$ is an ancestor of node $y$ can be reduced to checking whether a point (corresponding to $y$) is included in a range (corresponding to $x$). Now we get an algorithm for this problem:

1. Store $S_1, \dots, S_N, P_1, \dots, P_M$ in a trie.
2. Using the Euler-Tour technique, compute the range and the point for nodes in the trie. Note that we need not compute the values for nodes which don't correspond to any of $S_1, \dots, S_N, P_1, \dots, P_M$.
3. For each $j$, count the number of $i$ such that the point corresponding to node $S_i$ is included in the range corresponding to node $P_j$.
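
The steps for the simplified problem could be sketched as follows. This is only an illustration under some assumptions: the name `prefix_counts` is made up for the example, the trie is a list of dictionaries, every node is numbered (not only the relevant ones), and the final counting is done naively.

```python
def prefix_counts(strands, prefixes):
    """For each prefix, count the strands that start with it (trie + Euler tour)."""
    children = [{}]  # children[v] maps a character to a child node id; node 0 is the root

    def insert(word):
        v = 0
        for ch in word:
            if ch not in children[v]:
                children[v][ch] = len(children)
                children.append({})
            v = children[v][ch]
        return v

    strand_nodes = [insert(s) for s in strands]
    prefix_nodes = [insert(p) for p in prefixes]

    # Iterative DFS. tin[v] is the entry time of v (its "point");
    # [tin[v], tout[v]] is the range of entry times inside v's subtree.
    tin = [0] * len(children)
    tout = [0] * len(children)
    timer = 0
    stack = [(0, False)]
    while stack:
        v, leaving = stack.pop()
        if leaving:
            tout[v] = timer - 1
            continue
        tin[v] = timer
        timer += 1
        stack.append((v, True))
        for child in children[v].values():
            stack.append((child, False))

    # Node p is an ancestor of node s exactly when tin[s] lies in [tin[p], tout[p]].
    return [
        sum(1 for s in strand_nodes if tin[p] <= tin[s] <= tout[p])
        for p in prefix_nodes
    ]


print(prefix_counts(["AUGC", "AGC", "AUC", "GC"], ["A", "AU", "G", "C"]))  # [3, 2, 1, 0]
```

The naive count in the last step is quadratic; it is kept only to make the point-in-range test visible.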

Let's go back to the original problem. Now we have a constraint on $Q_j$ as well as on $P_j$. We saw that handling $P_j$ can be done with a trie and the Euler-Tour technique. It can be easily checked that the same approach can be used for handling $Q_j$ if we consider $S_1^R, \dots, S_N^R, Q_1^R, \dots, Q_M^R$ (where $S^R$ is the reverse string of $S$).

Finally, we get an algorithm as follows:

1. Store $S_1, \dots, S_N, P_1, \dots, P_M$ in a trie.
2. Using the Euler-Tour technique, compute the range and the point for nodes in the trie.
3. Do the same for $S_1^R, \dots, S_N^R, Q_1^R, \dots, Q_M^R$.
4. For each $j$, count the $i$ whose point in the first trie lies in the range of $P_j$ and whose point in the second trie lies in the range of $Q_j^R$.
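
Putting the two tries together might look like the sketch below. The helper `euler_intervals` and the function `solve_subtask2` are hypothetical names, orders are assumed to be given as pairs `(p, q)`, and `strands` is assumed to be a list.

```python
def euler_intervals(words):
    """Insert all words into one trie; return (tin, tout) of each word's node."""
    children = [{}]
    nodes = []
    for word in words:
        v = 0
        for ch in word:
            if ch not in children[v]:
                children[v][ch] = len(children)
                children.append({})
            v = children[v][ch]
        nodes.append(v)

    tin = [0] * len(children)
    tout = [0] * len(children)
    timer = 0
    stack = [(0, False)]
    while stack:
        v, leaving = stack.pop()
        if leaving:
            tout[v] = timer - 1  # last entry time inside v's subtree
        else:
            tin[v] = timer
            timer += 1
            stack.append((v, True))
            stack.extend((c, False) for c in children[v].values())
    return [(tin[v], tout[v]) for v in nodes]


def solve_subtask2(strands, orders):
    n = len(strands)
    # First trie: strands and prefixes. Second trie: reversed strands and reversed suffixes.
    fwd = euler_intervals(strands + [p for p, _ in orders])
    bwd = euler_intervals([s[::-1] for s in strands] + [q[::-1] for _, q in orders])

    answers = []
    for j in range(len(orders)):
        a, b = fwd[n + j]  # range of P_j
        c, d = bwd[n + j]  # range of reversed Q_j
        answers.append(sum(1 for i in range(n)
                           if a <= fwd[i][0] <= b and c <= bwd[i][0] <= d))
    return answers


print(solve_subtask2(["AUGC", "AGC", "AUC", "GC"],
                     [("A", "C"), ("AU", "GC"), ("G", "G")]))  # [3, 1, 0]
```

Each order still inspects all $N$ strands, so the counting part costs $\mathcal O(NM)$ on top of building the tries.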

Subtask 3 [25 points]

Let $\Sigma_S$, $\Sigma_P$, $\Sigma_Q$ be $|S_1| + \dots + |S_N|$, $|P_1| + \dots + |P_M|$, $|Q_1| + \dots + |Q_M|$, respectively.

We saw that this problem becomes relatively easy if we can ignore $P$ or $Q$. Now consider the following algorithm:

For each $j$, extract only RNA sequences whose last $|Q_j|$ characters are $Q_j$. Then compute the number of RNA sequences in the extracted ones whose first $|P_j|$ characters are $P_j$.

The latter part of the algorithm can be done by using a trie. The former part can also be done with a trie. This algorithm is not very efficient, because all of the RNA sequences may be extracted, and thus stored in a trie, for each $j$. In this case, it runs in $\mathcal O(M \Sigma_S)$. But we can add a (seemingly small) optimization:

Extract RNA sequences only once for the same $Q$.

In fact, it significantly reduces the complexity to $\mathcal O(\Sigma_S \sqrt{\Sigma_Q} + \Sigma_P + \Sigma_Q)$.

Here is a proof. Fix $i$. $S_i$ is extracted for $Q_j$ only if $Q_j$ is a suffix of $S_i$ (that is, the last $|Q_j|$ characters of $S_i$ are $Q_j$). Let $L$ be the set of distinct $Q_j$'s that are suffixes of $S_i$. Two different suffixes of the same string have different lengths, so the strings in $L$ have pairwise different lengths and $\sum_{X \in L} |X| \ge 1 + \dots + \#L = \frac{(\#L+1)\#L}{2}$. Also $\sum_{X \in L} |X| \le \Sigma_Q$ holds, so we have $\frac{(\#L+1)\#L}{2} \le \Sigma_Q$. Thus $\#L = \mathcal O(\sqrt{\Sigma_Q})$. This means that each $S_i$ is added to a trie $\mathcal O(\sqrt{\Sigma_Q})$ times. Adding $S_i$ to a trie once takes $\mathcal O(|S_i|)$, so adding strings to tries takes $\mathcal O(\Sigma_S \sqrt{\Sigma_Q})$ overall. Once the tries are built, the answer for order $j$ is obtained by walking to node $P_j$ of the trie for $Q_j$, which takes $\mathcal O(\Sigma_P)$ in total. Also, checking all of $Q_1, \dots, Q_M$ itself takes $\mathcal O(\Sigma_Q)$ time.

Altogether, this proves that the algorithm runs in $\mathcal O(\Sigma_S \sqrt{\Sigma_Q} + \Sigma_P + \Sigma_Q)$.
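
One possible way to realise this grouping idea is sketched below. The function name `solve_by_suffix_groups` is hypothetical, and the sketch assumes that every $P_j$ and $Q_j$ is non-empty. For brevity it keeps all per-$Q$ tries in memory at once, which a careful implementation would likely avoid by handling one distinct $Q$ at a time.

```python
from collections import defaultdict


def solve_by_suffix_groups(strands, orders):
    """Build one counting trie per distinct Q, then answer each order by walking P."""
    # Trie over the reversed Q strings; end_of[v] is the Q that ends at node v, if any.
    children = [{}]
    end_of = [None]
    for _, q in orders:
        v = 0
        for ch in reversed(q):
            if ch not in children[v]:
                children[v][ch] = len(children)
                children.append({})
                end_of.append(None)
            v = children[v][ch]
        end_of[v] = q

    # Extraction: walk each reversed strand once to find every distinct Q that is its suffix.
    extracted = defaultdict(list)
    for s in strands:
        v = 0
        for ch in reversed(s):
            v = children[v].get(ch)
            if v is None:
                break
            if end_of[v] is not None:
                extracted[end_of[v]].append(s)

    # One trie per distinct Q; passing[v] counts extracted strands passing through node v.
    counters = {}
    for q, group in extracted.items():
        trie = [{}]
        passing = [len(group)]
        for s in group:
            v = 0
            for ch in s:
                if ch not in trie[v]:
                    trie[v][ch] = len(trie)
                    trie.append({})
                    passing.append(0)
                v = trie[v][ch]
                passing[v] += 1
        counters[q] = (trie, passing)

    answers = []
    for p, q in orders:
        if q not in counters:
            answers.append(0)  # no strand ends with q
            continue
        trie, passing = counters[q]
        v = 0
        for ch in p:
            v = trie[v].get(ch)
            if v is None:
                break
        answers.append(0 if v is None else passing[v])
    return answers


print(solve_by_suffix_groups(["AUGC", "AGC", "AUC", "GC"],
                             [("A", "C"), ("AU", "GC"), ("G", "G")]))  # [3, 1, 0]
```

Because the extraction is keyed by the string $Q$ itself, two orders with the same $Q$ share one trie, which is exactly the optimization the proof relies on.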

Subtask 4 [40 points]

The approach to the solution for Subtask 2 is reducing the problem to the following:

There are $N$ points $(X_i, Y_i)$ and $M$ rectangles $[A_j, B_j] \times [C_j, D_j]$ (whose sides are parallel to the $X$ or $Y$ axis) on a 2D plane. For each rectangle, compute the number of points which are included in it.

If we ignore irrelevant nodes of tries, the coordinates of the points and the rectangles will be integers between $0$ and $N+M$.

Let $K(A, B, C, D)$ be the number of $i$ such that $A \le X_i \le B$ and $C \le Y_i \le D$. As the coordinates of the points are not less than $0$, we have $K(A, B, C, D) = K(A, B, -1, D) - K(A, B, -1, C-1)$. So we just need to compute $K(A_k, B_k, -1, D_k)$ for some queries $(A_k, B_k, D_k)$ efficiently. This can be done by the following algorithm:

1. Sort the points and the queries by $Y_i$ (for points) or $D_k$ (for queries). If a point and a query have the same value, make the point appear earlier in the sequence.
2. Initialize an array $V_0, \dots, V_{N+M}$ with $0$.
3. Scan the sorted sequence from the beginning and do the following:
   - For a point $(X, Y)$, set $V_X \gets V_X+1$.
   - For a query $(A, B, -1, D)$, answer $V_A + V_{A+1} + \dots + V_B$.

This algorithm needs point increments and range sums on $V$. A Binary Indexed Tree supports both in $\mathcal O(\log(N+M))$ time per operation.
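
The sweep with a Binary Indexed Tree could be sketched as below. The function name `count_points_in_rectangles`, the tuple layouts, and the `max_coord` parameter are assumptions for this example; it takes the points and rectangles as already computed from the two Euler tours.

```python
def count_points_in_rectangles(points, rectangles, max_coord):
    """points: (x, y); rectangles: (a, b, c, d) meaning a <= x <= b and c <= y <= d.
    All coordinates are assumed to be integers in 0..max_coord."""
    tree = [0] * (max_coord + 2)  # 1-indexed Fenwick tree over x = 0..max_coord

    def add(x):
        i = x + 1
        while i < len(tree):
            tree[i] += 1
            i += i & -i

    def prefix(x):
        """Number of inserted points with X <= x (0 when x == -1)."""
        i = x + 1
        total = 0
        while i > 0:
            total += tree[i]
            i -= i & -i
        return total

    # Events sorted by y; kind 0 (point) sorts before kind 1 (query) on equal y.
    events = [(y, 0, x, 0, 0, 0) for x, y in points]
    for j, (a, b, c, d) in enumerate(rectangles):
        events.append((d, 1, a, b, j, +1))      # + K(a, b, -1, d)
        events.append((c - 1, 1, a, b, j, -1))  # - K(a, b, -1, c - 1)
    events.sort()

    answers = [0] * len(rectangles)
    for _, kind, a, b, j, sign in events:
        if kind == 0:
            add(a)  # for a point event, the third field holds its x coordinate
        else:
            answers[j] += sign * (prefix(b) - prefix(a - 1))
    return answers


points = [(1, 1), (2, 3), (3, 2), (0, 0)]
rectangles = [(0, 3, 0, 3), (1, 2, 1, 3), (2, 3, 0, 2), (0, 0, 1, 1)]
print(count_points_in_rectangles(points, rectangles, 3))  # [4, 2, 1, 0]
```

Each rectangle contributes two queries, so the sweep handles $N + 2M$ events in total.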
