Detailed Account of Complexity for Implementation of Circuit-Based Quantum Algorithms

In this review article, we analyze in detail the time and space complexity that arises from implementing a quantum algorithm on quantum hardware. In particular, some steps of the implementation, such as the preparation of an arbitrary superposition state and the readout of the final state, can in most cases surpass the complexity of the algorithm itself. We present the complexity involved in the full implementation of circuit-based quantum algorithms, from state preparation to the number of measurements needed to obtain good statistics from the final states of the quantum system, in order to assess the overall space and time costs of the processes.


INTRODUCTION
Quantum computing takes advantage of the unique properties of quantum mechanics, such as superposition and entanglement, to carry out computational tasks in ways that differ from those of classical computers [1]. Since Richard Feynman's proposal in the early 1980s that a quantum architecture would be a proper way to simulate actual quantum systems that occur in nature [2], much attention has been given to the application of quantum systems for computational tasks. Among the greatest and most famous achievements of quantum information and quantum computation, one can cite superdense coding [3], the BB84 protocol for quantum key distribution in cryptographic systems [4], Shor's integer factoring algorithm [5], and Grover's database search algorithm [6], alongside examples of no less importance or relevance. The advances have also reached important areas of mathematics and the natural sciences in general, with quantum algorithms and circuit designs being developed to accomplish linear algebra tasks such as eigen- [7,8] and singular-value [9,10] decompositions of matrices, finding solutions to linear systems of equations [11], and solving linear [12-14] and nonlinear [15] differential equations and partial non-homogeneous linear differential equations [16], among other potential applications.
There has been recent progress in the current era of Noisy Intermediate-Scale Quantum (NISQ) devices, such as problems that cannot be solved by any classical shallow circuit in reasonable time but turn out to be solvable by shallow quantum circuits [17], quantum supremacy achieved by the Google team using a superconducting quantum processor architecture [18], quantum advantage over classical computation using boson sampling [19], and the simulation of quantum systems by means of quantum-based architectures in D-Wave systems [20].
In general, the implementation of a quantum algorithm is based on many steps, that involve data pre-processing, preparation of input quantum states, the processing of the input information through quantum gates and operations applied to the system, measurement of the final state of the composite quantum system, and post-processing of the data collected by the measurement process. In the present work, we will not deal with the pre-and post-processing steps, which are usually done by classical means. In most quantum algorithms, the quantum advantage over classical computation lies in the processing or evolution step, which takes advantage of the dimension of the Hilbert space of quantum systems and quantum parallelism to manipulate very large amounts of data, a task for which the present classical computers usually require exponential scaling resources, such as memory and state-of-the-art processors in supercomputer units. However, the preparation and measurement processes present in some quantum algorithms, which are essential for their proper implementations, are often neglected in their presentations, because of the intrinsic difficulties of these tasks.
The main purpose of this work is to perform a detailed analysis of the computational complexity of quantum algorithms, defined by their space and time costs, considering all steps from state preparation to readout. This work considers a scenario in which the rapid development of quantum computing has attracted the attention of people with different backgrounds, not only physicists and computer scientists from academia but also curious members of the public, investors, bankers, and entrepreneurs, who are delighted with the quantum speedups at first sight. Although quantum computing provides amazing results compared to its classical counterpart, a suitable interpretation of the algorithmic costs demands a proper analysis, which includes the circuit width, represented by the number of qubits necessary to carry out the tasks, as well as the circuit depth, which takes into account the number of quantum operations that must be implemented on the system for the proper processing of the information encoded in the qubit system. We are also concerned with the processes of recovering the information that results from the processing, which can be represented by observable statistics or quantum tomography, depending on the task targeted by the quantum algorithm.
This work is organized as follows. In section 2 the costs of state preparation using different schemes are covered. Section 3 covers matrix and quantum gate decomposition and their complexity bounds. Section 4 considers quantum state tomography, with emphasis on the required number of measurements and repetitions of the execution of a quantum algorithm to achieve a desired accuracy in the results. In section 5, the overall complexity aspects for implementation are given, from state preparation to readout process. Finally, section 6 contains the conclusion of the work.

COMPLEXITY OF QUANTUM STATE PREPARATION
The need for preparation of quantum states as input for solving a given problem is a common task in many quantum algorithms implemented in the circuit model of Quantum Computation (QC) [1]. Such a preparation constitutes an important part in the process of implementation of a given algorithm for circuit gate-based quantum computing, as the final quantum state encoding the solution of the problem is directly linked to the input state through the evolution step. Thus, the complexity aspects of preparing the input state must be taken into account in a detailed resource analysis.
To describe the encoding of input states properly, we must split the entire quantum system that constitutes a quantum computer into two parts: the ancilla qubits, which are used, for instance, to encode relevant information and control logical operations, and the work system, which encodes the initial conditions of the problem to be solved and is submitted to the evolution process defined by the quantum algorithm. For instance, consider the processes to encode the initial conditions for a linear differential equation [14] or for the HHL quantum linear problem [11] in the work system. The goal of state preparation is to initialize the system in a specific N-dimensional quantum superposition that is suitable to the problem to be solved on a quantum computer. This task is often accomplished by subroutines that, in quantum algorithms, are usually referred to as system encoding.
It is important to remark that there are different kinds of encoding, such as basis encoding and amplitude encoding: the former is often used when one needs to manipulate real numbers arithmetically, and the latter when one takes advantage of the large size of the Hilbert space to encode data as probability amplitudes [21]. As an example of basis encoding, let us see how a real number is encoded in a binary string. Suppose we must represent the real-valued vector $x = (-0.3, 0.6)$. The first digit of the binary string encodes the sign of the number, with 0 standing for "+" and 1 for "−". The floating point is located immediately to the right of the sign bit. This leads to the state vector $|x\rangle = |10100\,01001\rangle$ in basis encoding¹. Note that this representation is approximate, subject to an error ε that depends on the number of precision qubits employed. The exact representation of a decimal number in the binary basis would require more or fewer bits, according to the number to be represented. In general, assuming that the composite system starts from the configuration $|0\rangle^{\otimes n}$, those circuits present depth 1, as only one NOT operation may be executed on each qubit in parallel, depending on the binary representation that must be encoded. Examples of circuits for basis encoding are presented in detail in Ref. [22]. Basis encoded states can be used, for instance, to solve prime factorization problems [23], in machine learning techniques [24], and to encode the solution of the computation by quantum annealers [25].
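The basis-encoding example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `basis_encode` and `x_gate_positions` are hypothetical, the choice of four fractional bits and of truncation (rather than rounding) is an assumption made so that the output matches the bit strings quoted in the text, and only values with |x| < 1 are handled.

```python
def basis_encode(x, frac_bits=4):
    """Return a sign bit followed by `frac_bits` fractional bits of |x|.

    Assumption: the binary fraction is truncated, not rounded.
    """
    if not -1.0 < x < 1.0:
        raise ValueError("this sketch only handles |x| < 1")
    sign = "1" if x < 0 else "0"          # 0 stands for "+", 1 for "-"
    scaled = int(abs(x) * 2**frac_bits)   # truncate toward zero
    return sign + format(scaled, f"0{frac_bits}b")


def x_gate_positions(bits):
    """Qubits that need a NOT gate when starting from |0...0>."""
    return [i for i, b in enumerate(bits) if b == "1"]


register = "".join(basis_encode(v) for v in (-0.3, 0.6))
print(register)                    # 1010001001
print(x_gate_positions(register))  # [0, 2, 6, 9]
```

All NOT gates act on different qubits, so they can be applied in a single layer, which is the depth-1 property mentioned above.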
For amplitude encoding, the relevant information for computation is stored in the probability amplitudes of the quantum state. The process usually starts from the n-qubit state $|0\rangle^{\otimes n}$, which is submitted to a transformation like
$$|0\rangle^{\otimes n} \rightarrow |\psi\rangle = \sum_{i=0}^{N-1} c_i |i\rangle, \tag{1}$$
with $\sum_{i=0}^{N-1} |c_i|^2 = 1$, and each $|i\rangle$ corresponding to a given state vector of the N-dimensional computational basis, with $N = 2^n$. To address this task, one must be capable of preparing such a superposition while preserving its coherence properties. The costs of preparing such input states have been discussed in the literature [26-29]. The generic superposition can be prepared from the state $|0\rangle^{\otimes n}$ by the implementation of quantum gates that act directly upon the system to be prepared. These operations, and consequently the cost of the procedure that aims to prepare a pure state, are defined by the free parameters contained within $|\psi\rangle$; that is, a transformation $|\psi\rangle = U|0\rangle^{\otimes n}$ could be implemented with $O(\tilde{N})$ gates [21,30], where $\tilde{N}$ corresponds to the number of free parameters. Since $\tilde{N}$ can be less than the total dimension N of the system, preparing these bounded states can be cheaper than preparing the full upper bound case. Notice that, in the upper bound case, where $|\psi\rangle$ has $2^n$ free parameters, $\tilde{N} = N = 2^n$. This is often the case with general systems of differential or linear equations, where the degrees of freedom of the quantum state must encode the initial values of the variables within the problem. Nevertheless, there are cases where the state vector defining the initial conditions for a system to be evolved or simulated is given by a sparse vector or by specially bounded initial conditions.
¹ A real number $x \in [0, 2)$ can be represented in binary basis as x [...]
For instance, one can consider the study of the behavior and properties of spin chains [31], where often each site of the chain starts from a ground state configuration or with a few qubits representing the excited states of spins in the chain. This procedure of initialization has the advantages of being based on operations that act directly on the work qubits, without the presence of any ancilla systems which would increase the circuit width, whose operations are entirely defined by the free parameters of the initial state |ψ〉. On the other hand, it requires a number of quantum gates which grows with the number of free parameters. Although these gates can be executed in parallel, in each qubit, this scheme is better implemented when the initial conditions encoded in |ψ〉 are given by sparse configurations or specially bounded vectors.
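To make the bookkeeping of amplitude encoding concrete, the sketch below normalizes a classical vector into amplitudes $c_i$ and reports the number of qubits required. It is only illustrative: `amplitude_encode` is a hypothetical helper, padding with zeros up to the next power of two is an assumption, and the count of nonzero amplitudes is used as a rough proxy for the number of free parameters, not as the gate count of any particular preparation circuit.

```python
import math

import numpy as np


def amplitude_encode(values):
    """Pad to a power of two and normalize so that sum_i |c_i|^2 = 1."""
    v = np.asarray(values, dtype=complex)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    n_qubits = max(1, math.ceil(math.log2(len(v))))
    amplitudes = np.zeros(2**n_qubits, dtype=complex)
    amplitudes[: len(v)] = v / norm
    return amplitudes, n_qubits


# A sparse six-component vector fits in n = 3 qubits (N = 8).
amps, n = amplitude_encode([3, 0, 0, 4, 0, 0])
nonzero = int(np.count_nonzero(amps))  # proxy for the free parameters
print(n, len(amps), nonzero)           # 3 8 2
```

For a sparse input such as this one, only two of the eight amplitudes are nonzero, which is the situation in which direct preparation on the work qubits is expected to be cheapest.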
The state initialization can follow the procedure described in detail in [26], which makes use of standard single-qubit and controlled$^k$ operations, the latter being operations controlled by k qubits and acting on a single target. This method requires $O(N \log_2^2(N))$ single- and two-qubit operations in total for executing a transformation like Eq. 1 without the introduction of additional quantum bits. One should also take notice of the presence of controlled$^k$ operations, which can be further decomposed into $O(k^2)$ single- and two-qubit quantum gates [32]. The particular structure of these controlled operations increases the depth of its action throughout the components of the quantum system [26]. Soklakov and Schack presented a quantum algorithm [33] to prepare an arbitrary quantum register based on Grover's search algorithm, requiring resources that are polynomial in the number of qubits and additional gate operations.
As an example of state preparation, the Divide-and-Conquer scheme [34] presents an algorithm for amplitude encoding in the form of a superposition in which the qubits of the work and ancilla systems are entangled. So, although the system is prepared in a superposition state, observing the ancilla qubits leaves the work system in a mixed state, which could be a disadvantage in the case of algorithms for solving systems of linear or differential equations. Nevertheless, the algorithm is useful for machine learning, statistical analysis, and other applications such as data sorting [34]. The algorithm follows the idea of dividing a problem into subproblems of the same class, and the quantum superposition is created by dividing the problem as in the scheme presented in Figure 1. The algorithm is based on the circuit model for quantum computing, is presented in detail in [34], and has space and time costs that scale as $O(N)$ and $O(\log_2^2 N)$, respectively. The circuit for implementation of the Divide-and-Conquer algorithm for state preparation has polylogarithmic depth and a simplified structure, with the tasks divided into problems of the same class. It also has the advantage of being based on the circuit model of computation, which makes it simple to use as a subroutine of the main algorithm just by including the corresponding circuit in the state preparation step. However, this polylogarithmic depth comes at the cost of increasing the circuit width, as ancilla qubits are necessary to carry out its implementation. Thus, one can observe a trade-off between gate count and number of qubits playing a significant role in this scheme.
Another state preparation scheme usually mentioned in quantum algorithms involves accessing a quantum database in which the quantum states are prepared in advance and can be quickly transferred to the working qubits. Below we describe this scheme in more detail, paying special attention to its complexity.
FIGURE 1 | Schematic representation of the Divide-and-Conquer algorithm for loading a four-dimensional vector x into a quantum state. The task of preparing $|x\rangle$ is divided into subtasks and can be represented as the logic tree shown above. Adapted from [34].

Quantum Database and Quantum Random Access Memory
Employing calls to Random Access Memory (RAM) devices is an approach that aims to accomplish the preparation of quantum states by querying a database that contains the information of interest. To query a memory device holding relevant information about the input state, one must be able to construct a database, which consists of a set of state vectors containing the information for quantum computation. For instance, suppose a set of m vectors $S = \{\psi_1, \psi_2, \ldots, \psi_m\}$, each of them containing k components. The quantum equivalent of this database is the quantum associative memory representation [35], given by the uniform superposition of the state vectors [21],
$$|S\rangle = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} |\psi_i\rangle.$$
The cost for the creation of $|S\rangle$ scales as $O(mk)$ [21,35]. Assuming that each $|\psi_i\rangle$ can be considered as a qubit system with dimension $k = N = 2^n$, this would require $O(mN)$ steps, which grows linearly (quadratically) with N in the best (worst) case. Grover's quantum search algorithm is often used as a subroutine for querying databases, with a complexity of $O(\sqrt{m}\,\log_2(m))$ steps, while preparing and processing the results of the query process would take $\Omega(m \log_2(N))$ steps [6].
There are other architectures for the implementation of quantum random access memory, such as the "Bucket Brigade" (BB) [36] and the Flip-Flop qRAM [37], which make use of different schemes to retrieve the content of a memory cell coherently. The BB architecture, for instance, is composed of a series of three-level quantum systems (qutrits), described by the states $|\bullet\rangle$, $|\leftarrow\rangle$ and $|\rightarrow\rangle$, which are used to guide a bus signal to the corresponding memory cell. A scheme to access a memory cell addressed by a 3-bit string is shown in Figure 2. In this architecture, each qubit in the address register is sequentially sent into the subsequent levels of the binary tree. These qubits then interact with the corresponding three-level system, whose initial state $|\bullet\rangle$ is changed to $|\leftarrow\rangle$ or $|\rightarrow\rangle$, depending on the address qubits. The three-level systems then act like a routing system which is used to guide a bus signal to the addressed memory cell. In this process, the state of the address qubits becomes entangled with the position state of the bus. The content of the cell is then transferred to the internal degrees of freedom of the bus signal by means of CNOT operations, whose number corresponds to the internal degrees of freedom that must be encoded. The signal is then sent back along the path, and its position state [...] After returning by the same path to the beginning of the tree, the states return to the wait states.
FIGURE 2 | Schematic representation of the BB architecture for an eight-state qRAM. To address the memory cells, only $3 = \log_2(8)$ address qubits are needed. The nodes of the tree are composed of qutrits, which are initially in the wait state. The bit string determines the path to be followed by the bus signal, in which 0 means the left path and 1 the right path. Depending on the bits of the given string, the states of the qutrits are left in $|\leftarrow\rangle$ or $|\rightarrow\rangle$, and the signal follows to the next level. Adapted from [38].
FIGURE 3 | Quantum circuit corresponding to one Flip-Flop iteration of the FF-qRAM algorithm. The classically controlled operations X are applied to the states $|\psi_j\rangle$, and the register $|0\rangle_R$ can include the probability amplitudes for encoding. Note that the complete superposition creation requires the complete circuit implementation. Adapted from [37].
The results can also be returned in an N-dimensional superposition form, if the bit string for addressing is given by a state of n qubits in superposition. The introduction of qutrit systems also has the effect of increasing the width of the circuit, as more quantum systems are introduced for its implementation. The architecture is also not suitable for quantum error-correction algorithms, since their implementation would activate all the qutrits in the system, making it equivalent to the usual FANOUT RAM architecture [36,37]. Possible physical implementations of the BB architecture can be realized in quantum optical and solid state systems [36].
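The routing logic of the bucket-brigade tree can be illustrated with purely classical bookkeeping. The sketch below is not a quantum simulation: it follows a single classical address (no superposition, no entanglement with the bus), and the function name `bucket_brigade_route` and the node labelling are hypothetical. Its only purpose is to show that, out of the N − 1 routing nodes of the tree, just $\log_2(N)$ leave the wait state for a given address.

```python
def bucket_brigade_route(address_bits):
    """Follow one classical address through a binary tree of routing nodes.

    Returns (leaf_index, activated_nodes, total_nodes). Every node starts
    in the "wait" state and is switched to "left" (bit 0) or "right"
    (bit 1) only if it lies on the addressed path.
    """
    n = len(address_bits)
    tree = {(lvl, i): "wait" for lvl in range(n) for i in range(2**lvl)}
    index = 0
    for lvl, bit in enumerate(address_bits):
        tree[(lvl, index)] = "right" if bit else "left"
        index = 2 * index + bit  # descend to the selected child
    activated = [node for node, state in tree.items() if state != "wait"]
    return index, activated, len(tree)


leaf, active, total = bucket_brigade_route([1, 0, 1])
print(leaf, len(active), total)  # 5 3 7
```

For the 3-bit address 101 of an eight-cell memory, three of the seven nodes are activated and the bus reaches cell 5.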
The Flip-Flop qRAM (FF-qRAM) [37] scheme has the advantage of being based on the circuit model for quantum computation, and thus can be implemented as a subroutine of a quantum algorithm to generate a quantum database just by adding the circuit to the state preparation step. The circuit for one Flip-Flop iteration is shown in Figure 3. The operation executed by the complete circuit is given in [37];
in it, $|d^{(l)}\rangle$ encodes the string of the vector, and $|\theta^{(l)}\rangle_R = \cos(\theta^{(l)})|0\rangle_R + \sin(\theta^{(l)})|1\rangle_R$ represents the information about the amplitudes of encoding in superposition of the register qubit R. In this scheme, the CNOT operations applied to the qubits in the basis vectors $|\psi_j\rangle$ are classically controlled by the corresponding bits $d_i^{(l)}$. The gate denoted in the circuit by $\theta^{(l)}$ is a rotation on the register qubit that associates the probability amplitude with the qubits in the database. Note that the database qubits $|\psi_j\rangle$ can be in an arbitrary basis state, and the circuit has the effect of applying the controlled rotation $\theta^{(l)}$ only if the database state $|\psi_j\rangle$ matches the bit string $d^{(l)} = d_0^{(l)} d_1^{(l)} \ldots$, thus only associating the amplitude with the corresponding bit string.
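The flip, rotate, flop structure described above can be checked with a small state-vector simulation. The following is a sketch under explicit assumptions, not the circuit of [37]: `ff_qram_load` is a hypothetical name, the state is stored as a NumPy array of shape (2^n, 2) holding the amplitudes of $|j\rangle|r\rangle_R$, bit strings are passed as integers, and the register rotation is taken to be the real rotation that maps $|0\rangle_R$ to $\cos\theta|0\rangle_R + \sin\theta|1\rangle_R$.

```python
import numpy as np


def ff_qram_load(state, data):
    """Apply one flip-register-flop iteration per (bit string, angle) pair.

    state: array of shape (2**n, 2), amplitudes of |j>|r>_R.
    data:  iterable of (d, theta), with d an integer bit string.
    """
    n_states = state.shape[0]
    all_ones = n_states - 1
    out = state.astype(complex).copy()
    for d, theta in data:
        # Flip: classically controlled X on every qubit where d has a 0,
        # so that |d> (and only |d>) is mapped to |1...1>.
        mask = all_ones ^ d
        perm = np.arange(n_states) ^ mask
        out = out[perm]
        # Register: rotation on R controlled on all database qubits being 1.
        c, s = np.cos(theta), np.sin(theta)
        a0, a1 = out[all_ones]
        out[all_ones] = (c * a0 - s * a1, s * a0 + c * a1)
        # Flop: undo the classically controlled X gates.
        out = out[perm]
    return out


# Two database qubits in a uniform superposition, register in |0>_R.
psi = np.zeros((4, 2))
psi[:, 0] = 0.5
loaded = ff_qram_load(psi, [(0b01, 0.3), (0b10, 1.1)])
```

After the call, only the rows labelled by the stored bit strings 01 and 10 carry a rotated register qubit, while the remaining rows are left untouched. The loop performs one multi-controlled rotation per stored entry, which is the operation whose decomposition dominates the cost.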
According to Ref. [37], the costs of space and time amount to $O(\log_2(N))$ qubits and $O(m \log_2(N))$ multi-qubit operations for creating superpositions of basis states with specific probability amplitudes on a quantum database such as represented by Eq. 1. The information can also be read and updated through repeated iterations of the Flip-Flop scheme. It has the advantage of not depending on proper routing algorithms, as happens with the conventional and BB qRAM architectures [36], and it is based on the quantum circuit computation model, which makes the application of quantum error-correction routines possible [37,39-41]. The major disadvantage of the FF-qRAM architecture is the requirement of multi-controlled qubit rotations, whose cost can surpass the complexity of the rest of the FF-qRAM circuit, as the decomposition of such an operation can considerably increase the depth of the corresponding quantum circuit (see Section 3), depending on the architecture of the hardware on which it must be implemented.
In Table 1, the space and time costs for the preparation schemes are summarized. Like the Divide-and-Conquer algorithm, the BB-based architecture for qRAM presents polylogarithmic time costs. However, it needs $O(N)$ qutrits (represented in brackets), although only $O(\log_2(N))$ of these qutrits are activated during the process, as well as a proper routing algorithm and the $O(\log_2(N))$ address qubits for routing the bus signals to the corresponding memory cells.

GATE DECOMPOSITION COMPLEXITY BOUNDS
Gate decomposition consists of writing general operators that act upon an n-qubit system in the form of simpler gates that can be implemented in a quantum computer. For this purpose, different approaches and techniques have been developed, such as cosine-sine decomposition (CSD) [42], QR decomposition [43], and the Khaneja-Glaser decomposition (KGD) [44], among other methods of no less relevance.
In general, an arbitrary n-qubit gate U is represented by an $N \times N$ matrix, with $N^2$ degrees of freedom, that can be written as a product of $O(N^2)$ two-level unitary operations. To achieve such a decomposition, one can make use of a set of universal gates for computation, i.e., a set of one- and two-qubit operations into which any arbitrary operator U can be decomposed. For instance, it is known that the set of single-qubit and CNOT gates is universal [1]. With respect to the complexity of implementing U in terms of this universal set, the theoretical lower bound amounts to $\frac{1}{4}\left(N^2 - 3\log_2(N) - 1\right)$ CNOT operations [30].
Different approaches to circuit design for gate decomposition are available in the literature. In particular, using the QR approach, the decomposition of U results in a quantum circuit with a gate cost that amounts to $O(N^2 \log_2^3(N))$ elementary operations [32]. Ref. [45] shows a circuit construction in which the CSD method is recursively applied together with uniformly controlled operations, resulting in a cost of $N^2 - 2N$ CNOTs and $N^2$ elementary single-qubit operations for implementing U. Ref. [46] presents a circuit based on the use of Gray codes [47], whose complexity asymptotically matches the theoretical lower bound by reducing the gate cost from $O(N^2 \log_2(N))$ to $O(N^2)$ through the elimination of superfluous control qubits from the corresponding quantum circuit.
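The gap between the theoretical lower bound and the CSD-based construction can be tabulated directly from the two expressions quoted above. In this sketch the function names are hypothetical, and rounding the lower bound up to the next integer is an assumption made here because a gate count must be an integer; the text itself quotes the bound without rounding.

```python
def cnot_lower_bound(n):
    """Ceiling of (N^2 - 3*log2(N) - 1) / 4 with N = 2**n."""
    N = 2**n
    numerator = N**2 - 3 * n - 1
    return (numerator + 3) // 4  # integer ceiling of numerator / 4


def csd_cnot_count(n):
    """N^2 - 2N CNOTs quoted for the recursive CSD construction."""
    N = 2**n
    return N**2 - 2 * N


for n in range(2, 6):
    print(n, cnot_lower_bound(n), csd_cnot_count(n))
# 2 3 8
# 3 14 48
# 4 61 224
# 5 252 960
```

Both counts grow as $N^2 = 4^n$, so the ratio between them approaches a constant (4 for these two expressions) while the absolute cost remains exponential in the number of qubits.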
Although the lower bound of CNOT gates for implementing an arbitrary U has an exponential cost in terms of the number of qubits n, it is possible to reduce the depth of a CNOT-based circuit by means of a space-depth trade-off. This technique consists of using additional ancilla qubits, thus increasing the width of the quantum circuit, to parallelize the CNOT operations that must be realized throughout the circuit to implement the generic n-qubit gate U. The idea was first demonstrated in [48], where it is proved that, making use of $O(n^2)$ ancilla qubits, an n-qubit CNOT circuit can be parallelized to $O(\log_2(n))$ depth. It has also been proved that each n-qubit CNOT circuit can be synthesized with $O\left(\frac{n^2}{\log_2(n)}\right)$ CNOT gates [49]. These results were recently improved [50], showing that it is possible to reduce the number of ancilla qubits presented in [49] by a factor of $\log_2^2(n)$, so that $m = \frac{n^2}{\log_2^2(n)}$ auxiliary qubits suffice to build $O(\log_2(n))$-depth circuits, and also to reduce the depth presented in [49] by a factor of n, thus achieving the asymptotically optimal bound of $O\left(\frac{n}{\log_2(n)}\right)$. This optimization of the space-depth trade-off is summarized in the following way [50]: for any integer $m \geq 0$, any n-qubit CNOT circuit can be parallelized to depth $O\left(\max\left\{\log_2(n), \frac{n^2}{(n+m)\log_2(n+m)}\right\}\right)$, with m standing for the number of ancillas in the composed system. Thus, despite the exponential complexity of decomposing arbitrary n-qubit unitary operators, the space-depth trade-off presents an alternative for optimizing the circuit synthesis. Nevertheless, it is worth considering that this parallel approach requires additional qubits to make the trade-off, which has the immediate effect of increasing the circuit width of a quantum algorithm.
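The space-depth trade-off quoted from [50] can be explored numerically. The sketch below simply evaluates the expression inside the big-O with all constant factors set to one, which is an assumption; `cnot_depth_bound` is a hypothetical name, and the values are meaningful only as a scaling guide, not as actual circuit depths.

```python
import math


def cnot_depth_bound(n, m=0):
    """Evaluate max{log2(n), n^2 / ((n + m) * log2(n + m))} for n >= 2."""
    if n < 2 or m < 0:
        raise ValueError("requires n >= 2 qubits and m >= 0 ancillas")
    total = n + m
    return max(math.log2(n), n**2 / (total * math.log2(total)))


n = 64
no_ancilla = cnot_depth_bound(n, 0)        # behaves as n / log2(n)
many_ancilla = cnot_depth_bound(n, n**2)   # saturates at log2(n)
print(round(no_ancilla, 2), many_ancilla)  # 10.67 6.0
```

Without ancillas the second term dominates and reproduces the $n/\log_2(n)$ scaling, while a sufficiently large m makes the first term dominate, so adding further ancillas no longer reduces the bound.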
It is also worth noting that different architectures for quantum computing may present different sets of basic gates in which the quantum operations must be decomposed, and also other different important aspects, such as connectivity, making the costs of decomposition and implementation of gates also dependent on the architecture of the quantum computer.

COMPLEXITY OF QUANTUM STATE TOMOGRAPHY
Quantum state tomography (QST) is a procedure that aims at the complete reconstruction of an unknown density matrix ρ [1]. Often, for information encoded in amplitudes or phases of a quantum state, after executing a quantum algorithm, one is presented with a density matrix whose elements $\rho_{ij}$ codify the algorithm's output [51]. Information encoded in the complex amplitudes of a quantum state is not directly accessible through trivial means [1]. Thus, QST could represent a fundamental step in obtaining the full solution of a given problem. This consideration is important for a proper comparison between quantum and classical algorithms in which the quantum solution is a superposition state while the classical solution is a vector where all coefficients are known [52]. At the same time, quantum information can be stored in a Hilbert space whose dimension increases exponentially with the number of qubits, and retrieving such information comes at a price: it also requires an exponential number of steps. Alternatively, some global properties of the solution could be obtained by means of the expectation values of some observables, i.e., $\langle O_k \rangle = \mathrm{Tr}(O_k \rho)$ [51]. This latter approach usually leads to a quantum advantage in the processing time; however, it is not straightforward to obtain, from average values of observables, the quantities usually required in practical applications of quantum computing. In this way, the impact of the QST complexity on the overall costs of quantum algorithms must be carefully considered.
There is a variety of QST processes and schemes available to accomplish the characterization task, such as Simple Quantum State Tomography (SQST) [1], Ancilla Assisted Process Tomography 3 (AAPT) [63], QST via Linear Regression Estimation [64], Compressed-Sensing QST [65], Principal Component Analysis [66], efficient process tomography [67] and permutationally invariant tomography schemes [68,69], each of these with particular complexity aspects, being suitable for specific problems. Their different computational costs arise from taking advantage of particular characteristics of ρ.
In general, QST is based on the decomposition of the density matrix in a linear combination of basis operators. For a system of n qubits, the reconstruction of a density matrix ρ in such a space requires $4^n - 1 = N^2 - 1$ basis operators [1], a number that scales polynomially in the dimension, as $O(N^2)$, and therefore exponentially in the number of qubits. These exponential aspects of complexity are well known [70]. Besides the number of basis operators needed for characterization, it is important to recall that the reconstruction of ρ is based on expectation values of those basis operators. For instance, in the case of a single qubit, the set of $4^1 - 1 = 3$ basis operators needed for the proper quantum statistics could be based on the Pauli matrices X, Y, and Z, such that
$$\rho = \frac{1}{2}\left[\mathrm{Tr}(\rho)\,I + \mathrm{Tr}(\rho X)\,X + \mathrm{Tr}(\rho Y)\,Y + \mathrm{Tr}(\rho Z)\,Z\right], \tag{5}$$
where I is the identity operator. This statistical approach requires ensemble measurements of these observables, thus requiring a large number of copies of ρ [1]. Besides these fundamental concepts, it has been shown that by using machine learning theory one could learn information about ρ with a number of measurements that grows linearly with n [71]. Ref. [51] gives a detailed description of the number of measurements and the scaling of the physical resources of the system. There are also models in which the QST problem is converted into a parameter estimation problem such as linear regression [72], for which the computational complexity scales as $O(N^4)$.
³ Although Ref. [63] discusses quantum process tomography, a QST procedure is needed in order to complete the protocol in SQPT and AAPT schemes, and an insight about the complexity of quantum state tomography can be obtained.
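The single-qubit reconstruction of Eq. 5 can be illustrated with a short NumPy sketch. The helper names `reconstruct_qubit` and `estimate_expectation` are hypothetical, Tr(ρ) = 1 is assumed, and the measurement statistics are simulated classically from a known ρ purely to show why many copies of the state are needed; nothing here models a specific device.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def reconstruct_qubit(expectations):
    """Eq. 5 with Tr(rho) = 1: rho = (I + <X> X + <Y> Y + <Z> Z) / 2."""
    ex, ey, ez = expectations
    return 0.5 * (I2 + ex * X + ey * Y + ez * Z)


def estimate_expectation(rho, pauli, shots, rng):
    """Sample +/-1 outcomes of a Pauli measurement and return their mean."""
    p_plus = float(np.clip(np.real(np.trace(rho @ (I2 + pauli))) / 2, 0, 1))
    outcomes = rng.choice([1, -1], size=shots, p=[p_plus, 1 - p_plus])
    return outcomes.mean()


rho = np.array([[0.75, 0.25 - 0.1j], [0.25 + 0.1j, 0.25]])
rng = np.random.default_rng(0)
estimates = [estimate_expectation(rho, P, 20000, rng) for P in (X, Y, Z)]
rho_est = reconstruct_qubit(estimates)
```

With exact expectation values the reconstruction is exact, while with sampled values the error of each entry decreases only as the inverse square root of the number of shots, and three separate measurement settings are required even for a single qubit.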
The overall costs of implementation 4 yielded from SQST is O(N 4 log 2 (N)), and the same relation holds for AAPT using Joint Separable Measurement (JSM) scheme. Both SQST and AAPT-JSM require only single body interactions [51], while the Mutually Unbiased Bases (MUB) and the generalized POVM AAPT-schemes require many-body interactions. The costs for MUB scale as O(N 2 log 2 2 (N))[O(N 2 log 3 2 (N))] under presence of nonlocal [local] two-body interactions, and the POVM scheme as O(N 4 ) measurements on a single copy of the density matrix. The particular aspects of complexity of these schemes of tomography must take into account the required type of interactions between qubits, as nonlocal interactions may be not available in all architectures for quantum computation, which would represent a difficulty for its implementations. It is also worth noticing that AAPT-based schemes require the presence of ancillary systems, which, in practice, have the effect of increasing the system width. SQST has the ability of characterizing the full density matrix of a quantum system, including all probabilities and relative phases, but with a cost exponentially large with respect to the number of qubits that compose the system, making its implementation impractical to characterize output states of circuits with large width of the work system. The Quantum Principal Component Analysis (QPCA) [73], widely applied in machine learning techniques, focuses on reconstructing the eigenvectors of ρ corresponding to the largest eigenvalues of the system in a particular region of the space H, in time O(R log 2 (N)). The full density matrix reconstruction can also be realized with QPCA process, in a number of time steps that amounts to O(RN log 2 (N)) [73]. Compressed-Sensing, in contrast, reconstructs the full density matrix of the system in O(RNlog 2 2 (N)) time steps [74]. 
In particular, the basic idea of Compressed-Sensing is that a low-rank density matrix can be estimated with fewer copies of the state, as the sample complexity depends on its rank R. Ref. [71] introduces the matrix Dantzig selector and matrix Lasso estimators, with sample complexity for obtaining an estimate accurate within ε in trace distance scaling as O( R 2 N 2 ϵ 2 log 2 (N)) for rank-R states, requiring measuring of O(RNpolylog(N)) Pauli expectation values. Finally, in the case where the final density matrix of the work qubits ends up in a state which is permutationally invariant (PI), the tomographic method presented in [68,69] requires only O(log 2 2 (N)) operations. If the density matrix is not perfectly invariant under qubit permutation, the method still provides a satisfactory result at least for those cases where the order of the qubits is not relevant. The PI method is best suited for the tomography of systems which present symmetric quantum states, like Dicke states [72] or spin squeezed states [73].
In practice, all of the costs arising from the measurement schemes used to obtain prior information about the systems under consideration increase the overall cost of implementation on quantum computing devices; these contributions are brought together in section 5. The costs of the tomography schemes are summarized in Table 2.
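As a rough numerical illustration of the scalings quoted above, the following Python sketch evaluates each cost expression for a given number of qubits. It is only a sketch: the constant prefactors hidden by the O-notation are set to 1, which is an assumption made purely for illustration, and the function name and dictionary labels are hypothetical.

```python
import math

def tomography_costs(n_qubits, rank=1):
    """Evaluate the quoted scalings with every O() prefactor set to 1 (assumption)."""
    N = 2 ** n_qubits      # Hilbert space dimension
    log_n = math.log2(N)   # equals the number of qubits
    return {
        "SQST / AAPT-JSM": N**4 * log_n,
        "AAPT-MUB (nonlocal)": N**2 * log_n**2,
        "AAPT-MUB (local)": N**2 * log_n**3,
        "AAPT-POVM": N**4,
        "QPCA (full matrix)": rank * N * log_n,
        "Compressed-Sensing": rank * N * log_n**2,
        "PI tomography": log_n**2,
    }

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"n = {n} qubits")
        for scheme, cost in tomography_costs(n).items():
            print(f"  {scheme:22s} {cost:.3g}")
```

Because the prefactors are dropped, only the growth with the number of qubits is meaningful here, not the absolute values.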

TABLE 2 | Tomography scheme versus overall process cost. The overall complexity is defined as in [51], given by the number of copies of ρ times the number of gates per measurement.

Frontiers in Physics | www.frontiersin.org November 2021 | Volume 9 | Article 731007

Pure State Tomography
There exist certain procedures where one is not interested in the full description of the resulting state ρ (e.g., some special cases of the algorithm in [14]). Instead, let us assume that the output of the algorithm is fully encoded in the squared moduli of the state's amplitudes, i.e., if $|\Psi\rangle = \sum_{m=1}^{N} \langle m|\Psi\rangle |m\rangle$ is the output of the algorithm, then all one needs to know is each $|\langle m|\Psi\rangle|^2$. More generally, one may be interested in knowing the squared amplitudes associated with only a subspace of H. An example of this is considered in [74], where it is assumed that the output of the algorithm can be written as

$$|\Psi\rangle = \frac{1}{\mathcal{N}}\left(|0\rangle|\Psi_0\rangle + |1\rangle|\Psi_1\rangle\right),$$

where the first qubit is an auxiliary one, $|\Psi_0\rangle = \sum_{m=1}^{N} \alpha_m |m\rangle$ is the target state (written in terms of the computational basis of the subsystem), $|\Psi_1\rangle$ is an arbitrary state, and $\mathcal{N}$ is a normalization constant that may depend on N. The probability of success p corresponds to the probability of finding the auxiliary qubit in the state $|0\rangle$, which may be computed as

$$p = \frac{\langle\Psi_0|\Psi_0\rangle}{\mathcal{N}^2} = \frac{1}{\mathcal{N}^2}\sum_{m=1}^{N}|\alpha_m|^2.$$

Moreover, the probability of finding the system in the state $|0\rangle|m\rangle$ is $p_m = |\alpha_m|^2/\mathcal{N}^2$, here assumed to be non-null for every m. As explained in [51], each $p_m$ can be estimated by performing $M_m$ independent measurements, each requiring one copy of $|\Psi\rangle$. After these trials, the probability $p_m$ is estimated as $\tilde{p}_m = n_m/M_m$, where $n_m$ is the number of occurrences of $|0\rangle|m\rangle$. By statistical arguments, Ref. [51] shows that the number of trials necessary to estimate $p_m$ up to a relative precision Δ with probability $1 - \epsilon$,⁵ denoted by $M_m(\Delta, \epsilon)$, is bounded as

$$M_m(\Delta, \epsilon) \le \frac{C(\Delta, \epsilon)}{p_m}, \qquad (8)$$

where $C(\Delta, \epsilon) \equiv \frac{3}{\Delta^2} \log_2(1/\epsilon)$ does not depend on the system's size. Denoting now the square of the normalized amplitude by $\beta_m^2 \equiv |\alpha_m|^2/\langle\Psi_0|\Psi_0\rangle$, then $p_m = |\alpha_m|^2/\mathcal{N}^2 = p\,\beta_m^2$, and thus, from Eq. 8, the behavior of $M_m$ can be determined from the behavior of p and $\beta_m^2$ as functions of N. Finally, let us assume that each $|\alpha_m|^2$ goes to 0 at the same rate as N grows, i.e., $|\alpha_m|^2 = O(N^{-r})$ for r > 0 and all m. A particular case occurs when the discrete probability distribution $\{\beta_m^2\}$ is fairly uniform, for which r = 1. Therefore, since $\beta_m^2 = |\alpha_m|^2/\sum_m |\alpha_m|^2$, one has $\beta_m^2 \sim 1/N$, and from Eq. 8 the number of copies of $|\Psi\rangle$ necessary to determine every $p_m$, which can be taken as $M \equiv \max_m M_m$, is such that

$$M = O\left(\frac{N}{p}\right).$$

We conclude that if p has a non-null minimum as a function of N, then the computational complexity of the tomography of all the $p_m$ is of order N. Otherwise, one needs to determine the asymptotic behavior of the success probability p as N grows (e.g., Ref. [14]).
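The counting argument above can be checked numerically. The sketch below is illustrative only: it assumes that the bound of Eq. 8 has the form $M_m \le C(\Delta, \epsilon)/p_m$ with $C(\Delta, \epsilon) = (3/\Delta^2)\log_2(1/\epsilon)$, and that the distribution $\{\beta_m^2\}$ is exactly uniform. The function names are hypothetical.

```python
import math

def c_factor(delta, eps):
    """C(Delta, eps) = (3 / Delta^2) * log2(1 / eps); independent of the system size."""
    return 3.0 / delta**2 * math.log2(1.0 / eps)

def trials_needed(p_m, delta, eps):
    """Copies of |Psi> sufficient to estimate p_m, assuming M_m <= C / p_m (assumed form of Eq. 8)."""
    return math.ceil(c_factor(delta, eps) / p_m)

def trials_uniform(N, p, delta, eps):
    """Uniform case: beta_m^2 = 1/N, so p_m = p / N and the count grows linearly with N."""
    return trials_needed(p / N, delta, eps)

if __name__ == "__main__":
    # Success probability p held fixed while the dimension N = 2^n grows.
    for n in (2, 4, 6, 8):
        N = 2 ** n
        print(n, N, trials_uniform(N, p=0.5, delta=0.1, eps=0.05))
```

With p held fixed, doubling N doubles the number of copies, which is the linear scaling stated in the text; if p itself decreases with N, the count grows correspondingly faster.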

OVERALL COMPLEXITY OF IMPLEMENTATION
The overall complexity of implementing a quantum algorithm accounts for all tasks that must be executed. It must take into account the total resources, such as the number of work and ancilla qubits, represented by the width of the circuit (which may include qRAM systems), as well as the usual gate cost, combined with the number of measurements. The latter is the number of copies times the number of measurements per copy performed on the final state in order to reconstruct its statistical averages and features.
Space costs: As discussed in section 2, a generic superposition can be prepared by acting directly on the work system with quantum gates that implement the transformations defined by the free parameters of the state. The resulting space cost corresponds to the dimension of the work system alone. Assuming that this system has the Hilbert space dimension of an n-qubit space, $O(\log_2 N)$ qubits are needed for its implementation. The Divide-and-Conquer scheme requires a circuit width of $O(N)$, and it is worth noting that it makes use of ancilla qubits that are left entangled with the work system. The qRAM schemes discussed have similar qubit requirements, but the presence of routing and of $O(N)$ qutrits (although this is not the number of qutrits activated during a memory call) in the BB architecture makes it less favorable for the implementation of gate-based algorithms.
Gate or time costs: To analyze the overall gate complexity of an implementation, we must also consider the number of identical copies of ρ needed for its reconstruction under a given tomography scheme [51]. The overall cost of these schemes appears as a multiplicative factor in the full time-cost analysis, since all the operations in the implementation of the quantum algorithm, from preparation to readout, must be repeated that many times.
Preparation: The overall time cost of the preparation step depends on whether it is implemented by operating directly on the work system, based on the free parameters of the state, or by queries to a previously prepared quantum RAM device⁶. With preparation based on the free parameters, the number of quantum operations is upper bounded by $O(N)$ for an N-dimensional quantum superposition. The Divide-and-Conquer quantum algorithm can create an entangled superposition between ancilla and work systems with an $O(\log_2^2 N)$ circuit depth. The Bucket-Brigade qRAM architecture [36] also requires $O(\log_2^2 N)$ time steps, as discussed in section 2. The preparation implemented via the FF-qRAM scheme is fully based on the quantum circuit model, without any routing algorithm to address the memory cells that must be queried throughout the transformation represented by Eq. 1. The number of gate operations in the FF-qRAM amounts to $O(\log_2 N)$ [37].
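To make the comparison between preparation schemes concrete, the sketch below evaluates the gate or time-step counts quoted above for a register of n qubits. It is a minimal illustration: the O-notation prefactors are set to 1 (an assumption), and the function name and scheme labels are hypothetical.

```python
import math

def preparation_costs(n_qubits):
    """Time cost of each preparation scheme, with O() prefactors set to 1 (assumption)."""
    N = 2 ** n_qubits
    log_n = math.log2(N)
    return {
        "free parameters": N,            # O(N) gate operations
        "divide-and-conquer": log_n**2,  # O(log2^2 N) circuit depth
        "BB-qRAM": log_n**2,             # O(log2^2 N) time steps
        "FF-qRAM": log_n,                # O(log2 N) gate operations
    }

if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, preparation_costs(n))
```

This comparison covers time only. The shorter depth of the divide-and-conquer scheme comes with an O(N) circuit width, which a table of this kind does not show.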
Evolution: We use the term evolution to denote the process in which the previously prepared work system is evolved to its final configuration, which could represent, for instance, the solution of a system of linear equations [11] or of a system of coupled differential equations [14], among other possible applications of quantum computation. The quantum algorithm is composed of a sequence of well-defined steps and operations that transform the initial state under linear operations, which can be controlled by ancilla qubits belonging to the full system under consideration. The evolution process is denoted here as a linear map, represented by ε, as in Ref. [49]. The gate and resource costs of a given algorithm depend on the tasks executed in its implementation, so different quantum algorithms have distinct space and time costs. To represent generically the time cost of the processing step of the algorithm, we define a function C(ε), which excludes the steps of preparation and measurement of the quantum states.

⁵ This means precisely that $|\tilde{p}_m - p_m|/|p_m| \le \Delta$ with probability $1 - \epsilon$.
⁶ The complexity of preparing a quantum RAM device is beyond the scope of the present work.

Readout: The readout analysis must account for the number of gates per measurement needed to characterize an N-dimensional quantum system. For both SQST and AAPT-JSM, $O(\log_2 N)$ single-qubit operations must be implemented in order to reconstruct the density matrix. For AAPT-MUB based schemes, one needs $O(\log_2^2 N)$ [$O(\log_2^3 N)$] single- and two-qubit gates in the presence of nonlocal [local] two-body interactions. The gate cost of the POVM scheme scales as $O(N^4)$ [49] operations per measurement. There are also particular methods for reconstructing ρ, such as QPCA [67] and Compressed-Sensing, which can reconstruct the density matrix with up to $O(R \log_2 N)$ and $O(RN \log_2^2 N)$ gates, respectively, where R stands for the rank of the density matrix under reconstruction [65]. For permutationally invariant systems, the PI tomography scheme has a measurement cost that scales quadratically with the number of qubits of the composite system [66,67]. The PI method also yields approximate results for the density matrix when the system is not invariant under permutations. Applying these techniques requires some prior knowledge of ρ, such as the existence of dominant eigenvalues in some region of the composite Hilbert space [67] or the sparsity of ρ. Since we assume that no prior information about ρ is available, we shall not include them in the overall complexity analysis.
Overall Complexity: The overall gate cost of implementing a quantum algorithm is now classified according to each of the techniques discussed in the previous sections, including the preparation and measurement schemes. The first multiplicative factor in each of the bounds presented stands for the number of experimental samples needed by each measurement scheme, which is $O(N^4 \log_2 N)$ for both SQST and JSM, $O(N^2 \log_2^2 N)$ [$O(N^2 \log_2^3 N)$] for MUB, and $O(1)$ for POVM. We do not include the QPCA and Compressed-Sensing methods in this analysis, since we assume that no further information about the density matrix (such as its rank R) is known. Among the preparation methods considered, the free-parameter approach has an upper bound of $O(N)$ operations, while the divide-and-conquer algorithm and the BB-qRAM architecture share the upper bound of $O(\log_2^2 N)$ quantum operations for preparing a state in a generic superposition. Using FF-qRAM, this bound is improved to $O(\log_2 N)$ operations. The evolution cost is represented generically by the function C(ε). All this information is brought together in Table 3. Table 4 presents the possible choices of state preparation and measurement schemes suitable for tasks often approached by circuit-based quantum algorithms.
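A minimal sketch of how these factors might combine is given below. It assumes, following the description above, that the total gate count is the number of experimental samples multiplied by the sum of the preparation, evolution and readout costs of a single run; this product form is an assumption drawn from the text, not a reproduction of Table 3. Prefactors are set to 1, the evolution cost C(ε) is passed in as a plain placeholder number, and all names are hypothetical.

```python
import math

def overall_gate_cost(n_qubits, samples, preparation, readout_gates, evolution_cost):
    """samples * (preparation + evolution + readout gates per measurement); assumed form."""
    N = 2 ** n_qubits
    per_run = preparation(N) + evolution_cost + readout_gates(N)
    return samples(N) * per_run

# Sample counts and per-measurement gate counts as quoted in the text (prefactors = 1).
SAMPLES = {
    "SQST": lambda N: N**4 * math.log2(N),
    "MUB (nonlocal)": lambda N: N**2 * math.log2(N)**2,
    "POVM": lambda N: 1,
}
READOUT_GATES = {
    "SQST": lambda N: math.log2(N),
    "MUB (nonlocal)": lambda N: math.log2(N)**2,
    "POVM": lambda N: N**4,
}
PREPARATION = {
    "free parameters": lambda N: N,
    "FF-qRAM": lambda N: math.log2(N),
}

if __name__ == "__main__":
    # evolution_cost=10 is an arbitrary placeholder standing in for C(epsilon).
    for scheme in SAMPLES:
        cost = overall_gate_cost(3, SAMPLES[scheme], PREPARATION["FF-qRAM"],
                                 READOUT_GATES[scheme], evolution_cost=10)
        print(f"{scheme:15s} {cost:.3g}")
```

Passing the scalings in as functions keeps the preparation, evolution and readout choices independent, which mirrors the way the text treats them as separately selectable schemes.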

CONCLUSION
We have presented a theoretical overview of the total complexity of implementing circuit-based quantum algorithms, covering the encoding of the system parameters in the initial state of the work/register qubits, the evolution step towards the final state encoding the solution of the problem, and the readout of this solution. A comparison between several schemes for the preparation of input states, as well as for the tomography of final states, was provided.
It is important to note that algorithms that depend on the preparation of input states as superpositions of the basis states require at least $O(N)$ gate operations, based on the number of free parameters, N, defined by the initial state of the work qubits. Once a FF-qRAM device is available, this complexity can be reduced to $O(\log_2 N)$, that is, linear in the number of qubits.
The evolution step can be represented by a linear map ε from the initial state to the final state. Its time cost, C(ε), depends strongly on the quantum algorithm, and usually shows an exponential speedup compared with the classical algorithm solving the same problem. This speedup originates in the nature of the Hilbert space, i.e., the ability of a given number of qubits to encode an exponential number of states. Concerning the readout of the solution encoded in the final state, we have carried out a generic analysis assuming a fairly uniform probability distribution over the basis states of the Hilbert space. In this case, if the desired result is encoded in a single amplitude of a given basis state, the number of required ensemble copies scales as $O(N)$ in the best scenario, which means a cost that is at least exponential in the number of qubits. It is also important to mention that expectation values of observables which