APPARATUS AND METHOD FOR CONTROLLING TRANSMISSION POWER BASED ON REINFORCEMENT LEARNING
20220408375 · 2022-12-22
Assignee
Inventors
CPC classification
H04W52/241
ELECTRICITY
Y02D30/70
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
A method of controlling transmission power for wireless communication includes obtaining detected transmission power; generating a state variable and a reward variable of a reinforcement learning model based on the detected transmission power, threshold transmission power, and a channel state; and training a reinforced learning agent based on the state variable and the reward variable to output an action variable of the reinforcement learning model representing the transmission power.
Claims
1. A method of controlling transmission power for wireless communication, the method comprising: obtaining detected transmission power; generating a state variable and a reward variable based on the detected transmission power, a threshold transmission power, and a channel state; and training a reinforced learning agent based on the state variable and the reward variable to output an action variable representing the transmission power.
2. The method of claim 1, wherein the generating of the state variable and the reward variable comprises: calculating a transmission power residual rate of a unit period based on the threshold transmission power and the detected transmission power.
3. The method of claim 2, wherein the generating of the state variable and the reward variable comprises: obtaining an environment variable based on at least one communication parameter indicating the channel state; and calculating the state variable based on the transmission power residual rate and the environment variable.
4. The method of claim 2, wherein the generating of the state variable and the reward variable comprises: calculating the reward variable as a positive value based on the transmission power residual rate and the channel state when the transmission power residual rate is positive.
5. The method of claim 4, wherein the calculating of the reward variable comprises: calculating an average error rate during the unit period; and calculating the reward variable based on the transmission power residual rate and the average error rate.
6. The method of claim 1, wherein the training of the reinforced learning agent comprises generating, by the reinforced learning agent, the action variable based on the state variable and the reward variable, and the generating of the action variable comprises randomly generating the action variable with a probability ε, and greedily generating the action variable with a probability (1-ε).
7. The method of claim 6, wherein the training of the reinforced learning agent further comprises: gradually reducing the probability ε.
8. The method of claim 6, wherein the greedily generating of the action variable comprises: setting a range of transmission power based on a transmission power of a previous unit period; calculating a plurality of Q-values of Q-learning respectively corresponding to a plurality of transmission power candidates included in the range of transmission power; selecting one transmission power candidate from among the transmission power candidates based on the plurality of Q-values; and generating the action variable and updating a Q-table based on the selected transmission power candidate.
9. The method of claim 8, wherein the range of transmission power includes the transmission power of the previous unit period.
10. The method of claim 8, wherein the selecting of the transmission power candidate comprises: applying a weight to at least one transmission power candidate, from among the plurality of transmission power candidates, that is equal to or less than the threshold transmission power; and selecting, as the selected transmission power candidate, a transmission power candidate corresponding to the largest sum of a weight and a Q-value from among the transmission power candidates.
11. The method of claim 1, wherein the threshold transmission power is defined based on a specific absorption rate (SAR).
12. The method of claim 1, further comprising: adjusting the transmission power based on the action variable.
13. An apparatus comprising: a memory configured to store instructions; and at least one processor configured to communicate with the memory and, by executing the instructions, control transmission power for wireless communication, wherein, to control the transmission power, the at least one processor is configured to obtain detected transmission power; generate a state variable and a reward variable based on the detected transmission power, a threshold transmission power, and a channel state; and train a reinforced learning agent based on the state variable and the reward variable to output an action variable representing the transmission power.
14. The apparatus of claim 13, wherein the at least one processor is configured to calculate a transmission power residual rate of a unit period based on the threshold transmission power and the detected transmission power to generate the state variable and the reward variable.
15. The apparatus of claim 13, wherein, to train the reinforced learning agent, the at least one processor is further configured to: set a range of transmission power based on a transmission power of a previous unit period, calculate a plurality of Q-values of Q-learning respectively corresponding to a plurality of transmission power candidates included in the range of transmission power, select one transmission power candidate from among the transmission power candidates based on the plurality of Q-values, and generate the action variable and update a Q-table based on the selected transmission power candidate.
16. A method of controlling transmission power for wireless communication, the method comprising: obtaining detected transmission power; and training a reinforced learning agent, based on the detected transmission power, a threshold transmission power, and a channel state, to output an action variable representing the transmission power, wherein the training of the reinforced learning agent comprises setting a range of transmission power based on a transmission power of a previous unit period; calculating a plurality of Q-values of Q-learning respectively corresponding to a plurality of transmission power candidates included in the range of transmission power; selecting one transmission power candidate from among the plurality of transmission power candidates based on the plurality of Q-values; and generating the action variable and updating a Q-table based on the selected transmission power candidate.
17. The method of claim 16, wherein the range of transmission power comprises the transmission power of the previous unit period.
18. The method of claim 17, wherein the selecting of the transmission power candidate comprises: applying weights to at least one transmission power candidate, from among the plurality of transmission power candidates, that is equal to or less than the threshold transmission power; and selecting, as the selected transmission power candidate, a transmission power candidate corresponding to the largest sum of a weight and a Q-value from among the transmission power candidates.
19. The method of claim 16, wherein the threshold transmission power is defined based on a specific absorption rate (SAR).
20. The method of claim 16, further comprising: adjusting the transmission power based on the action variable.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Some example embodiments of the inventive concepts will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0021] Hereinafter, some example embodiments of the technical idea of the inventive concepts will be described in detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and repeated descriptions thereof are omitted.
[0023] A base station (BS) 1 may generally refer to a fixed station that communicates with a user equipment (UE) and/or other base stations. The base station 1 may, for example, exchange data and control information with the UE and/or the other base stations by communicating with the UE and/or the other base stations. In some example embodiments, the BS 1 may be referred to as a Node B, an evolved-Node B (eNB), a next generation Node B (gNB), a sector, a site, a base transceiver system (BTS), an access point (AP), a relay node, a remote radio head (RRH), a radio unit (RU), a small cell, etc. Herein, a BS or a cell may be understood as a comprehensive term indicating a portion and/or a function covered by a base station controller (BSC) in CDMA, a Node-B in WCDMA, an eNB in LTE, a gNB in 5G, and/or a sector (site); and may include various coverage areas like a megacell, a macrocell, a microcell, a picocell, a femtocell, a relay node, an RRH, an RU, and/or a small cell communication range.
[0024] The UE 100 may refer to equipment that is stationary and/or mobile and which may communicate with a base station (e.g., the BS 1), to transmit and/or receive data and/or control information. For example, the UE 100 may be referred to as a terminal, a terminal equipment, a mobile station (MS), a mobile terminal (MT), a user terminal (UT), a subscriber station, a wireless device, a handheld device, etc. Hereinafter, example embodiments will be described primarily with reference to the UE 100 as a wireless communication device, but it will be understood that the example embodiments are not limited thereto.
[0025] A wireless communication network between the UE 100 and the BS 1 may support communication by multiple users by sharing available network resources. For example, in a wireless communication network, information may be transmitted in various multiple access schemes (such as code division multiple access (CDMA), frequency division multiple access (FDMA), time division multiple access (TDMA), orthogonal frequency division multiple access (OFDMA), single carrier frequency division multiple access (SC-FDMA), OFDM-FDMA, OFDM-TDMA, OFDM-CDMA, and/or the like). As shown in
[0026] The UE 100 may include an antenna 120, a transceiver 140, and processing circuitry 160, as shown in
[0027] The antenna 120 may receive a signal transmitted by the BS 1 and/or may output a signal to be transmitted to the BS 1. In some embodiments, the antenna 120 may be and/or include an antenna array including a plurality of antennas (e.g., for multiple-input multiple-output (MIMO)). In some embodiments, the antenna 120 may include a phased array for beam forming.
[0028] The transceiver 140 may process a signal received through the antenna 120 and/or a signal to be transmitted through the antenna 120. For example, the transceiver 140 may include at least one RX path for processing respective radio frequency (RF) signals received through the antenna 120 in a reception mode and at least one TX path for generating respective RF signals to be transmitted through the antenna 120 in a transmission mode. In some embodiments, an RX path may include a low noise amplifier (LNA), a filter, a mixer, etc., whereas a TX path may include a power amplifier (PA), a filter, a mixer, etc. As shown in
[0029] The power detector 142 may detect (and/or measure) the power (e.g., transmission power) of a signal output from the transceiver 140 to the antenna 120. For example, the power detector 142 may detect transmission power by detecting the power of a signal fed back through an RX path not used in the transmission mode. As shown in
[0030] In a high frequency band like a millimeter wave (mmWave) band, a short-wavelength signal may propagate with strong straightness, and thus, the quality of communication may depend on conditions of the path of the uplink UL (and/or downlink DL). For example, the quality may be affected by interruption (e.g., by an obstacle) and/or by the orientation of an antenna. Therefore, in some wireless communication systems using a high frequency band for increasing throughput, a transmitter may compensate by using high transmission power. Also, when the antenna 120 includes a plurality of antennas for beam forming, spatial diversity, polarization diversity, spatial multiplexing, etc., and/or the UE 100 supports simultaneous access to two or more wireless communication systems (e.g., dual connectivity), total radiated power (TRP) output from the UE 100 may increase. Therefore, a user of the UE 100 may be exposed to high-density electromagnetic waves during, e.g., an uplink UL transmission.
[0031] Metrics like a specific absorption rate (SAR) and a maximum permissible exposure (MPE) may be used to define a safe limit for the energy absorbed by a human body due to electromagnetic waves, and organizations like the Federal Communications Commission (FCC) of the United States of America may regulate the upper limits of values that wireless communication devices have to comply with. For example, an upper limit of energy measured from the UE 100 (e.g., for a certain measurement period) may be set, and the measurement period may vary according to, e.g., the frequency band. Therefore, the UE 100 may limit an average of the output energy during a measurement period, even when it is allowed to use high transmission power for a short period. Hereinafter, some example embodiments will be mainly described with reference to the SAR of electromagnetic waves, and the values that the UE 100 has to comply with will be referred to as SAR conditions.
[0032] The processing circuitry 160 may extract information provided by the BS 1 from a signal received from the transceiver 140 in the reception mode. For example, the processing circuitry may be configured to extract information from a payload of the BS 1. The processing circuitry 160 may also provide information to be transmitted to the BS 1 in the transmission mode, e.g., a signal including a payload of the UE 100, to the transceiver 140. In some embodiments, the processing circuitry 160 may include and/or be included in hardware including logic circuits; a hardware/software combination such as a processor executing software; or a combination thereof. For example, the processing circuitry may include, and/or be included in, at least one of programmable components like a central processing unit (CPU) and a digital signal processor (DSP), reconfigurable components like a field programmable gate array (FPGA), and/or components that provide fixed functions like an intelligent property (IP) core. In some embodiments, the processing circuitry 160 may be referred to as a communication processor, a baseband processor, a modem, etc.
[0033] In some embodiments, the processing circuitry 160 may control transmission power based on reinforced learning. For example, as shown in
[0034] In some embodiments, an optimal transmission power to comply with the SAR conditions and provide desired communication quality may be derived by the reinforcement learning model 162. To this end, the reinforcement learning model 162 may derive the transmission power of a unit period based on the detected transmission power P.sub.DET, the SAR conditions, and/or a channel state. Therefore, a user's exposure to electromagnetic waves may be effectively reduced while reducing and/or minimizing the potential degradation of wireless communication quality. Also, despite variations among wireless communication devices, transmission power efficiency may be improved and/or optimized for each wireless communication device based on individual reinforced learning.
[0036] In some embodiments, the transmission power may be controlled in each unit period within the maximum transmission power P.sub.LIM. For example, as shown in
[0037] The total transmission power during n consecutive unit periods may be required to satisfy the SAR conditions, and thus, an average of n pieces of transmission power respectively corresponding to the n unit periods may be less than or equal to the threshold transmission power P.sub.THR.
[0038] For example, in some embodiments, the transmission power of a unit period may be controlled based on the threshold transmission power P.sub.THR. For example, when unit periods (e.g., U1, U2, U3, etc.) each having transmission power less than the threshold transmission power P.sub.THR are consecutive, the margin of transmission power in unit periods following the corresponding unit periods may increase. On the other hand, when unit periods (e.g., U5, Un+1, Un+2, Un+3, etc.) each having transmission power greater than the threshold transmission power P.sub.THR are consecutive, the margin of transmission power in unit periods following the corresponding unit periods may decrease. Hereinafter, with reference to the drawings, examples of the operation of determining the optimal transmission power in a unit period based on a reinforcement learning model will be described.
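The margin bookkeeping described above can be sketched in a few lines of Python (a minimal sketch for illustration only; the function names and the simple-average form are assumptions, not the patent's exact formulation):

```python
def average_power(powers):
    """Average transmission power over n consecutive unit periods."""
    return sum(powers) / len(powers)

def sar_margin(powers, p_thr):
    """Margin of the threshold transmission power P_THR over the n-period
    average; the SAR condition requires the average to stay at or below P_THR."""
    return p_thr - average_power(powers)

# Unit periods below P_THR build up margin for the following unit periods,
# while consecutive periods above P_THR consume it.
margin_low = sar_margin([10, 10, 12, 12], p_thr=14)   # positive margin
margin_high = sar_margin([16, 16, 18, 18], p_thr=14)  # negative margin
```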
[0040] The agent 320 may receive a state and a reward from the environment 340 and may provide an action to the environment 340. The agent 320 may be trained to provide an action corresponding to the maximum reward in a state received from the environment 340. For example, the agent 320 may include a Q (quality)-table 322 and may be trained by updating the Q-table 322 based on a reward received from the environment 340. The Q-table 322 may include Q-values including immediate rewards and the maximum values of future rewards for combinations of states and actions, respectively. In some embodiments, as described below with reference to
[0041] In some embodiments, the environment 340 may generate a state and a reward based on the detected transmission power P.sub.DET, the threshold transmission power P.sub.THR, and/or a channel state. Also, the agent 320 may generate an action indicating the transmission power of a unit period based on a state provided from the environment 340 and the Q-table 322. Therefore, the agent 320 may be trained to determine the optimal transmission power considering both the SAR condition and communication quality in the equipment in which the reinforcement learning model 300 is implemented (e.g., in the UE 100 of
[0043] Referring to
P.sub.OUT=∫.sub.0.sup.U P.sub.DET dt [Equation 2]
[0044] In operation S400, a state variable and a reward variable may be generated. For example, the environment 340 may generate a state variable and a reward variable based on the output transmission power P.sub.OUT, the threshold transmission power P.sub.THR, and/or a channel state. Examples of operation S400 will be described later with reference to
[0045] In operation S600, the agent 320 may be trained. For example, the agent 320 may be trained to generate an action variable representing transmission power based on the state variable and the reward variable generated in operation S400. In some embodiments, a range of transmission power may be set, and the agent 320 may be trained based on an action variable corresponding to the optimal transmission power within the set range as well as an environment variable and a reward variable corresponding to the action variable. In some embodiments, the agent 320 may generate an action variable for maximum reward (e.g., greedily) and/or may generate an action variable randomly. An example of the operation S600 will be described below with reference to
[0046] In operation S800, the transmission power may be adjusted. For example, the processing circuitry 160 may identify a magnitude of transmission power corresponding to an action variable provided from the agent 320 trained in operation S600 and control the transceiver 140 based on the identified magnitude of transmission power, thereby adjusting the transmission power. For example, in some embodiments, the output power of a power amplifier included in the transceiver 140 may be adjusted, and thus, the transmission power may be adjusted accordingly.
[0048] Referring to
[0049] In operation S420, a state variable may be calculated. As described above with reference to

s.sub.t=((P.sub.THR−P.sub.OUT)/P.sub.THR)×100+θ [Equation 3]
[0050] The first term on the right side of [Equation 3] may correspond to the percentage of a difference (e.g., residual transmission power) between the threshold transmission power P.sub.THR and the output transmission power P.sub.OUT with respect to the threshold transmission power P.sub.THR and may be referred to herein as a transmission power residual rate. Accordingly, when the transmission power in a unit period is less than the threshold transmission power P.sub.THR, the residual transmission power and the transmission power residual rate may be positive values. Meanwhile, when the transmission power exceeds the threshold transmission power P.sub.THR in the unit period, the residual transmission power and the transmission power residual rate may be negative values. As shown in [Equation 3], the state variable s.sub.t may correspond to the sum of the transmission power residual rate and an environment variable θ, and may be based on both the SAR condition and the channel state.
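In code form, the state variable of [Equation 3] may be sketched as follows (a sketch; the function names are assumptions for illustration):

```python
def residual_rate(p_thr, p_out):
    """Transmission power residual rate: the residual transmission power
    (P_THR - P_OUT) as a percentage of the threshold P_THR."""
    return (p_thr - p_out) / p_thr * 100.0

def state_variable(p_thr, p_out, theta):
    """State s_t per [Equation 3]: the transmission power residual rate plus
    the environment variable theta derived from communication parameters."""
    return residual_rate(p_thr, p_out) + theta

# Positive residual rate when the unit period stayed below the threshold,
# negative when the transmission power exceeded the threshold.
s_ok = state_variable(p_thr=20.0, p_out=15.0, theta=2.0)
s_over = state_variable(p_thr=20.0, p_out=25.0, theta=2.0)
```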
[0052] Referring to
[0053] When positive residual transmission power is determined, a positive reward variable may be calculated in operation S440. When positive residual transmission power is generated due to the transmission power determined by an action variable of the agent 320, the environment 340 may provide a positive reward to the agent 320. An example of operation S440 will be described below with reference to
[0054] When zero or negative residual transmission power is determined, the reward variable may be set to zero and/or a negative reward variable may be calculated in operation S450. When zero residual transmission power and/or residual transmission power less than a predefined (and/or otherwise determined) positive reference value is generated due to the transmission power determined by the action variable of the agent 320 (e.g., when transmission is impossible due to a radio link failure (RLF) and/or the like), the environment 340 may provide zero reward and/or a negative reward to the agent 320. Therefore, the agent 320 may be trained to generate positive residual transmission power.
[0056] Referring to
[0057] In operation S444, an average error rate of a unit period may be calculated. For example, the environment 340 may obtain error rates that occurred in transmission during the unit period and may calculate an average of the obtained error rates. In some embodiments, the environment 340 may calculate an average of block error rates (BLER) of a physical uplink shared channel (PUSCH) of a unit period. The average error rate of the unit period may represent a channel state, and, as described below, the reward variable may decrease as the average error rate of the unit period increases.
[0058] In operation S446, a reward variable may be calculated. For example, the environment 340 may calculate the reward variable r.sub.t based on [Equation 4] below.
r.sub.t=(P.sub.THR−P.sub.OUT)−C×B.sub.AVG [Equation 4]
[0059] In [Equation 4], B.sub.AVG may denote the average error rate of the unit period, and a correlation coefficient C may have a value such that a positive reward variable r.sub.t is obtained. Therefore, the reward variable r.sub.t may increase as the margin of the transmission power increases and the channel state improves, and may decrease as the margin of the transmission power decreases and the channel state worsens.
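[Equation 4] and the averaging of operation S444 may be sketched as follows (an illustrative sketch; the function names are assumptions):

```python
def average_error_rate(block_error_rates):
    """Average of the block error rates (BLER) observed during a unit period."""
    return sum(block_error_rates) / len(block_error_rates)

def reward_variable(p_thr, p_out, b_avg, c):
    """Reward r_t per [Equation 4]: the transmission power margin
    (P_THR - P_OUT) minus a penalty C * B_AVG for the average error rate."""
    return (p_thr - p_out) - c * b_avg

# Larger margin and better channel (lower B_AVG) -> larger reward.
r = reward_variable(p_thr=20.0, p_out=15.0,
                    b_avg=average_error_rate([0.1, 0.3]), c=10.0)
```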
[0061] Referring to
[0062] In operation S630, a random number may be compared with a reference value ε. The reference value ε may be included in the range of the random number generated in operation S610, and the probability that the random number exceeds the reference value ε may depend on the size of the reference value ε. For example, when a random number is generated in the range from 0 to 1 and the reference value ε is 0.5, the probability that the random number exceeds the reference value ε may be approximately 0.5. As shown in
[0063] When the random number is less than or equal to the reference value ε, an action variable may be randomly generated in operation S650. When the agent 320 is repeatedly trained to generate an action variable corresponding to the highest reward variable (that is, greedily), the trained agent 320 may generate a locally optimal action variable. Therefore, the agent 320 may randomly generate an action variable with a particular probability, that is, the probability that the random number is less than or equal to the reference value ε. In some embodiments, the agent 320 may generate an action variable based on the random number generated in operation S610. Also, in some embodiments, the agent 320 may randomly generate an action variable within the range of transmission power to be described below with reference to
[0064] When the random number is greater than the reference value ε, an action variable may be greedily generated in operation S670. For example, the agent 320 may be trained to receive an immediate reward and a maximum future reward and may update a Q-table based on [Equation 5] below.

Q(s.sub.t,a.sub.t)←Q(s.sub.t,a.sub.t)+β(r.sub.t+ρ max.sub.a Q(s.sub.t+1,a)−Q(s.sub.t,a.sub.t)) [Equation 5]
[0065] In [Equation 5], β is a learning rate and may have a value between 0 and 1 (0≤β≤1). When β=0, the agent 320 may not be trained. ρ is a discount factor and may have a value between 0 and 1 (e.g., 0≤ρ≤1). When ρ=0, future rewards may not be considered. The agent 320 may generate an action variable capable of maximizing a Q-value, e.g., as defined in [Equation 5]. Therefore, when the random number and the reference value ε are in the range from 0 to 1, an action variable may be randomly generated with a probability ε and may be greedily generated with a probability (1-ε). In this regard, the reinforcement learning model 300 may control the transmission power based on Q-learning. An example of operation S670 will be described below with reference to
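The Q-learning update of [Equation 5] can be sketched in tabular form (a sketch; representing the Q-table as a plain dict keyed by (state, action) pairs is an implementation assumption):

```python
def q_update(q_table, state, action, reward, next_state, actions, beta, rho):
    """One tabular Q-learning update per [Equation 5]:
    Q(s_t, a_t) <- Q(s_t, a_t) + beta * (r_t + rho * max_a Q(s_{t+1}, a) - Q(s_t, a_t)).
    beta is the learning rate and rho the discount factor, both in [0, 1]."""
    best_future = max(q_table.get((next_state, a), 0.0) for a in actions)
    old_q = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old_q + beta * (reward + rho * best_future - old_q)
    return q_table[(state, action)]
```

With beta = 0 the Q-value never changes (the agent is not trained), and with rho = 0 only the immediate reward is considered, matching the description of [Equation 5].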
[0066] In operation S690, the reference value ε may be decreased. For example, the agent 320 may decrease the reference value ε by being repeatedly trained. Therefore, as learning progresses, the probability that the random number exceeds the reference value ε may decrease, and thus, the probability that an action variable is randomly generated may decrease. As a result, a rate at which the action variable is randomly generated may be high at the beginning of learning, whereas a rate at which the action variable is greedily generated may become high as the learning progresses. Accordingly, an action variable a.sub.t may be defined as in [Equation 6] below.

a.sub.t=random action in A, if R≤ε; argmax.sub.a∈A Q(s.sub.t,a), if R>ε [Equation 6]
[0067] In [Equation 6], R denotes a random number, and A denotes the range of transmission power as described below with reference to
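The ε-greedy selection and the gradual reduction of ε described above can be sketched as follows (an illustrative sketch; function names, the decay factor, and the floor value are assumptions):

```python
import random

def choose_action(q_table, state, candidates, epsilon, rng=random):
    """Epsilon-greedy selection per [Equation 6]: with probability epsilon pick
    a random candidate from the range A of transmission power; otherwise pick
    the candidate with the largest Q-value (greedy)."""
    if rng.random() <= epsilon:
        return rng.choice(candidates)
    return max(candidates, key=lambda a: q_table.get((state, a), 0.0))

def decay_epsilon(epsilon, factor=0.99, floor=0.01):
    """Gradually reduce epsilon (operation S690) so random exploration
    dominates early training and greedy exploitation dominates later."""
    return max(epsilon * factor, floor)
```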
[0069] Referring to
[0070] In operation S674, a plurality of Q-values may be calculated. For example, the agent 320 may calculate a plurality of Q-values respectively corresponding to a plurality of transmission power candidates included in the range of transmission power set in operation S672. Therefore, a plurality of Q-values respectively corresponding to a plurality of actions in the current state may be calculated.
[0071] In operation S676, a transmission power candidate may be selected. For example, the agent 320 may select one transmission power candidate from among the transmission power candidates included in the range of transmission power based on the Q-values calculated in operation S674. An example of operation S676 will be described below with reference to
[0072] In operation S678, an action variable may be generated and the Q-table may be updated. For example, the agent 320 may generate an action variable corresponding to the transmission power candidate selected in operation S676, generate a Q-value based on the generated action variable, and reflect the generated Q-value in the Q-table.
[0074] In some embodiments, a range A of transmission power may be defined based on detected transmission power of a previous unit period. For example, as shown in
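One way to build the range A is sketched below. Only the requirement that A is based on, and includes, the previous unit period's transmission power comes from the text; the symmetric window of m candidates with a fixed step is an assumption for illustration:

```python
def power_range(p_prev, step, m):
    """Candidate range A of transmission power for the current unit period,
    built around the previous unit period's transmission power p_prev.
    m is the (odd) number of candidates; the range always includes p_prev."""
    half = m // 2
    return [p_prev + step * k for k in range(-half, half + 1)]
```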
[0076] Referring to
[0077] In operation S676_2, a transmission power candidate P.sub.i may be compared with the threshold transmission power P.sub.THR. As shown in
[0078] When the transmission power candidate P.sub.i is equal to or less than the threshold transmission power P.sub.THR, a weight may be applied in operation S676_3. For example, a weight may be applied to a transmission power candidate less than or equal to the threshold transmission power P.sub.THR from among a plurality of transmission power candidates included in the range of transmission power (e.g., a transmission power candidate included in the range B of
[0079] In operation S676_4, the variable i may be compared with the number m of transmission power candidates. As shown in
[0080] In operation S676_6, the transmission power candidate having the largest sum of a Q-value and the weight may be selected. As described above, a Q-value may include an immediate reward and the maximum value of a future reward, and a weight may be selectively applied. A sum of a Q-value and a weight may be calculated for each of a plurality of transmission power candidates included in a range of transmission power, and a transmission power candidate corresponding to the largest sum from among the transmission power candidates may be selected.
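Operations S676_1 through S676_6 can be sketched as a single selection step (an illustrative sketch; the function names and a single scalar weight are assumptions):

```python
def select_candidate(q_table, state, candidates, p_thr, weight):
    """Apply the weight only to transmission power candidates at or below the
    threshold P_THR, then select the candidate with the largest sum of its
    Q-value and the (selectively applied) weight."""
    def score(p):
        bonus = weight if p <= p_thr else 0.0
        return q_table.get((state, p), 0.0) + bonus
    return max(candidates, key=score)
```

The weight biases the selection toward candidates that keep the unit period within the SAR margin, while a sufficiently large Q-value can still justify a candidate above the threshold.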
[0082] The at least one processor 11 may execute a series of instructions. For example, the at least one processor 11 may execute instructions stored in the memory sub-system 15 or the storage 17. Also, the at least one processor 11 may load instructions from the memory sub-system 15 and/or the storage 17 into an internal memory and execute loaded instructions. In some embodiments, the at least one processor 11 may perform at least some of the operations described above with reference to the drawings by executing instructions. In some embodiments, the at least one processor 11 may be and/or include the processing circuitry 160.
[0083] The at least one accelerator 13 may be designed to perform a predefined (and/or otherwise determined) operation at a high speed. For example, the at least one accelerator 13 may load data stored in the memory sub-system 15 and/or the storage 17, and store data generated by processing loaded data into the memory sub-system 15 and/or the storage 17. In some embodiments, the at least one accelerator 13 may perform at least some of the operations described above with reference to the drawings at a high speed. For example, the at least one accelerator 13 may be and/or include a machine learning (ML) (and/or artificial intelligence (AI)) accelerator.
[0084] The memory sub-system 15 may be a non-transitory storage device and may be accessed by the at least one processor 11 and/or the at least one accelerator 13 through the bus 19. In some embodiments, the memory sub-system 15 may include a volatile memory like dynamic random access memory (DRAM) and static random access memory (SRAM) and may also include a non-volatile memory like flash memory and resistive random access memory (RRAM). In some embodiments, the memory sub-system 15 may store instructions and data for performing at least some of the operations described above with reference to the drawings.
[0085] The storage 17 may be a non-transitory storage device and may be configured to not lose stored data even when power supply is cut off. For example, the storage 17 may include a semiconductor memory device like flash memory or any storage medium like a magnetic disk or an optical disc. In some embodiments, the storage 17 may store instructions, a program, and/or data for performing at least some of the operations described above with reference to the drawings.
[0086] While the inventive concepts have been particularly shown and described with reference to some example embodiments thereof, it will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the following claims.