Variable-size sampling method for supporting uniformity confidence under data-streaming environment
10673766 ยท 2020-06-02
Assignee
Inventors
Cpc classification
G06F17/18
PHYSICS
H04L43/00
ELECTRICITY
H04L49/901
ELECTRICITY
H04L47/27
ELECTRICITY
International classification
H04L12/28
ELECTRICITY
G06F17/18
PHYSICS
Abstract
Disclosed is a variable-size sampling method under a data-streaming environment, including: calculating a maximum window size that satisfies a lower limitation of a predetermined uniformity confidence level at all times; inputting a data stream to be sampled; comparing a data stream length input until a current time point with the maximum window size; inspecting a sample size and a sampling fraction if the maximum window size is larger than the data stream length; performing sampling by generating a slot to increase the sample size if the current sample size is smaller than a predetermined percentage (P %) of the data stream; and directly performing sampling without generating a slot if the current sample size is equal to or larger than the predetermined percentage (P %) of the data stream. As a result, degradation of uniformity confidence during variable-size sampling under a real-time streaming environment can be prevented to improve sampling performance.
Claims
1. A variable-size sampling method under a data-streaming environment, comprising: calculating a maximum window size that satisfies a lower limitation () of a predetermined uniformity confidence level at all times; inputting a data stream to be sampled if the maximum window size is calculated; comparing a data stream length of the data stream input until a current time point with the maximum window size; inspecting a sample size and a sampling fraction if the maximum window size is larger than the data stream length; performing sampling by generating a slot to increase the sample size if a current sample size is smaller than a predetermined percentage (P %) of the data stream as a result of the inspection of the sample size and the sampling fraction; and directly performing sampling without generating a slot if the current sample size is equal to or larger than the predetermined percentage (P %) of the data stream as a result of the inspection of the sample size and the sampling fraction.
2. The variable-size sampling method according to claim 1, further comprising: storing the sample generated from a current window in a storage if the maximum window size is smaller than the data stream length; and inspecting the sample size and the sampling fraction by generating a new window.
3. The variable-size sampling method according to claim 2, wherein, if a probability that a data stream element is selected for a slot is higher than a random value generated arbitrarily in performing the sampling, the data stream is inserted into the slot and is replaced with an element stored in an existing slot.
4. The variable-size sampling method according to claim 1, wherein the percentage P % is expressed as p100 where p([0, 1]) denotes the sampling fraction.
5. The variable-size sampling method according to claim 4, wherein the maximum window size satisfying the lower limitation () of the uniformity confidence and the sampling fraction (p) is calculated on the basis of the following formula:
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing and additional features and characteristics of this disclosure will become more apparent from the following detailed description considered with reference to the accompanying drawings, wherein:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
DETAILED DESCRIPTION
(13) Hereinafter, preferred embodiments of the invention will be described in detail with reference to the accompanying drawings. It is noted that like reference numerals denote like elements throughout overall drawings. In addition, descriptions of well-known apparatus and methods may be omitted so as to not obscure the description of the representative embodiments, and such methods and apparatus are clearly within the scope and spirit of the present disclosure.
(14) The terminology used herein is only for the purpose of describing particular embodiments and is not intended to limit the invention. As used herein, the singular forms a, an, and the may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It is further to be noted that, as used herein, the terms comprises, comprising, include, and including indicate the presence of stated features, integers, steps, operations, units, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, units, and/or components, and/or combination thereof.
(15) Unless specified otherwise, all terminologies used herein including technical or scientific terminologies have the same meanings as those generally appreciated by a person ordinarily skill in the art to which the present invention pertains. Terminologies defined in typical dictionaries should be construed to have meanings matching those described in the context of the related art, and should not be construed as being abnormal or excessively formal unless defined apparently herein.
(16) The present invention will now be described with reference to the accompanying drawings, in which like reference numerals denote like elements throughout the entire specification, and they will not be repeatedly described intentionally. In the following description, any specific word or sentence for the related art will not be provided for simplicity purposes if it unnecessarily obscures the subject matter of the invention.
(17) The present invention relates to a variable-size sampling method for supporting high uniformity confidence (UC) under a data-streaming environment.
(18) According to the present invention, the variable-size sampling method under a data-streaming environment is performed by a terminal device such as a computer, or a control unit or processor that comprehensively controls the terminal device such as a computer. That is, the variable-size sampling method under a data-streaming environment according to the invention refers to an algorithm as a sort of software, and the software algorithm may be executed by a control unit or a processor of a terminal device such as a computer.
(19) That is, the variable-size sampling method under a data-streaming environment according to the invention may be implemented by a control unit that comprehensively controls a computer or a central processing unit (CPU) that processes or control command signals and a series of programs. That is, the variable-size sampling method under a data-streaming environment according to the invention may be implemented as an algorithm or logics as a sort of software, and the software algorithm may be executed by a control unit or a central processing unit of a computer.
(20) Herein, p ([0, 1]) denotes a sampling fraction, and P % denotes the sampling fraction expressed as a percentage of p100.
(21) The K-sample method has two problems relating to the uniformity confidence reduction.
(22) First, the initial uniformity confidence reduction problem in which the uniformity confidence remarkably decreases at the initial stage of the K-sample method is caused by the following two reasons.
(23) First, the initial uniformity confidence reduction problem occurs because a range of the data stream selectable for a particular slot is restrictive. In the K-sample method, the sample size is dynamically incremented by one in order to maintain the sampling fraction for the data stream while the sample size is smaller than P % of the data stream. However, in this case, a range of the data stream that can be stored in each slot is restricted.
(24)
(25) In the example of
(26) Since the range of the stream data selectable for a particular slot is restrictive in this manner, the sampling schema can be expressed as illustrated in
(27) A second reason of the initial uniformity confidence reduction is that the data selected as a sample are transferred to the secondary storage and do not change. The stream data included in the extraction range of the sample slot compete with each other for selection as a sample of the corresponding slot. However, if the range of the data stream selectable as a particular slot is finished, and a new slot is generated, the data selected as a sample in the corresponding slot are transferred to the secondary storage, and it is difficult to change the data.
(28) Since it is difficult to change the data selected as a sample for a particular slot, the number of possible samples that can be generated is reduced. In particular, in the K-sample method, since the number of possible samples generable at the initial stage of the sampling in which the sample size increases to 2 or larger remarkably decreases, the uniformity confidence abruptly decreases. The two reasons of the initial uniformity confidence reduction problem can be specifically described as follows.
(29) Sample range restriction: since the range of the stream data selectable for a particular sample slot is restricted, the uniformity confidence decreases.
(30) Unchangeable past sample: if the range of the stream data selectable for a particular sample slot is finished, the data selected as the sample do not change, so that uniformity confidence decreases.
(31) Next, the reason of the steady uniformity confidence reduction problem by which the uniformity confidence steadily decreases as the number of input streams increases in the K-sample method will be described.
(32) The steady uniformity confidence reduction occurs because an increase of the number of possible samples generable in the K-sample method is smaller than an increase of the number of different samples of the same size possible statistically. As the number of input streams increases, a population size also increases. However, it is difficult to include all of the input streams in the sample extraction range because of a memory loss and a sample range restriction. Therefore, a ratio between the number of different samples of the same size possible statistically and the number of possible samples generable in the K-sample method steadily decreases.
(33)
(34) Referring to
(35) If the data stream length increases to 6, the population size increases, so that the number of different samples of the same size possible statistically becomes 15 (=(6|2)). However, due to the memory loss and the sample range restriction, the number of samples generable in the K-sample method becomes 9 (=(3|1)(3|1)). Therefore, the uniformity confidence in a case where the data stream length is set to 6 decreases to 60% (=(3|1)(3|1)/((6|2))100).
(36) In this manner, the uniformity confidence in a case where the data stream length is set to 9 becomes 32.1% (=(3|1)(3|1)(3|1)/((9|3))100). Therefore, as the number of input streams increases, the uniformity confidence gradually decreases. The reason of the steady uniformity confidence reduction problem may be described specifically as follows.
(37) Increase of sample extraction range: since an increase of the number of possible samples generable in the K-sample method is smaller than an increase of the number of different samples of the same size possible statistically, a ratio between the former and latter numbers of possible samples gradually decreases.
(38) First, the requirement for the sample range restriction as a reason of the initial uniformity confidence reduction problem is called sample range expansion. The sample range expansion means that a range of the stream data extractable for a particular sample slot during the sampling is expanded to the elements of the sample already selected. If the population size considered during the sampling increases, the number of generable possible samples also increases. Therefore, the uniformity confidence increases.
(39)
(40) Referring to
(41) In comparison, referring to
(42) A requirement for addressing the unchangeable past sample problem is called a past sample change scheme. In this scheme, the data already selected as a sample is changeable even after a data stream range that can be extracted as a particular element of the sample is finished. If the data not changeable as it has been already selected as a sample can be changed with another data, this means that the number of extractable possible samples increases. Therefore, the uniformity confidence can be improved by introducing the past sample change scheme.
(43) As illustrated in
(44) Finally, a requirement for addressing an increase of the sample extraction range that generates a steady uniformity confidence reduction problem includes use of a UC-based window. In the use of the UC-based window, a window size by which the uniformity confidence of the sample can be maintained to be equal to or larger than a lower limitation at all times is calculated. Then, the sampling is performed by dividing the input streams in the unit of this window size. The following Formula 2 shows a method of calculating such a window size.
(45) The method of calculating the maximum window size that satisfies the sampling fraction p and the uniformity confidence satisfying the lower limitation can be expressed as the following Formula 2.
(46)
(47) where denotes a lower limitation of the uniformity confidence, p denotes a sampling fraction, k denotes an input stream size until the current time point, and m denotes a maximum input stream size allowable when the sample size is incremented by one.
(48) The Formula 2 can be proved as follows.
(49)
(50) Referring to
(51) Assuming that (kp+1) data are extracted as a sample from the (k+m) stream data, the number of different samples of the same size possible statistically becomes (k+m)|(kp+1). Since the UC K-sample method supports the sample range expansion, (kp+1) samples are extracted from (m+kp) stream data. In addition, since the UC K-sample method supports the unchangeable past sample, x samples can be changed out of the kp samples already extracted in the sampling course. If the number m of the newly input stream data is smaller than kp+1, the number x becomes equal to or smaller than (kp+1m). Therefore, the number x has a range equal to or larger than max{0, (kp+1)m} and equal to or smaller than r.
(52) In summary, the number of possible samples generable through the UC K-sample method can be expressed as follows.
(53)
(54) As expressed in Formula 1, since the uniformity confidence is calculated as a ratio between the number of different samples of the same size possible statistically and the number of different samples of the same size possible with the algorithm, the uniformity confidence of the UC K-sample method can be calculated on the basis of Formula 2. Since the uniformity confidence is required to be larger than the lower limitation at all times when the sample size increases, a sum of the maximum numbers k and m satisfying Formula 2 is set as the window size.
(55)
(56) Referring to
(57)
(58) Referring to
(59) First, a maximum window size satisfying the lower limitation of the uniformity confidence defined by a user at all times is calculated (S101).
(60) If the window size is calculated, data streams to be sampled are input (S103).
(61) Then, the window size and the stream length input until the current time point are compared (S105).
(62) If the window size is larger than the stream length, the sample size and the sampling fraction are inspected (S111).
(63) Otherwise, if the window size is smaller than the stream length, the sample generated using the current window is stored in the secondary storage in order to maintain the uniformity confidence at the lower limitation or higher (S107). In addition, a new window is generated (S109). Here, the new window is created in step S109 in order to maintain the uniformity confidence at the lower limitation or higher.
(64) Then, the sample size and the sampling fraction are inspected (S111). That is, in order to maintain the sampling fraction for the data stream, it is checked whether the current sample size is equal to or larger than P % of the data stream whenever a data stream is input (S111).
(65) If the current sample size is smaller than P % of the data stream, a slot for increasing the sample size is additionally generated (S113), and the sampling is performed (S115).
(66) Otherwise, if the current sample size is equal to or larger than P % of the data stream, the sampling is directly performed without generating the slot (S115).
(67) According to the invention, step S113 is performed to increase the sample size by generating a single slot as a single element memory space added to the sample.
(68) Step S115 is a course of sampling for the current slot. If a probability that the data stream element is selected for the slot is higher than a random value arbitrarily generated, this data stream is inserted into the slot and is replaced with the element stored in the existing slot.
(69)
(70) Referring to
(71) If the stream length sLength input until the current time point is 1, or the stream length wLength input to the current window is larger than the window size, the sample generated until the current time point is stored in the secondary storage. Then, sampling starts again by generating a new window (lines 7 to 9).
(72) As a data stream element is input, a random number is generated for each element (line 10).
(73) If the current sample size is smaller than pwLength, the sample size is incremented by one, and the currently incoming data is added to the sample along with the random number in order to maintain the sampling fraction for the data stream size (lines 11 to 13).
(74) Otherwise, if the current sample size is not smaller than (pwLength), an element having the smallest random number out of the sample is compared with the random number of the currently incoming element in order to inspect whether or not the current element can be input to the existing sample (line 16).
(75) If the random number of the current data is larger than the smallest random number of the sample, the element having the smallest random number is removed, and the current data is inserted into the sample (lines 16 and 17). This process is repeated until the stream input operation is completed, or a user interrupts the sampling (lines 4 to 19).
(76) Under a streaming environment in which data are created in real time, it is difficult to perform sampling for all of the data due to a memory constraint problem. Instead, the sampling is performed by using only the data stored in the memory at the current time point as a sample extraction range. Therefore, it is necessary to provide a factor for maintaining the sampling performance. According to the present invention, the uniformity confidence can serve as a criterion for determining sampling performance, and the UC K-sample can be used to remarkably improve the sampling performance under a streaming environment.
(77) Although exemplary embodiments of the present invention have been shown and described, it will be apparent to those having ordinary skill in the art that a number of changes, modifications, or alterations to the invention as described herein may be made, none of which depart from the spirit of the present invention. All such changes, modifications and alterations should therefore be seen as within the scope of the present invention.