Parallel signal processing system and method

11322171 · 2022-05-03

    Abstract

    A system and method for processing a plurality of channels, for example audio channels, in parallel is provided. For example, a plurality of telephony channels are processed in order to detect and respond to call progress tones. The channels may be processed according to a common transform algorithm. Advantageously, a massively parallel architecture is employed, in which operations on many channels are synchronized, to achieve a high efficiency parallel processing environment. The parallel processor may be situated on a data bus, separate from a main general-purpose processor, or integrated with the processor in a common board or integrated device. All, or a portion of a speech processing algorithm may also be performed in a massively parallel manner.

    Claims

    1. A method for processing signals, comprising: (a) receiving data representing a time slice of a stream of time-sequential information for each of a plurality of streams of time-sequential information; (b) automatically performing at least one transform process on the received time slice of the stream of time-sequential information for each of the plurality of streams of time-sequential information, to produce transformed data, with at least one single-instruction, multiple-data type parallel processor having a plurality of processing cores concurrently executing the same at least one transform process for each respective stream of time-sequential information, under a common set of instructions; (c) making at least one decision based on the transformed data of each time slice, with at least one single-instruction, multiple-data type parallel processor having a plurality of processing cores concurrently executing the same at least one transform process for each respective stream of time-sequential information, under a common set of instructions; and (d) communicating information representing the decision through a digital communication interface.

    2. The method according to claim 1, wherein each stream of time-sequential information comprises audio information, and the decision is dependent on audio information within the respective stream of time-sequential information.

    3. The method according to claim 1, wherein the at least one transform comprises a speech recognition primitive.

    4. The method according to claim 1, wherein the decision is made based on information in a single stream of time-sequential information, independent of information contained in the other streams of time-sequential information.

    5. The method according to claim 1, wherein the decision is made by a respective processing core of the at least one single-instruction, multiple-data type parallel processor having a plurality of processing cores for each respective time slice dependent solely on information in that respective time slice.

    6. The method according to claim 1, wherein the at least one transform process is selected from the group consisting of a time-to-frequency domain transform algorithm, a wavelet domain transform algorithm, and a Goertzel filter algorithm.

    7. The method according to claim 1, wherein the decision is made by the at least one single-instruction, multiple-data type parallel processor and represents a determination whether an in-band signal is present in a respective time slice.

    8. The method according to claim 1, wherein the plurality of streams of time-sequential information comprise a plurality of different streams of time-sequential information, each different stream of time-sequential information comprising a stream of audio information which is processed in parallel by the at least one single-instruction, multiple-data type parallel processor, and the decision is made based on the at least one transform process in parallel by the at least one single-instruction, multiple-data type parallel processor having the plurality of processing cores executing concurrently under the common set of instructions.

    9. The method according to claim 8, wherein the common set of instructions controls the at least one single-instruction, multiple-data type parallel processor to perform at least a portion of a speech recognition process.

    10. The method according to claim 1, wherein the common set of instructions comprises program instructions to perform a telephony task.

    11. The method according to claim 1, wherein the at least one transform process comprises a Fourier transform.

    12. The method according to claim 1, wherein the at least one single-instruction, multiple-data type parallel processor comprises a multiprocessor having a common instruction decode unit for the plurality of processing cores, each processing core having a respective arithmetic logic unit, all arithmetic logic units within a respective multiprocessor being adapted to concurrently execute the instructions of the instruction sequence on the time slices of the plurality of streams of time-sequential information representing a plurality of digitized real-time analog channels.

    13. A non-transitory computer readable medium storing instructions for controlling a programmable processor to perform a method, comprising: (a) instructions for receiving data representing a plurality of respective time slices of a plurality of parallel streams of time-sequential information; (b) a common set of transform instructions for concurrently performing at least one transform process on the received plurality of respective time slices of the plurality of parallel streams of time-sequential information in parallel to produce respective transformed data for each respective time slice, with at least one single-instruction, multiple-data type parallel processor having a plurality of processing cores executing concurrently under the common set of transform instructions; (c) a common set of decisional instructions for concurrently making at least one decision based on the transformed data, with the at least one single-instruction, multiple-data type parallel processor having the plurality of processing cores executing concurrently under the common set of decisional instructions; and (d) instructions for communicating information representing the decision through a digital communication interface.

    14. The non-transitory computer readable medium according to claim 13, wherein the instructions for making the at least one decision based on the at least one transform process comprise a common set of decision instructions for the at least one single-instruction, multiple-data type parallel processor for concurrently making the at least one decision on the respective time slices of the plurality of parallel streams of time-sequential information in parallel under the common set of decision instructions.

    15. A system for processing streams of information, comprising: (a) an input port configured to receive data representing a plurality of time slices of a plurality of streams of time-sequential information; (b) at least one single-instruction, multiple-data type parallel processor having a plurality of processing cores synchronized to concurrently execute the same instruction, configured to: perform a transform process on the plurality of time slices to produce transformed data, the transform process being performed by concurrent execution of a common set of transform instructions on the plurality of processing cores; and make at least one decision based on the transformed data of the plurality of time slices, the decision being made by concurrent execution of a common set of decision instructions on the plurality of processing cores; and (c) an output port configured to communicate information representing the decision through a digital communication interface.

    16. The system according to claim 15, wherein: the plurality of streams of time sequential information comprise signals digitized at a sampling rate, and the decision is dependent on values of the signals digitized at the sampling rate.

    17. The system according to claim 16, wherein the plurality of streams of time-sequential information comprise a plurality of audio streams, and the at least one decision comprises a determination of whether an in-band audio signal is present in a respective time slice of a respective stream of time-sequential information.

    18. The system according to claim 15, wherein the common set of instructions are adapted to perform a plurality of concurrent tasks selected from the group consisting of an echo processing task, an audio compression task, an audio decompression task, a packet loss recovery task, a wavelet transform processing task, a combined time domain and frequency domain transform processing task, a speech recognition primitive task, and a stream combining task.

    19. The system according to claim 15, wherein the at least one single-instruction, multiple-data type parallel processor comprises a multiprocessor having a common instruction decode unit for the plurality of processing cores, each processing core having a respective arithmetic logic unit, all arithmetic logic units within a respective multiprocessor being adapted to concurrently execute the respective instructions of the common set of instructions.

    20. The system according to claim 15, wherein the single-instruction, multiple-data type parallel processor comprises a Peripheral Component Interconnect Express (PCIe) interface graphic processing unit of a computer system, which operates under control of a central processing unit and receives the plurality of time slices of a plurality of streams of time-sequential information by communication through the Peripheral Component Interconnect Express (PCIe) interface.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    (1) FIG. 1 is a schematic diagram of a system for implementing the invention.

    (2) FIG. 2 is a flowchart of operations within a host processor.

    (3) FIG. 3 is a schematic diagram showing operations with respect to a massively parallel co-processor.

    DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

    (4) One embodiment of the present invention provides a system and method for analyzing call progress tones and performing other types of audio band processing on a plurality of voice channels, for example in a telephone system. Examples of call progress tone analysis can be found at: www.commetrex.com/products/algorithms/CPA.html; www.dialogic.com/network/csp/appnots/10117_CPA_SR6_HMP2.pdf;

    (5) whitepapers.zdnet.co.uk/0,1000000651,260123088p,00.htm; and

    (6) www.pikatechnologies.com/downloads/samples/readme/6.2%20-%20Call%20Progress%20Analysis%20-%20ReadMe.txt, each of which is expressly incorporated herein by reference.

    (7) In a modest-size system for analyzing call progress tones, there may be hundreds of voice channels to be handled simultaneously. Indeed, the availability of a general-purpose call progress tone processing system permits systems to define non-standard or additional signaling capabilities, thus reducing the need for out of band signaling. Voice processing systems generally require real time performance; that is, connections must be maintained and packets or streams forwarded within narrow time windows, and call progress tones processed within tight specifications.

    (8) An emerging class of telephone communication processing system implements a private branch exchange (PBX) switch, which employs a standard personal computer (PC) as a system processor, and employs software which executes on a general purpose operating system (OS).

    (9) For example, the Asterisk system runs on the Linux OS. More information about Asterisk may be found at Digium/Asterisk, 445 Jan Davis Drive NW, Huntsville, Ala. 35806, 256.428.6000 asterisk.org/downloads. Another such system is: “Yate” (Yet Another Telephony Engine), available from Bd. Nicolae Titulescu 10, Bl. 20, Sc. C, Ap. 128 Sector 1, Bucharest, Romania yate.null.ro/pmwiki/index.php?n=Main.Download.

    (10) In such systems, scalability to desired levels, for example hundreds of simultaneous voice channels, requires that the host processor have sufficient headroom to perform all required tasks within the time allotted. Alternately stated, the tasks performed by the host processor should be limited to those it is capable of completing without contention or undue delay. Because digitized audio signal processing is resource intensive, PC-based systems have typically either not implemented functionality that requires per-channel signal processing, or have offloaded the processing to specialized digital signal processing (DSP) boards. Further, such DSP boards are themselves limited, for example to 8-16 voice channels processed per DSP core, with 4-32 cores per board, although higher density boards are available. These boards are relatively expensive, as compared to the general-purpose PC, and occupy a limited number of bus expansion slots.

    (11) The present invention provides an alternate to the use of specialized DSP processors dedicated to voice channel processing. According to one embodiment, a massively parallel processor as available in a modern video graphics processor (though not necessarily configured as such) is employed to perform certain audio channel processing tasks, providing substantial capacity and versatility. One example of such a video graphics processor is the nVidia Tesla™ GPU, using the CUDA software development platform (“GPU”). This system provides 8 banks of 16 processors (128 processors total), each processor capable of handling a real-time fast Fourier transform (FFT) on 8-16 channels. For example, the FFT algorithm facilitates subsequent processing to detect call progress tones, which may be detected in the massively parallel processor environment, or using the host processor after downloading the FFT data. One particularly advantageous characteristic of implementation of a general purpose FFT algorithm rather than specific call tone detection algorithms is that a number of different call tone standards (and extensions/variants thereof) may be supported, and the FFT data may be used for a number of different purposes, for example speech recognition, etc.

    (12) Likewise, the signal processing is not limited to FFT algorithms, and therefore other algorithms may also or alternately be performed. For example, wavelet-based algorithms may provide useful information.
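    By way of non-limiting illustration, the Goertzel filter named in claim 6 is one such alternative when only a few tone frequencies are of interest. The following is a minimal sketch only, not the implementation of the invention: the function names, the 8 kHz sampling rate, and the 160-sample (20 ms) frame are assumptions chosen for the example. Every channel runs the same fixed-length loop, which is the property a lockstep parallel processor requires.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Squared magnitude of the DFT bin nearest target_hz (Goertzel recursion)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)          # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for x in samples:                               # fixed trip count, no data-dependent branch
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def tone_power_per_channel(frames, sample_rate, target_hz):
    """Apply the identical filter to one frame from each channel."""
    return [goertzel_power(frame, sample_rate, target_hz) for frame in frames]

# Assumed example values: 8 kHz sampling, 20 ms frames, a 400 Hz test tone.
SAMPLE_RATE = 8000
FRAME_LEN = 160
tone = [math.sin(2.0 * math.pi * 400 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
silence = [0.0] * FRAME_LEN
powers = tone_power_per_channel([tone, silence], SAMPLE_RATE, 400)
```

    In this sketch the channel carrying the tone yields a large power value and the silent channel yields zero; a threshold on that value would serve as the per-channel decision.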

    (13) The architecture of the system provides a dynamic link library (DLL) available for calls from the telephony control software, e.g., Asterisk. An application programming interface (API) provides communication between the telephony control software (TCS) and the DLL. This TCS is either unmodified or minimally modified to support the enhanced functionality, which is separately compartmentalized.

    (14) The TCS, for example, executes a process which calls the DLL, causing the DLL to transfer data from a buffer holding, e.g., 2 ms of voice data for, e.g., 800 voice channels, from main system memory of the PC to the massively parallel coprocessor (MPC), which is, for example, an nVidia Tesla™ platform. The DLL has previously uploaded to the MPC the algorithm, which is, for example, a parallel FFT algorithm that operates on all 800 channels simultaneously. The MPC may, for example, also perform tone detection, and produce an output in the MPC memory of the FFT-representation of the 800 voice channels, and possibly certain processed information and flags. After completion, the DLL transfers the information from the MPC memory to PC main memory for access by the TCS or other processes.
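    The buffer handed to the MPC might, for example, be laid out so that a single transfer carries one time slice for every channel. The sketch below is an assumption for illustration only: the text does not specify a memory layout, and the names pack_interleaved and channel_samples are hypothetical. It uses a sample-major (interleaved) arrangement in which element (sample s, channel c) sits at index s * n_channels + c, so that neighboring processing cores handling neighboring channels read neighboring addresses.

```python
from array import array

def pack_interleaved(frames):
    """Pack equal-length per-channel frames into one sample-major int16 buffer."""
    n_channels = len(frames)
    n_samples = len(frames[0])
    if any(len(frame) != n_samples for frame in frames):
        raise ValueError("all channel frames must have the same length")
    buf = array("h", [0]) * (n_channels * n_samples)
    for c, frame in enumerate(frames):
        for s, value in enumerate(frame):
            buf[s * n_channels + c] = value
    return buf, n_channels, n_samples

def channel_samples(buf, n_channels, n_samples, channel):
    """Recover one channel's time slice from the interleaved buffer."""
    return [buf[s * n_channels + channel] for s in range(n_samples)]

# Two channels of three samples each.
demo_buf, demo_channels, demo_samples = pack_interleaved([[1, 2, 3], [10, 20, 30]])
print(list(demo_buf))  # [1, 10, 2, 20, 3, 30]
```

    With 800 channels and 16 samples per channel (2 ms at an assumed 8 kHz), the same routine yields a single 12,800-element buffer per transfer.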

    (15) While the MPC has massive computational power, it has somewhat limited controllability. For example, a bank of 16 DSPs in the MPC is controlled by a single instruction pointer, meaning that the algorithms executing within the MPC are generally not data-dependent in execution, nor do they have conditional-contingent branching, since this would require each thread to execute different instructions, and thus dramatically reduce throughput. Therefore, the algorithms are preferably designed to avoid such processes, and should generally be deterministic and non-data dependent algorithms. On the other hand, it is possible to perform contingent or data-dependent processing, though the gains from the massively parallel architecture are then limited, and thus channel specific processing is possible. Advantageously, implementations of the FFT algorithm are employed which meet the requirements for massively parallel execution. For example, the CUDA™ technology environment from nVidia provides such algorithms. Likewise, post processing of the FFT data to determine the presence of tones poses a limited burden on the processor(s), and need not be performed under massively parallel conditions. This tone extraction process may therefore be performed on the MPC or the host PC processor, depending on respective processing loads and headroom.
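    The throughput penalty of data-dependent branching can be seen in a deliberately simplified cost model, offered here as an assumption rather than a description of any particular processor: when the processors of a bank share one instruction pointer, each distinct code path taken by any processor is executed in turn while the others idle. The path names and cycle counts below are hypothetical.

```python
def lockstep_cycles(lane_paths, path_cost):
    """Cycles for one bank when every distinct path taken is executed serially."""
    return sum(path_cost[path] for path in set(lane_paths))

# Hypothetical per-path costs, in arbitrary cycles.
PATH_COST = {"tone": 10, "silence": 4, "speech": 25}

uniform = ["tone"] * 16                                   # all 16 lanes agree
mixed = ["tone"] * 8 + ["silence"] * 4 + ["speech"] * 4   # lanes diverge three ways

print(lockstep_cycles(uniform, PATH_COST))  # 10
print(lockstep_cycles(mixed, PATH_COST))    # 39
```

    Under this model the divergent bank takes nearly four times as long for the same number of channels, which is why deterministic, non-data-dependent algorithms are preferred.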

    (16) In general, the FFT itself should be performed in a faster-than-real-time manner. For example, it may be desired to implement overlapping FFTs, e.g., examining 2 ms of data every 1 ms, including memory-to-memory transfers and associated processing. Thus, for example, it may be desired to complete the FFT of 2 ms of data on the MPC within 0.5 ms. Assuming, for example, a sampling rate of 8.4 kHz, and an upper frequency within a channel of 3.2-4 kHz, the 2 ms sample would generally imply a 256-point FFT, which can be performed efficiently and quickly on the nVidia Tesla™ platform, including any required windowing and post processing.
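    The arithmetic behind these figures can be worked through as follows. The helper name window_budget is hypothetical, and the sketch simply takes the quoted numbers at face value. It shows that a 2 ms window at 8.4 kHz holds about 17 samples, so a 256-point transform would presumably rely on zero-padding or on a longer analysis window; the text does not specify which.

```python
def window_budget(sample_rate_hz, window_ms, hop_ms, fft_points):
    """Derive basic framing quantities from the sampling and window parameters."""
    return {
        "samples_per_window": sample_rate_hz * window_ms / 1000.0,
        "overlap_fraction": 1.0 - hop_ms / window_ms,
        "bin_spacing_hz": sample_rate_hz / fft_points,
        "nyquist_hz": sample_rate_hz / 2.0,
        "fft_span_ms": 1000.0 * fft_points / sample_rate_hz,
    }

# Figures quoted in the text: 8.4 kHz sampling, 2 ms of data every 1 ms, 256 points.
budget = window_budget(8400, 2.0, 1.0, 256)
for name, value in budget.items():
    print(f"{name}: {value:.4f}")
```

    The Nyquist frequency of 4.2 kHz is consistent with the stated 3.2-4 kHz upper channel frequency, and examining 2 ms every 1 ms corresponds to 50% overlap between successive windows.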

    (17) Therefore, the use of the present invention permits the addition of call progress tone processing and other per channel signal processing tasks to a PC based TCS platform without substantially increasing the processing burden on the host PC processor, and generally permits such a platform to add generic call progress tone processing features and other per channel signal processing features without substantially limiting scalability.

    (18) Other sorts of parallel real time processing are also possible, for example analysis of distributed sensor signals such as “Motes” or the like. See, en.wikipedia.org/wiki/Smartdust. The MPC may also be employed to perform other telephony tasks, such as echo cancellation, conferencing, tone generation, compression/decompression, caller ID, interactive voice response, voicemail, packet processing and packet loss recovery algorithms, etc.

    (19) Similarly, simultaneous voice recognition can be performed on hundreds of simultaneous channels, for instance in the context of directing incoming calls based on customer responses at a customer service center. Advantageously, in such an environment, processing of particular channels may be switched between banks of multiprocessors, depending on the processing task required for the channel and the instructions being executed by the multiprocessor. Thus, to the extent that the processing of a channel is data dependent, but the algorithm has a limited number of different paths based on the data, the MPC system may efficiently process the channels even where the processing sequence and instructions for each channel are not identical.
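    One way such switching between banks might be arranged is sketched below, as an illustration under stated assumptions: the function assign_banks, the bank size of 16, and the path labels are all hypothetical. Channels are bucketed by the code path they currently require, and each bank is filled only with channels on the same path, so that every bank executes uniform instructions.

```python
from collections import defaultdict

def assign_banks(channel_paths, bank_size=16):
    """Group channel ids so that each bank holds channels on a single code path."""
    by_path = defaultdict(list)
    for channel_id, path in channel_paths.items():
        by_path[path].append(channel_id)
    banks = []
    for path in sorted(by_path):
        ids = by_path[path]
        for start in range(0, len(ids), bank_size):
            banks.append((path, ids[start:start + bank_size]))
    return banks

# Forty channels, each on one of two hypothetical recognition paths.
paths = {ch: ("menu" if ch % 3 == 0 else "digits") for ch in range(40)}
banks = assign_banks(paths)
print([(path, len(ids)) for path, ids in banks])
# [('digits', 16), ('digits', 10), ('menu', 14)]
```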

    (20) FIG. 1 shows a schematic of a system for implementing the invention.

    (21) Massively multiplexed voice data 101 is received at network interface 102. The network could be a LAN, Wide Area Network (WAN), Primary Rate ISDN (PRI), a traditional telephone network with Time Division Multiplexing (TDM), or any other suitable network. This data may typically include hundreds of channels, each carrying a separate conversation and also routing information. The routing information may be in the form of in-band signaling of dual-tone multi-frequency (DTMF) audio tones received from a telephone keypad or DTMF generator. The channels may be encoded using digital sampling of the audio input prior to multiplexing. Typically voice channels will come in 20 ms frames.
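    For reference, DTMF signaling encodes each key as one low-group (row) frequency plus one high-group (column) frequency. The sketch below maps per-frequency power estimates, such as might be produced by an FFT or a bank of tone filters, to a key. The frequency table is the standard DTMF assignment; the function name dtmf_key and the min_power threshold are assumptions for illustration.

```python
ROW_HZ = (697, 770, 852, 941)        # low-group frequencies
COL_HZ = (1209, 1336, 1477, 1633)    # high-group frequencies
KEYPAD = ("123A", "456B", "789C", "*0#D")

def dtmf_key(power_by_hz, min_power=1.0):
    """Return the key for the strongest row/column pair, or None if too weak."""
    row = max(ROW_HZ, key=lambda hz: power_by_hz.get(hz, 0.0))
    col = max(COL_HZ, key=lambda hz: power_by_hz.get(hz, 0.0))
    if power_by_hz.get(row, 0.0) < min_power or power_by_hz.get(col, 0.0) < min_power:
        return None
    return KEYPAD[ROW_HZ.index(row)][COL_HZ.index(col)]

print(dtmf_key({770: 50.0, 1336: 40.0, 697: 0.2}))  # 5
```

    A practical detector would typically add further checks, for example on the relative level of the two tones and on tone duration, which are omitted here.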

    (22) The system according to a preferred coprocessor embodiment includes at least one host processor 103, which may be programmed with telephony software such as Asterisk or Yate, cited above. The host processor may be of any suitable type, such as those found in PCs, for example Intel Core 2 Duo or Quad, or AMD Athlon X2. The host processor communicates with MPC 105 via shared memory 104, which is, for example, 2 GB or more of DDR2 or DDR3 memory.

    (23) Within the host processor, application programs 106 receive demultiplexed voice data from interface 102, and generate service requests for services that cannot be, or preferably are not, processed in real time within the host processor itself. These service requests are stored in a service request queue 107. A service calling module 108 organizes the service requests from the queue 107 for presentation to the MPC 105.

    (24) The module 108 also reports results back to the user applications 106, which in turn put processed voice data frames back on the channels in real time, such that the next set of frames coming in on the channels 101 can be processed as they arrive.

    (25) FIG. 2 shows a process within module 108. In this process, a timing module 201 keeps track of a predetermined real time delay constraint. Since standard voice frames are 20 ms long, this constraint should be significantly less than that to allow operations to be completed in real time. A 5-10 ms delay would very likely be sufficient; however, a 2 ms delay would give a degree of comfort that real time operation will be assured. Then, at 202, blocks of data requesting service are organized into the queue or buffer. At 203, the service calling module examines the queue to see what services are currently required. Some MPCs, such as the nVidia Tesla™ C870 GPU, require that each processor within a multiprocessor of the MPC perform the same operations in lockstep. For such MPCs, it will be necessary to choose all requests for the same service at the same time. For instance, all requests for an FFT should be grouped together and requested at once. Then all requests for a Mix operation might be grouped together and requested after the FFTs are completed, and so forth. The MPC 105 will perform the services requested and return the results to shared memory 104. At 204, the service calling module will retrieve the results from shared memory and at 205 will report the results back to the application program. At 206, it is tested whether there is more time and whether more services are requested. If so, control returns to element 202. If not, at 207, the MPC is triggered to sleep (or be available to other processes) until another time interval determined by the real time delay constraint is begun. FIG. 3 shows an example of running several processes on data retrieved from the audio channels. The figure shows the shared memory 104 and one of the processors 302 from the MPC 105. The processor 302 first retrieves one or more blocks from the job queue or buffer 104 that are requesting an FFT and performs the FFT on those blocks. The other processors within the same multiprocessor array of parallel processors are instructed to do the same thing at the same time (on different data). After completion of the FFT, more operations can be performed. For instance, at 304 and 305, the processor 302 checks shared memory 104 to see whether more services are needed. In the examples given, mixing 304 and decoding 305 are requested by module 109, sequentially. Therefore, these operations are also performed on data blocks retrieved from the shared memory 104. The result or results of each operation are placed in shared memory upon completion of the operation, where those results are retrievable by the host processor.
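    The grouping and timing steps of FIG. 2 may be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the request dictionaries, the names batch_requests and run_interval, the service order, and the injected submit and clock callables are all hypothetical. Requests are grouped so that each submission to the MPC contains a single service type, and submission stops once the interval's time budget is exhausted.

```python
import time
from collections import OrderedDict

SERVICE_ORDER = ("FFT", "MIX", "DEC")   # assumed ordering of service types

def batch_requests(queue, service_order=SERVICE_ORDER):
    """Group queued requests so each batch holds exactly one service type."""
    batches = OrderedDict((service, []) for service in service_order)
    for request in queue:
        batches[request["service"]].append(request)   # KeyError on unknown service
    return [(service, reqs) for service, reqs in batches.items() if reqs]

def run_interval(queue, submit, budget_s=0.002, clock=time.monotonic):
    """Submit one batch per service until the time budget is spent.

    Returns the requests that could not be submitted within the budget.
    """
    deadline = clock() + budget_s
    leftover = []
    for service, reqs in batch_requests(queue):
        if clock() >= deadline:
            leftover.extend(reqs)
            continue
        submit(service, reqs)
    return leftover
```

    The submit callable stands in for the transfer to, and execution on, the MPC; requests returned as leftover would be retried in a later interval.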

    (26) In the case of call progress tones, these three operations together: FFT, mixing, and decoding, will determine the destination of a call associated with the block of audio data for the purposes of telephone switching.

    (27) If module 108 sends more requests for a particular service than can be accommodated at once, some of the requests will be accumulated in a shared RAM 109 to be completed in a later processing cycle. The MPC will be able to perform multiple instances of the requested service within the time constraints imposed by the loop of FIG. 2. Various tasks may be assigned priorities, or deadlines, and therefore the processing of different services may be selected for processing based on these criteria, and need not be processed in strict order.
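    One possible selection rule is sketched below, purely as an assumption: the text does not specify how priorities and deadlines are combined. Here the earliest deadline is served first, with higher priority breaking ties, and any requests beyond the cycle's capacity are deferred to a later cycle. The field names deadline_ms and priority and the function select_for_cycle are hypothetical.

```python
import heapq

def select_for_cycle(pending, capacity):
    """Choose up to `capacity` requests; return (selected, deferred)."""
    # The index i breaks ties so that request dicts are never compared directly.
    keyed = [(r["deadline_ms"], -r["priority"], i, r) for i, r in enumerate(pending)]
    heapq.heapify(keyed)
    selected = [heapq.heappop(keyed)[3] for _ in range(min(capacity, len(keyed)))]
    deferred = [item[3] for item in sorted(keyed)]
    return selected, deferred

pending = [
    {"id": "a", "deadline_ms": 20, "priority": 1},
    {"id": "b", "deadline_ms": 5, "priority": 1},
    {"id": "c", "deadline_ms": 5, "priority": 9},
    {"id": "d", "deadline_ms": 10, "priority": 0},
]
selected, deferred = select_for_cycle(pending, capacity=2)
print([r["id"] for r in selected], [r["id"] for r in deferred])
# ['c', 'b'] ['d', 'a']
```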

    (28) It is noted that the present invention is not limited to nVidia Tesla® parallel processing technology, and may make use of various other technologies. For example, the Intel Larrabee GPU technology, which parallelizes a number of P54C processors, may also be employed, as well as ATI CTM technology (ati.amd.com/technology/streamcomputing/index.html, ati.amd.com/technology/streamcomputing/resources.html, each of which, including linked resources, is expressly incorporated herein by reference), and other known technologies.

    (29) The following is some pseudo code illustrating embodiments of the invention as implemented in software. The disclosure of a software embodiment does not preclude the possibility that the invention might be implemented in hardware.

    Embodiment 1

    (30) The present example provides computer executable code, which is stored in a computer readable medium, for execution on a programmable processor, to implement an embodiment of the invention. The computer is, for example, an Intel dual core processor-based machine, with one or more nVidia Tesla® compatible cards in PCIe x16 slots, for example, nVidia C870 or C1060 processor. The system typically stores executable code on a SATA-300 interface rotating magnetic storage media, i.e., a so-called hard disk drive, though other memory media, such as optical media, solid state storage, or other known computer readable media may be employed. Indeed, the instructions may be provided to the processors as electromagnetic signals communicated through a vacuum or conductive or dielectric medium. The nVidia processor typically relies on DDR3 memory, while the main processor typically relies on DDR2 memory, though the type of random-access memory is non-critical. The telephony signals for processing may be received over a T1, T3, optical fiber, Ethernet, or other communications medium and/or protocol.

    (31) Data Structures to be Used by Module 108

    (32) RQueueType Structure // Job Request Queue

    (33) ServiceType

    (34) ChannelID // Channel Identifier

    (35) VoiceData // Input Data

    (36) Output // Output Data

    (37) End Structure

    (38) // This embodiment uses a separate queue for each type of service to be requested.

    (39) // The queues have 200 elements in them. This number is arbitrary and could be adjusted

    (40) // by the designer depending on anticipated call volumes and numbers of processors available

    (41) // on the MPC. Generally, the number does not have to be as large as the total of number

    (42) // of simultaneous calls anticipated, because not all of those calls will be requesting services

    (43) // at the same time.

    (44) RQueueType RQueueFFT[200] // Maximum of 200 Requests FFT

    (45) RQueueType RQueueMIX[200] // Maximum of 200 Requests MIX

    (46) RQueueType RQueueENC[200] // Maximum of 200 Requests ENC

    (47) RQueueType RQueueDEC[200] // Maximum of 200 Requests DEC

    (48) Procedures to be Used by Module 108

    (49) // Initialization Function

    (50) Init:
        Initialize Request Queue
        Initialize Service Entry
        Start Service Poll Loop

    // Service Request Function

    (51) ReqS:
        Case ServiceType
            FFT:
                Lock RQueueFFT
                Insert Service Information into RQueueFFT
                Unlock RQueueFFT
            MIX:
                Lock RQueueMIX
                Insert Service Information into RQueueMIX
                Unlock RQueueMIX
            ENC:
                Lock RQueueENC
                Insert Service Information into RQueueENC
                Unlock RQueueENC
            DEC:
                Lock RQueueDEC
                Insert Service Information into RQueueDEC
                Unlock RQueueDEC
        End Case
        Wait for completion of Service
        Return output

    // Service Poll Loop
    // This loop is not called by the other procedures. It runs independently. It will keep track of
    // where the parallel processors are in their processing. The host will load all the requests for a
    // particular service into the buffer. Then it will keep track of when the services are completed
    // and load new requests into the buffer.
    SerPL:
    Get timestamp and store in St

    (52) // Let's do FFT/FHT

    (53) Submit RQueueFFT with FFT code to GPU

    (54) For each element in RQueueFFT
             Signal Channel of completion of service

    (55) End For

    (56) // Let's do mixing

    (57) Submit RQueueMIX with MIXING code to GPU

    (58) For each element in RQueueMIX
             Signal Channel of completion of service

    (59) End For

    (60) // Let's do encoding

    (61) Submit RQueueENC with ENCODING code to GPU

    (62) For each element in RQueueENC
             Signal Channel of completion of service

    (63) End For

    (64) // Let's do decoding

    (65) Submit RQueueDEC with DECODING code to GPU

    (66) For each element in RQueueDEC
             Signal Channel of completion of service

    (67) End For

    (68) // Make sure it takes the same amount of time for every pass

    (69) Compute elapsed time as the difference between now and St

    (70) Sleep for the remainder of the fixed pass interval (interval minus elapsed time)

    (71) Goto SerPL // next pass

    (72) Examples of Code in Application Programs 106 for Calling the Routines Above

    (73) Example for Calling “Init”

    (74) // we have to initialize PStar before we can use it

    (75) Call Init

    (76) Example for Requesting an FFT

    (77) // use FFT service for multitone detection

    (78) Allocate RD as RQueueType

    (79) RD.ServiceType=FFT

    (80) RD.ChannelID=Current Channel ID

    (81) RD.VoiceData=Voice Data

    (82) Call ReqS(RD)

    (83) Scan RD.Output for presence of our tones

    (84) Example for Requesting Encoding

    (85) // use Encoding service

    (86) Allocate RD as RQueueType

    (87) RD.ServiceType=ENC

    (88) RD.ChannelID=Current Channel ID

    (89) RD.VoiceData=Voice Data

    (90) Call ReqS(RD)

    (91) // RD.Output contains encoded/compressed data

    (92) Example for Requesting Decoding

    (93) // use Decoding service

    (94) Allocate RD as RQueueType

    (95) RD.ServiceType=DEC

    (96) RD.ChannelID=Current Channel ID

    (97) RD.VoiceData=Voice Data

    (98) Call ReqS(RD)

    (99) // RD.Output contains decoded data
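    The pseudo code of Embodiment 1 may be rendered in a runnable form roughly as follows. This is a sketch under stated assumptions and not the actual implementation: the class and method names are hypothetical, the "kernels" are ordinary Python functions standing in for code uploaded to the MPC, and the 2 ms pass interval is taken from the discussion of FIG. 2. One queue and one lock are kept per service type, the requesting routine blocks until its request is signaled complete, and the poll loop sleeps for the remainder of each fixed interval.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Request:
    service_type: str
    channel_id: int
    voice_data: Any
    output: Any = None
    done: threading.Event = field(default_factory=threading.Event)

class ServiceModule:
    SERVICES = ("FFT", "MIX", "ENC", "DEC")

    def __init__(self, kernels, interval_s=0.002):
        self.kernels = kernels                    # service name -> batch function
        self.interval_s = interval_s
        self.queues = {s: [] for s in self.SERVICES}
        self.locks = {s: threading.Lock() for s in self.SERVICES}

    def submit(self, req):
        """Insert a request into the queue for its service type."""
        with self.locks[req.service_type]:
            self.queues[req.service_type].append(req)
        return req

    def req_s(self, req, timeout=None):
        """ReqS: enqueue, wait for completion of service, return output."""
        self.submit(req)
        if not req.done.wait(timeout):
            raise TimeoutError("service not completed in time")
        return req.output

    def poll_once(self):
        """One pass of the poll loop; returns the elapsed time in seconds."""
        start = time.monotonic()
        for service in self.SERVICES:
            with self.locks[service]:
                batch, self.queues[service] = self.queues[service], []
            if not batch:
                continue
            outputs = self.kernels[service]([r.voice_data for r in batch])
            for req, out in zip(batch, outputs):
                req.output = out
                req.done.set()                    # signal channel of completion
        return time.monotonic() - start

    def run(self, stop):
        """SerPL: repeat passes at a fixed interval until `stop` is set."""
        while not stop.is_set():
            elapsed = self.poll_once()
            time.sleep(max(0.0, self.interval_s - elapsed))

# Stand-in kernels: each simply sums the samples of every request in the batch.
STUB_KERNELS = {s: (lambda batch: [sum(d) for d in batch]) for s in ServiceModule.SERVICES}
```

    An application thread would call req_s(Request("FFT", channel_id, voice_data)) while a separate thread executes run; the batch for each service is drained under its lock so that requests arriving during processing wait for the next pass.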

    Embodiment 2

    (100) The second embodiment may employ similar hardware to Embodiment 1.

    (101) // This embodiment is slower, but also uses less memory, than Embodiment 1 above

    (102) Data Structures to be Used by Module 108

    (103) RQueueType Structure // Job Request Queue
        ServiceType
        ChannelID // Channel Identifier
        VoiceData // Input Data
        Output // Output Data

    (104) End Structure

    (105) // This embodiment uses a single queue, but stores other data in a temporary queue

    (106) // when the single queue is not available. This is less memory intensive, but slower.

    (107) RQueueType RQueue[200] // Maximum of 200 Requests

    (108) Procedures to be Used by Module 108

    (109) // Initialization Function

    (110) Init:
        Initialize Request Queue
        Initialize Service Entry
        Start Service Poll Loop

    (111) // Service Request Function

    (112) ReqS:
        Lock RQueue
        Insert Service Information into RQueue
        Unlock RQueue
        Wait for completion of Service
        Return output

    (113) // Service Poll Loop

    (114) // to run continuously

    (115) SerPL: Get timestamp and store in St // Let's do FFT/FHT For all element in RQueue where SerivceType=FFT Copy Data To TempRQueue End For Submit TempRQueue with FFT code to GPU For all element in TempRQueue Move TempRQueue.output to RQueue.output Signal Channel of completion of service End For // Let's do mixing For all element in RQueue where SerivceType=MIXING Copy Data To TempRQueue End For Submit TempRQueue with MIXING code to GPU For all element in RQueue Move TempRQueue.output to RQueue.output Signal Channel of completion of service End For // Let's do encoding For all element in RQueue where SerivceType=ENCODE Copy Data To TempRQueue End For Submit TempRQueue with ENCODING code to GPU For all element in RQueue Move TempRQueue.output to RQueue.output Signal Channel of completion of service End For // Let's do decoding For all element in RQueue where SerivceType=DECODE Copy Data To TempRQueue End For Submit TempRQueue with DECODING code to GPU For all element in RQueue Move TempRQueue.output to RQueue.output Signal Channel of completion of service End For // Make sure it takes the same amount of time for every pass Compute time difference between now and St Sleep that amount of time Goto SerPL // second pass
    Examples of Code in the Application Programs 106 for Calling the Routines Above
    Example for Calling “Init”

    (116) // we have to initialize PStar before we can use it

    (117) Call Init

    (118) Example for Calling “FFT”

    (119) // use FFT service for multitone detection

    (120) Allocate RD as RQueueType

    (121) RD.ServiceType=FFT

    (122) RD.ChannelID=Current Channel ID

    (123) RD.VoiceData=Voice Data

    (124) Call ReqS(RD)

    (125) Scan RD.Output for presence of our tones
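The final scanning step could look like the following Python sketch, which checks whether the spectral bins nearest each target frequency are strong relative to the largest bin. Everything specific here is an assumption for illustration: the 8 kHz sample rate, the tone table (frequency pairs from the North American tone plan), the threshold ratio, and the naive DFT helper that stands in for the transform output returned in RD.Output.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; assumed telephony sample rate

# Illustrative call progress tones as frequency pairs in Hz.
TONE_TABLE = {
    "dial": (350, 440),
    "busy": (480, 620),
    "ringback": (440, 480),
}


def magnitude_spectrum(samples):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2-1 (stand-in for the GPU FFT)."""
    n = len(samples)
    twiddle = [cmath.exp(-2j * math.pi * i / n) for i in range(n)]
    return [
        abs(sum(x * twiddle[(k * i) % n] for i, x in enumerate(samples)))
        for k in range(n // 2)
    ]


def detect_tones(magnitudes, sample_rate, n_samples,
                 tone_table=TONE_TABLE, threshold_ratio=0.25):
    """Return names of tones whose component bins all exceed the threshold."""
    peak = max(magnitudes)
    if peak == 0:
        return []  # silence: nothing to detect

    def present(freq):
        k = round(freq * n_samples / sample_rate)  # nearest DFT bin
        return k < len(magnitudes) and magnitudes[k] >= threshold_ratio * peak

    return [name for name, freqs in tone_table.items()
            if all(present(f) for f in freqs)]
```

With 800 samples at 8 kHz, each bin spans 10 Hz, so every frequency in the table falls exactly on a bin. A real detector would also need to handle leakage, noise, and tone cadence.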

    (126) Example for Calling Encoding

    (127) // use Encoding service

    (128) Allocate RD as RQueueType

    (129) RD.ServiceType=ENCODE

    (130) RD.ChannelID=Current Channel ID

    (131) RD.VoiceData=Voice Data

    (132) Call ReqS(RD)

    (133) // RD.Output contains encoded/compressed data

    (134) Example for Calling Decoding

    (135) // use Decoding service

    (136) Allocate RD as RQueueType

    (137) RD.ServiceType=DECODE

    (138) RD.ChannelID=Current Channel ID

    (139) RD.VoiceData=Voice Data

    (140) Call ReqS(RD)

    (141) // RD.Output contains decoded data

    (142) While the embodiment discussed above uses a separate host and massively parallel processing array, it is clear that the processing array may also execute general purpose code and support general purpose or application-specific operating systems, albeit with reduced efficiency as compared to an unbranched signal processing algorithm. Therefore, it is possible to employ a single processor core and memory pool, thus reducing system cost and simplifying system architecture. Indeed, one or more multiprocessors may be dedicated to signal processing, and other(s) to system control, coordination, and logical analysis and execution. In such a case, the functions identified above as being performed in the host processor would be performed in the array, and, of course, the transfers across the bus separating the two would not be required.

    (143) From a review of the present disclosure, other modifications will be apparent to persons skilled in the art. Such modifications may involve other features which are already known in the design, manufacture and use of telephony engines and parallel processing and which may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present application also includes any novel feature or novel combination of features disclosed herein either explicitly or implicitly or any generalization thereof, whether or not it mitigates any or all of the same technical problems as does the present invention. The applicants hereby give notice that new claims may be formulated to such features during the prosecution of the present application or any further application derived therefrom.

    (144) The words “comprising”, “comprise”, and “comprises” as used herein should not be viewed as excluding additional elements. The singular article “a” or “an” as used herein should not be viewed as excluding a plurality of elements. The word “or” should be construed as an inclusive or, in other words as “and/or”.