GPU Advice · [Deep Sky] Processing techniques · Timothy Martin

AccidentalAstronomers 18.64
If money were no object, what graphics card would you get for a processing computer? I'm working with the guys at Puget Systems and they recommended dual RTX 4090s with 24GB each. Will tensorflow.dll even take advantage of two 4090s? Or is there a better option?
smcx 3.61
I would wait for the next generation.

Or if waiting really isn't an option, an RTX 6000.
ChuckNovice 8.21
A properly configured StarXTerminator and BlurXTerminator will make use of the GPU/CUDA and process a massive 9576x6388 pixel image (ASI6200MM) in under a minute with a 2080 Ti.

I see few use cases for 2x 4090:
- You're a deep learning researcher/enthusiast who writes/uses models.
- You make videos and need a rendering farm.
- You're a gamer and don't care that one of the two GPUs does next to nothing in most games.

If money really isn't an issue and you really like tech, then yes, 2x 4090 is a massive beast, but I would add a capable air conditioner to the bill.

To answer your tensorflow question more specifically: when you write your model, there are a few different strategies you can tell TensorFlow to use, including distributing work across multiple GPUs. So yes, TensorFlow is very good at distributed work; in fact, that's one of the reasons it has long been a favorite for production use. However, if you're using someone else's model/code, there is no guarantee they wrote it in a way that supports multiple GPUs without major modifications.
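To make that concrete, here's a minimal sketch of what a model author has to do to opt into multi-GPU work in TensorFlow (assuming the Keras API; the tiny model is just a placeholder, not anything from the XTerminator tools):

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and
# splits each training batch across them (data parallelism).
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)  # 2 with dual 4090s

with strategy.scope():
    # Variables created inside the scope are mirrored on each GPU;
    # the placeholder model below is for illustration only.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) now distributes each batch across both cards.
```

If the code you're running never sets up a strategy like this, the second 4090 will mostly sit idle.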
KGoodwin 4.71
My 4070 Super handles drizzled 2x images with all the GPU-enabled processes in PixInsight pretty quickly. I certainly would not spend a bunch of money on GPUs for astro processing right now. I'd get even more RAM (I have 128GB), more and larger NVMe drives, and a Threadripper or similar processor with tons of cores instead. At least with PixInsight, GPUs just don't make enough of a difference to justify anything more than a midrange one.
John.Dziuba 1.51
Hi Tim. For now anyway, it is only the AI processes like StarX and BlurX that really benefit from a powerful GPU.

I recently added a rebuilt RTX 3090 for this reason, which is an economical option. On my hefty Threadripper machine, a StarXTerminator job on a full-frame 1x1 used to take about 120 seconds. It now takes about 20 seconds.

Perhaps the latest-generation RTX would cut that to 15 seconds? The point is, IMO spending a lot of money on the latest GPU will not buy you a meaningful performance gain.

JD
AccidentalAstronomers 18.64
Topic starter
John Dziuba:
The point is, IMO spending a lot of money on the latest GPU will not buy you a meaningful performance gain.


Then I'll probably go with a single 4090. My current machine has a 3090 in it and it does okay. But faster is always better. The biggest time-suck by far is stacking. I'm looking at a 96-core Threadripper with 768GB of RAM, which should help that considerably.
KGoodwin 4.71
Timothy Martin:
The biggest time-suck by far is stacking. I'm looking at a 96-core Threadripper with 768GB of RAM, which should help that considerably.


Are you doing lucky imaging? The more time I spend on a project the more subs I have to stack, but also I do it less frequently since each project takes longer to complete. I don’t mind so much if I need to wait a couple hours for stacking once a week or every other week.
John.Dziuba 1.51
Timothy Martin:
John Dziuba:
The point is, IMO spending a lot of money on the latest GPU will not buy you a meaningful performance gain.


Then I'll probably go with a single 4090. My current machine has a 3090 in it and it does okay. But faster is always better. The biggest time-suck by far is stacking. I'm looking at a 96-core Threadripper with 768GB of RAM, which should help that considerably.

That will do it. I have 128 cores on a Linux machine. RAM is always good, but 768GB may be extreme unless you are working with medium format images. I have 256GB and I have never seen more than 180GB or so used at any one time during a large WBPP job.
KGoodwin 4.71
You guys must really be stacking a lot of images. 1000+ maybe? I think your PixInsight computers must cost a fortune. 128 cores or 96 cores…768GB of RAM…I have 20 cores and feel it's pretty speedy, definitely not feeling slow enough that I'd want to drop $15k+ on a computer that will depreciate to 0 in 5 years. Stacking 800 or so subs has not made me want to cry yet on 20 cores and 128GB of RAM.
AccidentalAstronomers 18.64
Topic starter
Kyle Goodwin:
Are you doing lucky imaging? The more time I spend on a project the more subs I have to stack, but also I do it less frequently since each project takes longer to complete. I don’t mind so much if I need to wait a couple hours for stacking once a week or every other week.


I've got a C11 here at home that I'm doing lucky imaging with. But that's not my concern, really. I've got a travel scope as well as three remote rigs. Even gathering 30+ hours on each target, it starts to pile up. And ironically, the small scope data takes the longest to stack--sometimes 2 or more days with 1,400 subs or more--because it's 2X drizzled and I'm applying LNC to convergence, which takes forever.
AccidentalAstronomers 18.64
Topic starter
John Dziuba:
RAM is always good, but 768GB may be extreme unless you are working with medium format images.


In fact, I'm running two Moravian C5s--medium format sensors. And I'm running one C3 and one ASI6200--both full-frame sensors. Both data sets from the C3 and the ASI6200 get drizzled 2X, so that gets to be pretty large.
KGoodwin 4.71
Alright, yeah you’re working at a whole different scale from what I considered possible for amateurs and I can see why you’d want to go faster.
AccidentalAstronomers 18.64
Topic starter
Kyle Goodwin:
You guys must really be stacking a lot of images.  1000+ maybe?


Not always, but often. I've been shooting Markarian's Chain since early February with the TOA. 2,660 subs and 184.3 hours so far--128 hours of that Ha. I'm hoping to get to 150 hours of Ha before it drops below the horizon. In a good winter week, with four scopes going, I could easily collect 200 hours of data total. So yes, it's a lot.
AccidentalAstronomers 18.64
Topic starter
Kyle Goodwin:
not feeling slow enough that I'd want to drop $15k+ on a computer that will depreciate to 0 in 5 years.


Five years is a long time to me. I'll be 70 in five years. Money, I've got. Time is another matter. Saving a few minutes here and there may not seem like much, but it adds up. As I get closer to the septuagenarian mark, I want to spend as little time as possible watching an hourglass turn on a computer screen.
KGoodwin 4.71
I wonder how long it will be before there is GPU-based preprocessing and stacking in PixInsight. I'm betting that is going to massively speed things up at some point. Maybe there are other tools that have it already; I don't know.
Alan_Brunelle
Timothy Martin:
Kyle Goodwin:
not feeling slow enough that I'd want to drop $15k+ on a computer that will depreciate to 0 in 5 years.


Five years is a long time to me. I'll be 70 in five years. Money, I've got. Time is another matter. Saving a few minutes here and there may not seem like much, but it adds up. As I get closer to the septuagenarian mark, I want to spend as little time as possible watching an hourglass turn on a computer screen.

The impression I get is that gains in processing times with the latest or most powerful computers come with rapidly diminishing returns on investment, be it processor, RAM, etc. Given your workflow and multiple projects, I wonder if running multiple somewhat more modest machines at the same time might be more productive than one expensive processing beast for the same cost? I'm not saying use crap machines. Also, each machine could be tailored for data with larger or smaller sub files.
mish 0.00
Hi Tim,

Sounds like you are putting together a beast.

My short answer: if money really is no object, I'd use Lambda Labs and look at an RTX 6000 build.


Longer answer: I build ML/AI models for a living and have a less powerful machine than this (2x4090 but only 128GB RAM and a 14-something-something CPU). You need an awful lot of GPU VRAM when training models but very little when running them; that's down to all the things you have to track during training.


There are also some weird things about the build they have recommended, most notably that if you are in the US it will probably trip your breakers a lot. A 4090 can pull over 450W and a Threadripper can do something similar - this gets you very close to what a domestic circuit in the US can carry. If you have a nice audio system on the same circuit, you will trip the breaker. An extremely high-end system like this can actually be hard to use/maintain. An RTX 6000 pulls 300W and has double the VRAM of the 4090.
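Rough numbers behind that warning, as a back-of-the-envelope sketch (the component wattages are my own ballpark assumptions, not anything from the Puget quote):

```python
# Back-of-the-envelope peak draw for a dual-4090 Threadripper box.
# All wattages below are rough assumptions for illustration only.
gpus = 2 * 450             # RTX 4090 board power, watts each
cpu = 350                  # high-core-count Threadripper under full load
rest = 150                 # motherboard, RAM, drives, fans, losses
total = gpus + cpu + rest  # ~1400 W at full tilt

circuit = 15 * 120                 # common US 15 A / 120 V branch circuit = 1800 W
continuous_budget = 0.8 * circuit  # ~1440 W under the usual 80% continuous-load rule

print(f"estimated peak draw: {total} W")
print(f"continuous budget:   {continuous_budget:.0f} W")
```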


TensorFlow will have no issues with 2 GPUs.


I don't know enough about how WBPP works to say for sure, but it may be that it doesn't make sense to put it on a GPU - some sequential algorithms just can't take advantage of what GPUs are good at.


Having said all that, the last thing I'd say is that a big part of the time taken to process something in e.g. BlurX is loading the model. For a 4080 it may take 12s to process an image; for 2x4090 it may even take longer (it depends a lot on your operating system, but loading a model onto two GPUs and running it can be weird).
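If you want to see how much of the wall-clock time is model loading versus actual inference, a quick timing harness is enough (the file name and tile shape below are hypothetical placeholders, not the actual XTerminator files):

```python
import time
import numpy as np
import tensorflow as tf

t0 = time.perf_counter()
# Hypothetical file name -- point this at whatever saved model you want to test.
model = tf.keras.models.load_model("some_denoise_model.h5")
t1 = time.perf_counter()

# Dummy tile; the shape is an assumption and must match the model's input.
tile = np.random.rand(1, 512, 512, 3).astype("float32")

model.predict(tile)              # first call also pays for kernel/graph warm-up
t2 = time.perf_counter()
model.predict(tile)              # steady-state inference
t3 = time.perf_counter()

print(f"model load: {t1 - t0:.1f} s")
print(f"first run:  {t2 - t1:.1f} s")
print(f"warm run:   {t3 - t2:.1f} s")
```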


My personal recommendation would be to get one GPU, either a 4080/4090 or an RTX 6000 (important: not an A6000, that's an older model).
jrista 11.18
Understand that a SINGLE 4090 is a MASSIVELY POWERFUL device. It is radically more capable than a single 3090/Ti: far more cores on every level, much faster clock rates, and a huge L2 cache (compared to the 3090, at least, which had a tiny one). It delivers over double the TFLOPS of the 3090/Ti. Remember that these things are designed to render 4k or 8k frames using full-blown ray tracing (path tracing) at reasonable frame rates thanks to DLSS3 AI. The amount of effort that goes into rendering every pixel, 60 times a second, is immense.

I would at the very least just start out with one and see how it goes, as it should go pretty darn well. I've had a 3090 for quite some time, and it is a very powerful device, but the 4090 runs circles around it. It doesn't break a sweat on much of anything until you enable ray tracing; then the power of the 4090 really kicks in (for games, at least). For GPGPU kinds of processing, I doubt you'll even use a lot of the most powerful aspects the chip has to offer (which are largely bound up in the ray-tracing capabilities). A lot of the capabilities are also tied up in the DLSS3 AI, and I don't know how useful that might be for astro processing.

For stacking performance…do you use a PCIe NVMe drive? I suspect that reducing "disk" latency and increasing read and write speed would offer greater performance benefits for stacking than packing a monstrous amount of RAM into the system would. The biggest issue with stacking is raw data throughput: not through system memory, but from much slower storage into system memory, and then out of system memory again (once a pixel stack is done, you don't need that data in memory anymore). With the number of pixels in our images these days, I don't think you could parallelize enough to use 768GB of memory… You would likely be bottlenecked by parallelism limits and source data throughput limits.

You could look at PCIe 5 NVMe drives, if you can find some really fast ones. Benchmarks on those indicate they are hitting over 12GB/s throughput. RAM disks a few years ago were hitting 7GB/s or so, but even PCIe 4 is that fast now. PCIe 5 is 12GB/s or so, with the theoretical max around 14GB/s…and that could help cut down that key bottleneck. I would think that getting the fastest physical storage device you can would be a better investment than 768GB of RAM, for stacking performance.
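If anyone wants to sanity-check whether storage really is their stacking bottleneck, a crude sequential-read test is enough to compare against the figures above (the file path is just a placeholder; use a multi-GB file on the drive under test):

```python
import time

CHUNK = 64 * 1024 * 1024  # read in 64 MiB chunks

def sequential_read_gbps(path: str) -> float:
    """Read a large file front to back and return sustained throughput in GB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total / (time.perf_counter() - start) / 1e9

# Placeholder path. Repeat runs are inflated by the OS page cache,
# so use a fresh file (or reboot) for an honest number.
print(f"{sequential_read_gbps('/data/test/large_master.xisf'):.2f} GB/s")
```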
Alan_Brunelle
I find this thread interesting and will continue to see what the experts here have to say on the matter. My current understanding (limited by my lack of the necessary computer background at this level) is that PI just does not support GPU processing as a general rule. Yes, there are some functions within PI that do make use of GPU processing with AI applied, but mostly this is at the level of post-processing the stack. And preprocessing is where a big chunk of dead time is encountered. And while preprocessing, you really cannot work on anything else, unless you have another computer. Some of these GPU-assisted functions can allow for batch processing, if one chooses to apply a process to individual subs. Not sure I have seen anyone actually do this, however.

Stacking and all that this entails appears to still be a linear process, with PI's computational demands handled by the CPU and whatever cores are available. So my 5-year-old Intel machine clearly uses only the CPU cores during preprocessing; the GPU spends its time dealing with my video output, which does exactly nothing during that processing. I recall reading on the PI site some years ago that the makers of PI are aware of the advantages of GPU processing and are looking into how they can implement it, but that at this time they do not see implementing GPU assist broadly. So I will monitor this thread to try to understand if that changes in the coming months. I certainly hope it does. It is my understanding that adopting broad GPU assist within PI would require licensing with the GPU makers, and that means dealing with trying to satisfy all of us users' needs and desires regarding GPU preferences. So multiple licenses. I know that when my Nvidia GPU driver updates, 95% of the update notes are about inclusion of new games into their architecture. That would likely be required of PI as well, and that would cost dollars. There is a reason Nvidia is the largest company on the planet now! $$$ PI is a small player compared to the game industry. So unless all the GPU manufacturers give PI a pass, I would not hold my breath. Understand that the use of my GPU by BXT, etc. is a hack.

As it stands currently, I can see that my CPU's 16 cores are all involved during calibration, cosmetic correction, debayer, registration, local normalization, etc. This is evident during the file writing that goes on: one can easily see that PI feeds 16 image files to the CPU (assuming each addresses a chunk of RAM as well) and the cores crunch away independently on each file. During the first tranche of 16 images, the output can be seen being saved, and the cores do not all generate output at the same time, nor in sequence. This may be due to file differences, effort to complete the job, and/or the amount of power allowed by my overclock settings to each core. So at this time, a processor with more cores would seem to be the better option. I am not sure if an Intel "core" is faster than an AMD "core", so that might actually be a factor, given the option for more cores, etc. The use of RAM is probably limited by the number of cores that can be used simultaneously, and PI seems to be designed to use all the cores at its disposal. Perhaps those with 96 or more cores can speak up as to whether they can see all their cores being active simultaneously? PI uses swap files, and for me, getting better performance with what I have has to do with getting the best swap file configuration (locations, numbers, etc.). I was able to get my metrics to the top of the reported benchmarks when I did that. During my first year, my benchmarks were exceeded by some of the Threadripper machines, but not by huge amounts. I chose to build my machine conservatively and not buy the most expensive CPU, motherboard, etc. It paid off, because it performed nearly as well as the newest machines for half the cost (under $1500). To reiterate, buying the top/latest PC gear is only incrementally better in performance, but at a huge premium in cost. And when two such machines go obsolete 5 years down the line, anyone looking at them side-by-side would probably not even think they were different!
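That "16 files in flight on 16 cores" behaviour is classic per-frame parallelism. This is not how PI is actually implemented internally, just a sketch of why this stage scales with core count; calibrate_sub is a stand-in, not a real PI call:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def calibrate_sub(path: Path) -> Path:
    """Stand-in for the per-frame work: calibrate, debayer, register one sub."""
    # ... load frame, subtract dark, divide flat, write the result ...
    return path.with_suffix(".calibrated.xisf")

if __name__ == "__main__":
    subs = sorted(Path("lights").glob("*.fit"))  # hypothetical input folder

    # One worker per core, each crunching one file at a time -- the same
    # pattern described above. Throughput scales with cores until storage
    # or memory becomes the limit, since every frame is independent.
    with ProcessPoolExecutor() as pool:
        for out in pool.map(calibrate_sub, subs):
            print("wrote", out)
```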

I find it a bit amusing to hear people talk of using the latest super GPUs to accelerate their jobs.  For me, doing a BXT run on my old 2060 gets the job done in under a minute.  NXT in under 20 seconds on my 1.2 GB files.  Waiting to hear how much faster the super GPUs work, and why the saving of even 20 sec (meaning a process time of 0 sec) matters to someone…  For me, when using BXT, I get to read an email or two, maybe…!

I know of no way to run PI in multiple instances on one machine.  At least in any way that would be considered efficient.  If it could do that, then maybe a monster PC would be worth it.  Hence my suggestion above that it might be more productive to have multiple decent (but not monster) machines to allow for simultaneous processing.

So at this time, I do believe that a processor with more cores and threads is beneficial (with the caveat of how well the cores are used by PI), along with sufficient RAM to work with those cores. For RAM, I upped my amount. It made no difference, because the amount I had was already sufficient. So before just pumping RAM into the machine, understand how it is being utilized by PI. One thing I did with my extra RAM was put some of my swap files onto a RAM disk. This certainly improved the benchmarks of my machine. However, it also seemed to generate some instabilities when using PI, so I reverted to using my NVMe SSD for that; the time savings were not worth it, and the overall improvement was incremental. Fast NVMe drives for the swap files are also a benefit, for the read/write-time reasons stated above. However, just choosing the fastest NVMe drives comes with caveats. Please read up on the lifetimes of these drives and what impacts them. The fastest drives have heavily layered architectures, and this limits the number of write cycles per location on the drive. This is typically not an issue for most computer users, but PI seems to put particular demand on such drives during processing. Because of where I live, I am certainly not a power user of my processing computer, but I can easily fill my 1 TB drive doing the preprocessing of just a few of my subjects, and during preprocessing PI is writing and re-writing a good number of the files generated along the way. All things considered, if I were a power user, I would still opt for the fastest drive possible, use an NVMe drive for processing that is separate from my boot drive, and then just replace it after a while if testing showed it to be decaying in ability. These tend to not be that expensive.

Please let me know if PI ever gets on the GPU bandwagon.
smcx 3.61
To be honest, the biggest time saver is to use something other than pixinsight. 

¯\_(ツ)_/¯
KGoodwin 4.71
Your assumptions about licensing for GPU usage are incorrect. The GPU compute architecture for NVIDIA (CUDA), as well as AMD's equivalents and standardized ones like OpenCL and DirectCompute, are freely usable by any developer. There is no need for "inclusion in a driver" either. Games show up there because NVIDIA sometimes adds optimizations for certain games, not because you have to pay NVIDIA to be able to use the GPU.
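For what it's worth, this is easy to verify from the developer side: anyone can ship GPU code through freely installable CUDA-backed libraries with no agreement with NVIDIA. A trivial sketch, assuming CuPy and an NVIDIA card with the CUDA runtime are installed:

```python
import numpy as np
import cupy as cp  # pip-installable CUDA array library; no NVIDIA agreement needed

# Move data to the GPU, compute there, copy the result back to host memory.
a = cp.asarray(np.random.rand(4096, 4096).astype(np.float32))
b = cp.asarray(np.random.rand(4096, 4096).astype(np.float32))
c = cp.matmul(a, b)        # runs on the CUDA device
result = cp.asnumpy(c)     # back to a plain NumPy array

print(result.shape, result.dtype)
```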
Alan_Brunelle
Kyle Goodwin:
Your assumptions about licensing for GPU usage are incorrect. The GPU compute architecture for NVidia (CUDA) as well as similar systems for AMD as well as the standardized one like OpenCL and DirectCompute are freely usable by any developer. There is no need for “inclusion in a driver” either. Games show up there because they add optimizations for certain games sometimes, but not because you have to pay NVidia to be able to use the GPU.

Maybe I am incorrect. But my words reflect what I have read on the PI forums from "well-known" people there. I certainly could believe that Nvidia might provide free support to game companies to add value to their hardware. On the other hand, pressure on game companies to pay for such attention seems not only logical but also the way the world works. So if you absolutely know that you are correct, i.e. you work in the management of these companies and do the negotiations, then fine, but at this point I would not bet on that. I know that you did not mean that I said drivers are needed or asked for. If my memory serves, my Nvidia updates typically state "Now includes support for..." or "Now with updated support for...", followed by a very long list in small print of the companies and games newly supported.
KGoodwin 4.71
Alan Brunelle:
Kyle Goodwin:
Your assumptions about licensing for GPU usage are incorrect. The GPU compute architecture for NVidia (CUDA) as well as similar systems for AMD as well as the standardized one like OpenCL and DirectCompute are freely usable by any developer. There is no need for “inclusion in a driver” either. Games show up there because they add optimizations for certain games sometimes, but not because you have to pay NVidia to be able to use the GPU.

Maybe I am incorrect.  But my words reflect those of what I read on the PI forums by "well-known" people there .  I certainly could believe that Nvidia might provide free support to game companies to add value to their hardware.  On the other hand, the pressure for game companies to pay for such attention is not only logical or life-sustaining, but seems to be the way the world works.  So if you absolutely know that you are correct, i.e. work in the management of these companies that do the negotiations, then fine, but at this point, I would not bet on that.  I know that you did not mean that I said that drivers are needed or asked for.  If my memory serves me, my Nvidia updates typically state that "Now includes support for...  or Now with updated support for...  and then followed by a very long list in small print of the companies and games newly supported.

I run a video streaming company. We’re a vendor to all the major cable and telephone companies. We use GPUs in our products for video processing. I’m certain I understand the license requirements surrounding developing applications with them.
AccidentalAstronomers 18.64
Topic starter
Alan Brunelle:
For me, doing a BXT run on my old 2060 gets the job done in under a minute.


Given that I have two scopes using IMX455 sensors--both drizzled to 2X--and two scopes using the IMX461, it takes me a bit longer than that even with the 3090. The real time suck is SXT. And on more complex images, I may need to run it five times (once each for L, the combined RGB, a clone of R used for continuum subtraction, Ha, and a clone of RGB to actually produce the stars). It takes around 2 to 3 minutes each. So let's say I cut that time to one minute on average (maybe that's ambitious); that's conservatively 5 minutes saved processing a single image. I typically process each image around five times as I gather data on a target, so that's now a savings of 25 minutes per image. With four scopes cranking every clear night, I should produce about 200 images per year--that's 80+ hours saved per year, not accounting for any time saved on BXT and NXT. It would also smooth out the workflow a bit and make it that much more enjoyable.
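Spelled out, with the one-minute-per-run saving as the assumption:

```python
# Rough version of the estimate above.
runs_per_pass = 5          # L, combined RGB, R clone (continuum), Ha, RGB clone (stars)
minutes_saved_per_run = 1
passes_per_image = 5
images_per_year = 200

hours_saved = runs_per_pass * minutes_saved_per_run * passes_per_image * images_per_year / 60
print(f"{hours_saved:.0f} hours per year")   # ~83, before any BXT/NXT savings
```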

I use APP exclusively to stack. So while one image is stacking, I can at least work on another one in PI. If I were to stick with 768GB of RAM, I'm intrigued by the idea of stacking on a RAM disk. Time for stacking is indeed a big concern, but it's secondary to speeding up things in PI.
Alan_Brunelle
Kyle Goodwin:
Alan Brunelle:
Kyle Goodwin:
Your assumptions about licensing for GPU usage are incorrect. The GPU compute architecture for NVidia (CUDA) as well as similar systems for AMD as well as the standardized one like OpenCL and DirectCompute are freely usable by any developer. There is no need for “inclusion in a driver” either. Games show up there because they add optimizations for certain games sometimes, but not because you have to pay NVidia to be able to use the GPU.

Maybe I am incorrect.  But my words reflect those of what I read on the PI forums by "well-known" people there .  I certainly could believe that Nvidia might provide free support to game companies to add value to their hardware.  On the other hand, the pressure for game companies to pay for such attention is not only logical or life-sustaining, but seems to be the way the world works.  So if you absolutely know that you are correct, i.e. work in the management of these companies that do the negotiations, then fine, but at this point, I would not bet on that.  I know that you did not mean that I said that drivers are needed or asked for.  If my memory serves me, my Nvidia updates typically state that "Now includes support for...  or Now with updated support for...  and then followed by a very long list in small print of the companies and games newly supported.

I run a video streaming company. We’re a vendor to all the major cable and telephone companies. We use GPUs in our products for video processing. I’m certain I understand the license requirements surrounding developing applications with them.

So it is your company's own video processing software that you use to process video? Then that licensing would be the responsibility of the software vendor, with the cost built into the cost of the software you use. Just wanting to be sure of the facts; not that it is all that critical to the conversation here.

After doing a quick search online, it is very clear that Nvidia does limit its licenses for certain tasks. It can be especially restrictive for commercial applications, and seems much less so for research and development purposes. Having had to sign up for the developer newsletters to download CUDA, etc. for using BXT, it's clear that Nvidia wants as many developers as possible working on their systems, so that part is free, indeed. The question still remains whether Nvidia is paid a license fee by commercial developers, especially ones that get the special attention where Nvidia posts that their upgraded firmware now supports a particular game... The fact that Nvidia advertises a "free" Inception program for startup companies suggests that fees are otherwise involved beyond startup. That they also appear to offer a number of different partner-level programs suggests so as well, unless these partnerships are all one-way.

Maybe to your point, I too have some doubt that GPU licensing is the hurdle keeping PI from adopting GPU acceleration more generally. For PI, maybe the bigger issue is really dealing with the technical issues that are specific to and different for each graphics card manufacturer.
 